--- title: "Transform and Validate JSON and NDJSON" author: - name: Martin Morgan affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY, US - name: Marcel Ramos affiliation: CUNY School of Public Health at Hunter College, New York, NY, US output: BiocStyle::html_document vignette: | %\VignetteIndexEntry{Transform and Validate JSON and NDJSON} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE ) ``` # Introduction & installation Use [rjsoncons][] for querying, transforming, and searching JSON, NDJSON, or R objects using [JMESpath][], [JSONpath][], or [JSONpointer][]. [rjsoncons][] supports [JSON patch][] for document editing, and [JSON schema][] validation. Link to the package for direct access to additional features in the [jsoncons][] C++ library. [jsoncons]: https://github.com/danielaparker/jsoncons/ [rjsoncons]: https://mtmorgan.github.io/rjsoncons/ [JSONpath]: https://goessner.net/articles/JsonPath/ [JMESpath]: https://jmespath.org/ [JSONpointer]: https://datatracker.ietf.org/doc/html/rfc6901 [JSON patch]: https://jsonpatch.com/ [JSON schema]: https://json-schema.org/ Install the released package version from CRAN ```{r install, eval = FALSE} install.packages("rjsoncons", repos = "https://CRAN.R-project.org") ``` Install the development version with ```{r install_github, eval = FALSE} if (!requireNamespace("remotes", quiety = TRUE)) install.packages("remotes", repos = "https://CRAN.R-project.org") remotes::install_github("mtmorgan/rjsoncons") ``` Attach the installed package to your *R* session, and check the version of the C++ library in use ```{r library, messages = FALSE} library(rjsoncons) rjsoncons::version() ``` # Query and pivot Functions in this package work on JSON or NDJSON character vectors, file paths and URLs to JSON or NDJSON documents, and *R* objects that can be transformed to a JSON string. ## Select, filter and transform with `j_query()` Here is a simple JSON example document ```{r json_example} json <- '{ "locations": [ {"name": "Seattle", "state": "WA"}, {"name": "New York", "state": "NY"}, {"name": "Bellevue", "state": "WA"}, {"name": "Olympia", "state": "WA"} ] }' ``` There are several common use cases. Use [rjsoncons][] to query the JSON string using [JSONpath][], [JMESpath][] or [JSONpointer][] syntax to filter larger documents to records of interest, e.g., only cities in New York state, using 'JMESpath' syntax. ```{r j_query} j_query(json, "locations[?state == 'NY']") |> cat("\n") ``` Use the `as = "R"` argument to extract deeply nested elements as *R* objects, e.g., a character vector of city names in Washington state. ```{r as_arg} j_query(json, "locations[?state == 'WA'].name", as = "R") ``` The JSON Pointer specification is simpler, indexing a single object in the document. JSON arrays are 0-based. ```{r jsonpointer} j_query(json, "/locations/0/state") ``` The examples above use `j_query()`, which automatically infers query specification from the form of `path` using `j_path_type()`. It may be useful to indicate query specification more explicitly using `jsonpointer()`, `jsonpath()`, or `jmespath()`; examples illustrating features available for each query specification are on the help pages `?jsonpointer`, `?jsonpath`, and `?jmespath`. ## Array-of-objects to *R* data.frame with `j_pivot()` The following transforms a nested JSON document into a format that can be incorporated directly in *R* as a `data.frame`. ```{r array_of_objects} path <- '{ name: locations[].name, state: locations[].state }' j_query(json, path, as = "R") |> data.frame() ``` The transformation from JSON 'array-of-objects' to 'object-of-arrays' suitable for direct representation as a `data.frame` is common, and is implemented directly as `j_pivot()` ```{r j_pivot} j_pivot(json, "locations", as = "data.frame") ``` `j_pivot()` also support `as = "tibble"` when the [dplyr][] package is installed. [purrr]: https://CRAN.R-project.org/package=purrr [dplyr]: https://CRAN.R-project.org/package=dplyr [tidyr]: https://CRAN.R-project.org/package=tidyr [r4ds]: https://r4ds.hadley.nz/ [hierarchical data]: https://r4ds.hadley.nz/rectangling ## NDJSON support [rjsoncons][] supports [NDJSON][] (new-line delimited JSON). NDJSON consists of a file or character vector where each line / element represents a JSON record. This example uses data from the [GitHub Archive][] project recording all actions on public GitHub repositories. The data included in the package are the first 10 lines of . ```{r} ndjson_file <- system.file(package = "rjsoncons", "extdata", "2023-02-08-0.json") ``` NDJSON can be read into R (`ndjson <- readLines(ndjson_file)`) and used in `j_query()` / `j_pivot()`, but it is often better to leave full NDJSON files on disk. Thus the first argument to `j_query()` or `j_pivot()` is usually a (text or gz-compressed) file path or URL. Two additional options are available when working with NDJSON. `n_records` limits the number of records processed. Using `n_records` can be very useful when exploring the data. For instance, the first record of a file can be viewed interactively with ```{r ndjson_listviewer, eval = FALSE} j_query(ndjson_file, n_records = 1) |> listviewer::jsonedit() ``` The option `verbose = TRUE` adds a progress indicator, which provides confidence that progress is being made while parsing large files. The progress bar requires the [cli][] package. `j_query()` provides a one-to-one mapping of NDJSON lines / elements to the return value, e.g., `j_query(ndjson_file, "@", as = "string")` on an NDJSON file with 1000 lines will return a character vector of 1000 elements, or with `j_query(ndjson, "@", as = "R")` an *R* list with length 1000. ```{r ndjson_j_query} j_query(ndjson_file, "{id: id, type: type}", n_records = 5) ``` `j_pivot()` transforms an NDJSON file or character vector of objects into a format convenient for input in *R*. `j_pivot()` with NDJSON files and JMESpath paths work particularly well together, because JMESpath provides flexibility in creating JSON objects to be pivoted. ```{r ndjson_j_pivot} j_pivot(ndjson_file, "{id: id, type: type}", as = "data.frame") ``` Filtering NDJSON files can require relatively more complicated paths, e.g., to filter 'PushEvent' types from organizations, construct a query that acts on each NDJSON record to return an array of a single object, then apply a filter to replace uninteresting elements with 0-length arrays (using `as = "tibble"` often transforms the *R* list-of-vectors to a tibble in a more pleasing and robust manner compared to `as = "data.frame"`). ```{r ndjson_j_pivot_filter} path <- "[{id: id, type: type, org: org}] [?@.type == 'PushEvent' && @.org != null] | [0]" j_pivot(ndjson_file, path, as = "data.frame") ``` A more complete example is used in the [NDJSON extended vignette][ndjson-extended] [NDJSON]: https://ndjson.org/ [GitHub Archive]: https://www.gharchive.org/ [ndjson-extended]: https://mtmorgan.github.io/rjsoncons/articles/b_ndjson_extended.html [cli]: https://CRAN.R-project.org/package=cli ## *R* objects as input [rjsoncons][] can filter and transform _R_ objects. These are converted to JSON using `jsonlite::toJSON()` before queries are made; `toJSON()` arguments like `auto_unbox = TRUE` can be added to the function call. ```{r r_list} ## `lst` is an *R* list lst <- jsonlite::fromJSON(json, simplifyVector = FALSE) j_query(lst, "locations[?state == 'WA'].name | sort(@)", auto_unbox = TRUE) |> cat("\n") ``` # Patch [JSON Patch][] provides a simple way to edit or transform a JSON document using JSON commands. ## Applying a patch with `j_patch_apply()` Starting with the JSON document ```{r} json <- '{ "biscuits": [ { "name": "Digestive" }, { "name": "Choco Leibniz" } ] }' ``` one can `"add"` another biscuit, and copy a favorite biscuit to a new locations using the following patch ```{r} patch <- '[ {"op": "add", "path": "/biscuits/1", "value": { "name": "Ginger Nut" }}, {"op": "copy", "from": "/biscuits/2", "path": "/best_biscuit"} ]' ``` The paths are specified using [JSONpointer][] notation; remember that JSON arrays are 0-based, compared to 1-based *R* arrays. Applying the patch results in a new JSON document. ```{r} j_patch_apply(json, patch) ``` Patches can also be created from *R* objects with the helper function `j_patch_op()`. ```{r} ops <- c( j_patch_op( "add", "/biscuits/1", value = list(name = "Ginger Nut"), auto_unbox = TRUE ), j_patch_op("copy", "/best_biscuit", from = "/biscuits/2") ) identical(j_patch_apply(json, patch), j_patch_apply(json, ops)) ``` `j_patch_op()` takes care of unboxing `op=`, `path=`, and `from=`, but some care must be taken in 'unboxing' the `value=` argument for operations such as 'add'; it may also be appropriate to unbox only specific fields, e.g., ```{r} value <- list(name = jsonlite::unbox("Ginger Nut")) j_patch_op("add", "/biscuits/1", value = value) ``` From the [JSON patch][] web site, available operations and example JSON are: - `add` -- add elements to an existing document. ``` {"op": "add", "path": "/biscuits/1", "value": {"name": "Ginger Nut"}} ``` - `remove` -- remove elements from a document. ``` {"op": "remove", "path": "/biscuits/0"} ``` - `replace` -- replace one element with another ``` { "op": "replace", "path": "/biscuits/0/name", "value": "Chocolate Digestive" } ``` - `copy` -- copy a path to another location. ``` {"op": "copy", "from": "/biscuits/0", "path": "/best_biscuit"} ``` - `move` -- move a path to another location. ``` {"op": "move", "from": "/biscuits", "path": "/cookies"} ``` - `test` -- test for the existence of a path; if the path does not exist, do not apply any of the patch. ``` {"op": "test", "path": "/best_biscuit/name", "value": "Choco Leibniz"} ``` Formal description of these operations is provided in Section 4 of [RFC6902][]. A patch command is *always* an array, even when a single operation is involved. ## Difference between documents with `j_patch_from()` The `j_patch_from()` function constructs a patch from the difference between two documents ```{r} j_patch_from(j_patch_apply(json, patch), json) ``` [JSON Patch]: https://jsonpatch.com/ [RFC6902]: https://datatracker.ietf.org/doc/html/rfc6902/#section-4 # Schema validation [JSON schema][] provides structure to JSON documents. `j_schema_is_valid()` checks that a JSON document is valid against a specified schema, and `j_schema_validate()` tries to illustrate how a document deviates from the schema. As an example consider `j_patch_op()`, where the operation is supposed to conform to the [JSON patch][] schema. For convenience, a copy of this schema is available in [rjsoncons][]. ```{r} ## alternatively: schema <- "https://json.schemastore.org/json-patch" schema <- system.file(package = "rjsoncons", "extdata", "json-patch.json") cat(readLines(schema), sep = "\n") ``` The well-formed 'op' is valid, and `j_schema_validate()` produces no output ```{r valid-schema} op <- '[{ "op": "add", "path": "/biscuits/1", "value": { "name": "Ginger Nut" } }]' j_schema_is_valid(op, schema) j_schema_validate(op, schema) ``` Introduce an invalid 'op', `"op": "invalid_op"`, and the schema is no longer valid. ```{r invalid-schema} op <- '[{ "op": "invalid_op", "path": "/biscuits/1", "value": { "name": "Ginger Nut" } }]' j_schema_is_valid(op, schema) ``` The reason can be understood from (careful!) consideration of the output of `j_schema_validate()`, with reference to the schema itself. ```{r invalid-schema-tibble} j_schema_validate(op, schema, as = "tibble") |> tibble::glimpse() ``` The validation indicates that the schema `evaluationPath` '/items/oneOf' is not satisfied, because of the `error` 'No schema [i.e., 'oneOf' elements] matched, ...'. The 'details' column summarizes why each of the 3 elements of `/items/oneOf` fails the schema specification; use `as = "details"` to extract this directly ```{r invalid-schema-details} j_schema_validate(op, schema, as = "details") |> tibble::glimpse() ``` This indicates that the first item in the schema is rejected because 'invalid_op' is not a valid enum ```{r invalid-schema-0} j_query(schema, "/items/oneOf/0/properties/op/enum") |> noquote() ``` Reasons for rejecting other items can be explored using similar steps. # Flatten and find It can sometimes be helpful to explore JSON documents by 'flattening' the JSON to an object of path / value pairs, where the path is the [JSONpointer][] path to the corresponding value. It is then straight-forward to search this flattened object for, e.g., the path to a known field or value. As an example, consider the object ```{r} codes <- '{ "discards": { "1000": "Record does not exist", "1004": "Queue limit exceeded", "1010": "Discarding timed-out partial msg" }, "warnings": { "0": "Phone number missing country code", "1": "State code missing", "2": "Zip code missing" } }' ``` The 'flat' JSON of this can be represented as named list (using `str()` to provide a compact visual representation) ```{r} j_flatten(codes, as = "R") |> str() ``` The names of the list are JSONpointer (default) or JSONpath, so can be used in `j_query()` and `j_pivot()` as appropriate ```{r} j_query(codes, "/discards/1010") ``` There are two ways to find known keys and values. The first is to use exact matching to one or more keys or values, e.g., ```{r} j_find_values( codes, c("Record does not exist", "State code missing"), as = "tibble" ) j_find_keys(codes, "warnings", as = "tibble") ``` It is also possible to match using a regular expression. ```{r} j_find_values_grep(codes, "missing", as = "tibble") j_find_keys_grep(codes, "card.*/100", as = "tibble") # span key delimiters ``` Keys are always character vectors, but values can be of different type; `j_find_values()` supports searches on these. ```{r} j <- '{"x":[1,[2, 3]],"y":{"a":4}}' j_flatten(j, as = "R") |> str() j_find_values(j, c(2, 4), as = "tibble") ``` A common operation might be to find the path to a know value, and then to query the original JSON to find the object in which the value is contained. ```{r} j_find_values(j, 3, as = "tibble") ## path to '3' is '/x/1/1', so containing object is at '/x/1' j_query(j, "/x/1") j_query(j, "/x/1", as = "R") ``` Both JSONpointer and JSONpath are supported; an advantage of the latter is that the path distinguishes between integer-valued (unquoted) and string-valued (quoted) keys ```{r} j_find_values(j, 3, as = "tibble", path_type = "JSONpath") ``` The first argument to `j_find_*()` can be an *R* object, JSON or NDJSON string, file, or URL. Using `j_find_values()` with an *R* object and JSONpath `path_type` leads to a path that is easily converted into an *R* index: double the `[` and `]` in the path and increment each numerical index by 1: ```{r} l <- j |> as_r() j_find_values(l, 3, auto_unbox = TRUE, path_type = "JSONpath", as = "tibble") l[['x']][[2]] # siblings ``` NDJSON files are flattened into character vectors, with each element the flattened version of the corresponding NDJSON record. # The JSON parser The package includes a JSON parser, used with the argument `as = "R"` or directly with `as_r()` ``` r as_r('{"a": 1.0, "b": [2, 3, 4]}') |> str() #> List of 2 #> $ a: num 1 #> $ b: int [1:3] 2 3 4 ``` The main rules of this transformation are outlined here. JSON arrays of a single type (boolean, integer, double, string) are transformed to *R* vectors of the same length and corresponding type. ```{r as_r} as_r('[true, false, true]') # boolean -> logical as_r('[1, 2, 3]') # integer -> integer as_r('[1.0, 2.0, 3.0]') # double -> numeric as_r('["a", "b", "c"]') # string -> character ``` JSON arrays mixing integer and double values are transformed to *R* numeric vectors. ```{r as_r_integer_numeric} as_r('[1, 2.0]') |> class() # numeric ``` If a JSON integer array contains a value larger than *R*'s 32-bit integer representation, the array is transformed to an *R* numeric vector. NOTE that this results in loss of precision for JSON integer values greater than `2^53`. ```{r as_r_64_bit} as_r('[1, 2147483648]') |> class() # 64-bit integers -> numeric ``` JSON objects are transformed to *R* named lists. ```{r as_r_objects} as_r('{}') as_r('{"a": 1.0, "b": [2, 3, 4]}') |> str() ``` There are several additional details. A JSON scalar and a JSON vector of length 1 are represented in the same way in *R*. ```{r as_r_scalars} identical(as_r("3.14"), as_r("[3.14]")) ``` JSON arrays mixing types other than integer and double are transformed to *R* lists ```{r as_r_mixed_arrays} as_r('[true, 1, "a"]') |> str() ``` JSON `null` values are represented as *R* `NULL` values; arrays of `null` are transformed to lists ```{r as_r_null} as_r('null') # NULL as_r('[null]') |> str() # list(NULL) as_r('[null, null]') |> str() # list(NULL, NULL) ``` Ordering of object members is controlled by the `object_names=` argument. The default preserves names as they appear in the JSON definition; use `"sort"` to sort names alphabetically. This argument is applied recursively. ```{r as_r_field_order} json <- '{"b": 1, "a": {"d": 2, "c": 3}}' as_r(json) |> str() as_r(json, object_names = "sort") |> str() ``` The parser corresponds approximately to `jsonlite::fromJSON()` with arguments `simplifyVector = TRUE, simplifyDataFrame = FALSE, simplifyMatrix = FALSE)`. Unit tests (using the [tinytest][] framework) providing additional details are available at ```{r as_r_tiny_test_source, eval = FALSE} system.file(package = "rjsoncons", "tinytest", "test_as_r.R") ``` [tinytest]: https://CRAN.R-project.org/package=tinytest ## Using `jsonlite::fromJSON()` The built-in parser can be replaced by alternative parsers by returning the query as a JSON string, e.g., using the `fromJSON()` in the [jsonlite][] package. ```{r jsonlite_fromJSON} json <- '{ "locations": [ {"name": "Seattle", "state": "WA"}, {"name": "New York", "state": "NY"}, {"name": "Bellevue", "state": "WA"}, {"name": "Olympia", "state": "WA"} ] }' j_query(json, "locations[?state == 'WA']") |> ## `fromJSON()` simplifies list-of-objects to data.frame jsonlite::fromJSON() ``` [jsonlite]: https://CRAN.R-project.org/package=jsonlite The [rjsoncons][] package is particularly useful when accessing elements that might otherwise require complicated application of nested `lapply()`, [purrr][] expressions, or [tidyr][] `unnest_*()` (see [R for Data Science][r4ds] chapter '[Hierarchical data][]'). # C++ library use in other packages The package includes the complete 'jsoncons' C++ header-only library, available to other R packages by adding ``` LinkingTo: rjsoncons SystemRequirements: C++11 ``` to the DESCRIPTION file. Typical use in an R package would also include `LinkingTo:` specifications for the [cpp11][] or [Rcpp][] (this package uses [cpp11][]) packages to provide a C / C++ interface between R and the C++ 'jsoncons' library. [cpp11]: https://cran.r-project.org/package=cpp11 [Rcpp]: https://cran.r-project.org/package=Rcpp # Session information This vignette was compiled using the following software versions ```{r session_info} sessionInfo() ```