Title: | NIH / NCI Genomic Data Commons Access |
---|---|
Description: | Programmatically access the NIH / NCI Genomic Data Commons RESTful service. |
Authors: | Martin Morgan [aut], Sean Davis [aut, cre], Marcel Ramos [ctb] |
Maintainer: | Sean Davis <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.31.0 |
Built: | 2024-10-29 22:18:13 UTC |
Source: | https://github.com/Bioconductor/GenomicDataCommons |
aggregations
aggregations(x)
## S3 method for class 'GDCQuery'
aggregations(x)
## S3 method for class 'GDCResponse'
aggregations(x)
x |
a |
A list of data.frame objects with one member for each requested facet. Each data frame has two columns, key and doc_count.
aggregations(GDCQuery)
:
aggregations(GDCResponse)
:
# Number of each file type
res = files() |> facet(c('type','data_type')) |> aggregations()
res$type
The GDC allows a shorthand for specifying groups of fields to be returned by the metadata queries. These can be specified in a select method call to easily supply groups of fields.
available_expand(entity)
## S3 method for class 'character'
available_expand(entity)
## S3 method for class 'GDCQuery'
available_expand(entity)
entity |
Either a |
A character vector
See https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#expand for details
head(available_expand('files'))
S3 Generic to return all GDC fields
available_fields(x)
## S3 method for class 'GDCQuery'
available_fields(x)
## S3 method for class 'character'
available_fields(x)
x |
A character(1) string ('cases','files','projects','annotations') or a subclass of |
a character vector of the default fields
available_fields(GDCQuery): GDCQuery method
available_fields(character): character method
available_fields('projects')
projQuery = query('projects')
available_fields(projQuery)
Find common values for a GDC field
available_values(entity, field)
entity |
character(1), a GDC entity ("cases", "files", "annotations", "projects") |
field |
character(1), a field that is present in the entity record |
character vector of the top 100 (or fewer) most frequent values for the given field
available_values('files','cases.project.project_id')[1:5]
Provide count of records in a GDCQuery
count(x, ...)
## S3 method for class 'GDCQuery'
count(x, ...)
## S3 method for class 'GDCResponse'
count(x, ...)
x |
a |
... |
passed to httr (good for passing config info, etc.) |
integer(1) representing the count of records that will be returned by the current query
count(GDCQuery)
:
count(GDCResponse)
:
# total number of projects
projects() |> count()
# total number of cases
cases() |> count()
S3 Generic to return default GDC fields
default_fields(x)
## S3 method for class 'character'
default_fields(x)
## S3 method for class 'GDCQuery'
default_fields(x)
x |
A character string ('cases','files','projects','annotations') or a subclass of |
a character vector of the default fields
default_fields(character): character method
default_fields(GDCQuery): GDCQuery method
default_fields('projects')
projQuery = query('projects')
default_fields(projQuery)
An "entity" is simply one of the four metadata endpoints:
cases
projects
files
annotations
All GDCQuery objects will have an entity name. This S3 method is simply a utility accessor for those names.
entity_name(x)
## S3 method for class 'GDCQuery'
entity_name(x)
## S3 method for class 'GDCResults'
entity_name(x)
x |
a |
character(1) name of an associated entity; one of "cases", "files", "projects", "annotations".
qcases = cases()
qprojects = projects()
entity_name(qcases)
entity_name(qprojects)
S3 generic to set GDCQuery expand parameter
expand(x, expand)
## S3 method for class 'GDCQuery'
expand(x, expand)
x |
the objects on which to set fields |
expand |
a character vector specifying the fields |
A GDCQuery object, with the expand member altered.
expand(GDCQuery): set expand fields on a GDCQuery object
gProj = projects()
gProj$fields
head(available_fields(gProj))
default_fields(gProj)
gProj |> select(default_fields(gProj)[1:2]) |> response() |> str(max.level=2)
Set facets for a GDCQuery
Get facets for a GDCQuery
facet(x, facets)
get_facets(x)
## S3 method for class 'GDCQuery'
get_facets(x)
x |
a |
facets |
a character vector of fields that will be used for forming aggregations (facets). Default is to set facets for all default fields. See |
Returns a GDCQuery object, with facets field updated.
# create a new GDCQuery against the projects endpoint
gProj = projects()
# default facets are NULL
get_facets(gProj)
# set facets and save result
gProjFacet = facet(gProj)
# check facets
get_facets(gProjFacet)
# and get a response, noting that
# the aggregations list member contains
# tibbles for each facet
str(response(gProjFacet,size=2),max.level=2)
S3 Generic that returns the field description text, if available
field_description(entity, field)
## S3 method for class 'GDCQuery'
field_description(entity, field)
## S3 method for class 'character'
field_description(entity, field)
entity |
character(1) string ('cases','files','projects','annotations', etc.) or a subclass of |
field |
character(1), the name of the field that will be used to look up the description. |
character(1) descriptive text or character(0) if no description is available.
field_description(GDCQuery): GDCQuery method
field_description(character): character method
field_description('cases', 'annotations.category')
casesQuery = query('cases')
field_description(casesQuery, 'annotations.category')
field_description(cases(), 'annotations.category')
Manipulating GDCQuery filters
filter sets the filter element of a GDCQuery object from a filter expression.
get_filter is simply a safe accessor for the filter element in GDCQuery objects.
filter(x, expr)
## S3 method for class 'GDCQuery'
filter(x, expr)
get_filter(x)
## S3 method for class 'GDCQuery'
get_filter(x)
x |
the object on which to set the filter list member |
expr |
a filter expression in the form of the right hand side of a formula, where bare names (without quotes) are allowed if they are available fields associated with the GDCQuery object, |
A GDCQuery object with the filter field replaced by specified filter expression
# make a GDCQuery object to start
#
# Projects
#
pQuery = projects()
# check for the default fields
# so that we can use one of them to build a filter
default_fields(pQuery)
pQuery = filter(pQuery,~ project_id == 'TCGA-LUAC')
get_filter(pQuery)
#
# Files
#
fQuery = files()
default_fields(fQuery)
fQuery = filter(fQuery,~ data_format == 'VCF')
# OR
# with recent GenomicDataCommons versions:
#   no "~" needed
fQuery = filter(fQuery, data_format == 'VCF')
get_filter(fQuery)
fQuery = filter(fQuery,~ data_format == 'VCF' & experimental_strategy == 'WXS' & type == 'simple_somatic_mutation')
files() |> filter(~ data_format == 'VCF' & experimental_strategy=='WXS' & type == 'simple_somatic_mutation') |> count()
files() |> filter( data_format == 'VCF' & experimental_strategy=='WXS' & type == 'simple_somatic_mutation') |> count()
# Filters may be chained for the
# equivalent query
#
# When chained, filters are combined with logical AND
files() |> filter(~ data_format == 'VCF') |> filter(~ experimental_strategy == 'WXS') |> filter(~ type == 'simple_somatic_mutation') |> count()
# OR
files() |> filter( data_format == 'VCF') |> filter( experimental_strategy == 'WXS') |> filter( type == 'simple_somatic_mutation') |> count()
# Use str() to get a cleaner picture
str(get_filter(fQuery))
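The statement that chained filters are combined with logical AND can be pictured with a small base-R sketch. It assumes the nested op/content list layout used by GDC filter JSON; the helper names eq_filter and and_filters are hypothetical and are not functions of the GenomicDataCommons package, whose internal representation may differ.

```r
# Hypothetical helpers (not package functions): build filter lists by hand,
# assuming the nested op/content layout of GDC filter JSON.
eq_filter <- function(field, value) {
  list(op = "=", content = list(field = field, value = value))
}

# Combine two filters the way chaining is described: with a logical AND.
and_filters <- function(a, b) {
  list(op = "and", content = list(a, b))
}

f1 <- eq_filter("data_format", "VCF")
f2 <- eq_filter("experimental_strategy", "WXS")
combined <- and_filters(f1, f2)

# Inspect the nested structure
str(combined)
```

Comparing this output with str(get_filter(fQuery)) on a real query is a reasonable way to check whether the assumed layout matches what the package actually builds.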
The GenomicDataCommons package will cache downloaded files to minimize network use and allow for offline work. These functions are used to create a cache directory if one does not exist, set a global option, and query that option. The cache directory will default to the user "cache" directory according to specifications in app_dir. However, the user may want to set this to another directory with more or higher performance storage.
gdc_cache()
gdc_set_cache(
  directory = rappdirs::app_dir(appname = "GenomicDataCommons")$cache(),
  verbose = TRUE,
  create_without_asking = !interactive()
)
directory |
character(1) directory path, will be created recursively if not present. |
verbose |
logical(1) whether or not to message the location of the cache directory after creation. |
create_without_asking |
logical(1) specifying whether to allow the function to create the cache directory without asking the user first. In an interactive session, if the cache directory does not exist, the user will be prompted before creation. |
The cache structure is currently just a directory with each file being represented by a path constructed as: CACHEDIR/UUID/FILENAME. The cached files can be manipulated using standard file system commands (removing, finding, etc.). In this sense, the cache system is minimalist in design.
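The CACHEDIR/UUID/FILENAME layout can be sketched with base R file utilities. The directory, UUID, and file name below are made-up placeholders in a temporary directory, not the real cache location returned by gdc_cache().

```r
# Sketch only: the cache directory and UUID below are made-up placeholders.
cache_dir <- file.path(tempdir(), "gdc_cache_sketch")
uuid <- "00000000-0000-0000-0000-000000000000"
fname <- "example.txt"

# Path layout described above: CACHEDIR/UUID/FILENAME
cached_path <- file.path(cache_dir, uuid, fname)
dir.create(dirname(cached_path), recursive = TRUE, showWarnings = FALSE)
writeLines("placeholder contents", cached_path)

# Standard file-system tools are enough to inspect the cache ...
cached_files <- list.files(cache_dir, recursive = TRUE)
cached_files

# ... or to drop one cached file by removing its UUID directory
unlink(file.path(cache_dir, uuid), recursive = TRUE)
```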
character(1) directory path that serves as the base directory for GenomicDataCommons downloads.
the created directory (invisibly)
gdc_set_cache(): (Re)set the GenomicDataCommons cache directory
gdc_cache()
## Not run:
gdc_set_cache(getwd())
## End(Not run)
This function is a convenience function to find and return the path to the GDC Data Transfer Tool executable assumed to be named 'gdc-client'. The assumption is that the appropriate version of the GDC Data Transfer Tool is a separate download available from https://gdc.cancer.gov/access-data/gdc-data-transfer-tool and as a backup from https://github.com/NCI-GDC/gdc-client.
gdc_client()
The path is checked in the following order:
an R option("gdc_client")
an environment variable GDC_CLIENT
from the search PATH
in the current working directory
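The search order listed above might look roughly like the following base-R sketch. The function name find_gdc_client_sketch is hypothetical, and the real gdc_client() may differ in details such as validation and error messages.

```r
# Hypothetical re-implementation of the documented search order; the real
# gdc_client() may differ in details such as validation and error messages.
find_gdc_client_sketch <- function() {
  candidates <- c(
    getOption("gdc_client", default = ""),   # 1. R option
    Sys.getenv("GDC_CLIENT"),                # 2. environment variable
    unname(Sys.which("gdc-client")),         # 3. search PATH
    file.path(getwd(), "gdc-client")         # 4. current working directory
  )
  # Return the first candidate that is non-empty and exists on disk
  for (path in candidates) {
    if (nzchar(path) && file.exists(path)) return(path)
  }
  stop("gdc-client executable not found")
}

# Point the option at any existing file to see it take precedence
fake_client <- tempfile("gdc-client")
file.create(fake_client)
options(gdc_client = fake_client)
find_gdc_client_sketch()
```

Because the option is consulted first, setting options(gdc_client = ...) in a startup file is one way to pin a specific executable regardless of PATH.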
character(1) the path to the gdc-client executable.
# this cannot run without first
# downloading the GDC Data Transfer Tool
gdc_client = try(gdc_client(),silent=TRUE)
The NCI GDC has a complex data model that allows various studies to supply numerous clinical and demographic data elements. However, across all projects that enter the GDC, there are similarities. This function returns four data.frames associated with case_ids from the GDC.
gdc_clinical(case_ids, include_list_cols = FALSE)
case_ids |
a character() vector of case_ids, typically from "cases" query. |
include_list_cols |
logical(1), whether to include list columns in the "main" data.frame. These list columns have values for aliquots, samples, etc. While these may be useful for some situations, they are generally not that useful as clinical annotations. |
Note that these data.frames can, in general, have different numbers of rows (or even no rows at all). If one wishes to combine to produce a single data.frame, using the approach of left joining to the "main" data.frame will yield a useful combined data.frame. We do not do that directly given the potential for 1:many relationships. It is up to the user to determine what the best approach is for any given dataset.
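The left-join approach and the 1:many caveat can be illustrated with two small made-up data frames. The column names and values are placeholders, and the shared key is assumed to be case_id; real gdc_clinical() output has many more columns.

```r
# Made-up stand-ins for two of the returned data.frames; real column sets
# differ. The shared key is assumed here to be "case_id".
main <- data.frame(case_id = c("c1", "c2", "c3"),
                   updated = c("2020-01-01", "2020-02-01", "2020-03-01"))
diagnoses <- data.frame(case_id = c("c1", "c1", "c3"),
                        primary_diagnosis = c("dx_a", "dx_b", "dx_c"))

# Left join onto "main": every case is kept, cases without a diagnosis get NA,
# and a case with two diagnoses (c1) appears twice (the 1:many situation).
combined <- merge(main, diagnoses, by = "case_id", all.x = TRUE)
combined
nrow(combined)
```

The row count of the joined result exceeds the number of cases whenever a case has more than one matching record, which is why the choice of join is left to the user.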
A list of four data.frames:
main, representing basic case identification and metadata (update date, etc.)
diagnoses
exposures
demographic
case_ids = cases() |> results(size=10) |> ids()
clinical_data = gdc_clinical(case_ids)
# overview of clinical results
class(clinical_data)
names(clinical_data)
sapply(clinical_data, class)
sapply(clinical_data, nrow)
# available data
head(clinical_data$main)
head(clinical_data$demographic)
head(clinical_data$diagnoses)
head(clinical_data$exposures)
The GDC requires an auth token for downloading data that are "controlled access". For example, BAM files for human datasets, germline variant calls, and SNP array raw data are all protected as "controlled access". For these files, a GDC access token is required. See https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Authentication/#gdc-authentication-tokens for details. Note that this function simply returns a string value. It is possible to keep the GDC token in a variable in R or to pass a string directly to the appropriate parameter. This function is simply a convenience function for alternative approaches to get a token from an environment variable or a file.
gdc_token()
This function will resolve locations of the GDC token in the following order:
from the environment variable, GDC_TOKEN, expected to contain the token downloaded from the GDC as a string
using readLines to read a file named in the environment variable, GDC_TOKEN_FILE
using readLines to read from a file called .gdc_token in the user's home directory
If all of these fail, this function will return an error.
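The three-step lookup can be sketched in base R as follows. The function name resolve_token_sketch is hypothetical, reading only the first line of a token file is an assumption, and the token value used in the example is a dummy string.

```r
# Hypothetical sketch of the documented lookup order; not the package's code.
resolve_token_sketch <- function() {
  token <- Sys.getenv("GDC_TOKEN")                 # 1. token held in env var
  if (nzchar(token)) return(invisible(token))

  token_file <- Sys.getenv("GDC_TOKEN_FILE")       # 2. file named in env var
  if (nzchar(token_file) && file.exists(token_file)) {
    # Assumption: the token sits on the first line of the file
    return(invisible(readLines(token_file, n = 1, warn = FALSE)))
  }

  home_file <- path.expand(file.path("~", ".gdc_token"))  # 3. ~/.gdc_token
  if (file.exists(home_file)) {
    return(invisible(readLines(home_file, n = 1, warn = FALSE)))
  }

  stop("No GDC token found")                       # all three lookups failed
}

# Example with a dummy value (not a real token)
Sys.setenv(GDC_TOKEN = "dummy-token-for-illustration")
tok <- resolve_token_sketch()
nchar(tok)
```

Returning the value invisibly mirrors the documented behavior of not printing the token at the console by accident.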
character(1) (invisibly, to protect against inadvertently printing) the GDC token.
https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Cart/#gdc-authentication-tokens
# This will not run before a GDC token
# is in place.
token = try(gdc_token(),silent=TRUE)
Download one or more files from GDC. Files are downloaded using the UUID and renamed to the file name on the remote system. By default, neither the uuid nor the file name on the remote system can exist.
gdcdata(
  uuids,
  use_cached = TRUE,
  progress = interactive(),
  token = NULL,
  access_method = "api",
  transfer_args = character(),
  ...
)
uuids |
character() of GDC file UUIDs. |
use_cached |
logical(1) default TRUE indicating that, if found in the cache, the file will not be downloaded again. If FALSE, all supplied uuids will be re-downloaded. |
progress |
logical(1) default TRUE in interactive sessions, FALSE otherwise, indicating whether a progress bar should be produced for each file download. |
token |
(optional) character(1) security token allowing access to restricted data. See https://gdc-docs.nci.nih.gov/API/Users_Guide/Authentication_and_Authorization/. |
access_method |
character(1), either 'api' or 'client'. See details. |
transfer_args |
character(1), additional arguments to pass to the gdc-client command line. See |
... |
further arguments passed to files |
This function is appropriate for one or several files; for large downloads use manifest to create a manifest for use with the GDC Data Transfer Tool.
When access_method is "api", the GDC "data" endpoint is the transfer mechanism used. The alternative access_method, "client", will utilize the gdc-client transfer tool, which must be downloaded separately and available. See gdc_client for details on specifying the location of the gdc-client executable.
a named vector with file uuids as the names and paths as the value
manifest for downloading large data.
# get some example file uuids
uuids <- files() |>
  filter(~ access == 'open' & file_size < 100000) |>
  results(size = 3) |>
  ids()
# and get the data, placing it into the gdc_cache() directory
gdcdata(uuids, use_cached=TRUE)
This utility function allows quick text-based search of available fields using grep.
grep_fields(entity, pattern, ..., value = TRUE)
entity |
one of the available gdc entities ('files','cases',...) against which to gather available fields for matching |
pattern |
A regular expression that will be used in a call to |
... |
passed on to grep |
value |
logical(1) whether to return values as opposed to indices (passed along to grep) |
character() vector of field names matching pattern
grep_fields('files','analysis')
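The effect of the value argument can be seen with plain grep on a character vector. The field names below are illustrative placeholders rather than an actual listing from available_fields().

```r
# Field names below are illustrative placeholders, not a real field listing.
fields <- c("file_id", "analysis.workflow_type", "analysis.analysis_id",
            "cases.case_id")

# value = TRUE returns the matching names ...
hits <- grep("analysis", fields, value = TRUE)
hits

# ... while value = FALSE returns their positions in the vector
idx <- grep("analysis", fields, value = FALSE)
idx
```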
In many places in the GenomicDataCommons package, the entity ids are stored in a column or a vector with a specific name that corresponds to the field name at the GDC. The format is the entity name (singular) "_id". This generic simply returns that name from a given object.
id_field(x)
## S3 method for class 'GDCQuery'
id_field(x)
## S3 method for class 'GDCResults'
id_field(x)
x |
An object representing the query or results of an entity from the GDC ("cases", "files", "annotations", "projects") |
character(1) such as "case_id", "file_id", etc.
id_field(GDCQuery): GDCQuery method
id_field(GDCResults): GDCResults method
id_field(cases())
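The naming rule described above (singular entity name followed by "_id") can be expressed in one line of base R. The helper id_field_sketch is hypothetical and only illustrates the convention for plural entity names ending in "s"; it is not how the package necessarily implements id_field().

```r
# Hypothetical helper, not the package implementation: derive the id column
# name from a plural entity name by dropping the trailing "s" and adding "_id".
id_field_sketch <- function(entity) {
  paste0(sub("s$", "", entity), "_id")
}

one <- id_field_sketch("cases")
many <- id_field_sketch(c("files", "projects", "annotations"))
one
many
```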
The GDC assigns ids (in the form of uuids) to objects in its database. Those ids can be used for relationships, searching on the website, and as unique ids. All
ids(x)
## S3 method for class 'GDCManifest'
ids(x)
## S3 method for class 'GDCQuery'
ids(x)
## S3 method for class 'GDCResults'
ids(x)
## S3 method for class 'GDCResponse'
ids(x)
x |
A |
a character vector of all the entity ids
# use with a GDC query, in this case for "cases"
ids(cases() |> filter(~ project.project_id == "TCGA-CHOL"))
# also works for responses
ids(response(files()))
# and results
ids(results(cases()))
Searching the NCI GDC allows for complex filtering based on logical operations and simple comparisons. This function facilitates writing such filter expressions in R-like syntax with R code evaluation.
make_filter(expr, available_fields)
expr |
a lazy-wrapped expression or a formula RHS equivalent |
available_fields |
a character vector of the additional names that will be injected into the filter evaluation environment |
If used with available_fields, "bare" fields that are named in the available_fields character vector can be used in the filter expression without quotes.
a list that represents an R version of the JSON that will ultimately be used in an NCI GDC search or other query.
The manifest function/method creates a manifest of files to be downloaded using the GDC Data Transfer Tool. There are methods for creating manifest data frames from GDCQuery objects that contain file information ("cases" and "files" queries).
manifest(x, from = 0, size = count(x), ...)
## S3 method for class 'gdc_files'
manifest(x, from = 0, size = count(x), ...)
## S3 method for class 'GDCfilesResponse'
manifest(x, from = 0, size = count(x), ...)
## S3 method for class 'GDCcasesResponse'
manifest(x, from = 0, size = count(x), ...)
x |
An |
from |
Record number from which to start when returning the manifest. |
size |
The total number of records to return. Default will return the usually desirable full set of records. |
... |
passed to |
A tibble, also of type "gdc_manifest", with five columns:
id
filename
md5
size
state
manifest(gdc_files)
:
manifest(GDCfilesResponse)
:
manifest(GDCcasesResponse)
:
gFiles = files()
shortManifest = gFiles |> manifest(size=10)
head(shortManifest,n=3)
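To hand a manifest to an external tool it usually has to be written to disk. The sketch below writes a made-up data frame with the five documented columns as a tab-separated file; the rows are placeholders, and the assumption that the Data Transfer Tool expects this tab-delimited layout should be checked against its own documentation.

```r
# Made-up manifest rows with the five documented columns.
manifest_df <- data.frame(
  id = c("00000000-0000-0000-0000-000000000001",
         "00000000-0000-0000-0000-000000000002"),
  filename = c("a.txt", "b.txt"),
  md5 = c("d41d8cd98f00b204e9800998ecf8427e",
          "d41d8cd98f00b204e9800998ecf8427e"),
  size = c(10L, 20L),
  state = c("released", "released"),
  stringsAsFactors = FALSE
)

# Write a tab-separated file without quotes or row names
manifest_file <- tempfile(fileext = ".txt")
write.table(manifest_df, manifest_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back to confirm the layout survives the round trip
back <- read.delim(manifest_file, stringsAsFactors = FALSE)
back
```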
Query GDC for available endpoint fields
mapping(endpoint)
endpoint |
character(1) corresponding to endpoints for which users may specify additional or alternative fields. Endpoints include “projects”, “cases”, “files”, and “annotations”. |
A data frame describing the field (field name), full (full data model name), type (data type), and four additional columns describing the "set" to which the fields belong: “default”, “expand”, “multi”, and “nested”.
map <- mapping("projects")
head(map)
# get only the "default" fields
subset(map,defaults)
# And get just the text names of the "default" fields
subset(map,defaults)$field
The basis for all functionality in this package starts with constructing a query in R. The GDCQuery object contains the filters, facets, and other parameters that define the returned results. A token is required for accessing certain datasets.
query(
  entity,
  filters = NULL,
  facets = NULL,
  expand = NULL,
  fields = default_fields(entity),
  ...
)
cases(...)
files(...)
projects(...)
annotations(...)
ssms(...)
ssm_occurrences(...)
cnvs(...)
cnv_occurrences(...)
genes(...)
entity |
character vector, including one of the entities in .gdc_entities |
filters |
a filter list, typically created using |
facets |
a character vector of facets for counting common values. See |
expand |
a character vector of "expands" to include in returned data. See |
fields |
a character vector of fields to return. See |
... |
passed through to |
An S3 object, the GDCQuery object. This is a list with the following members.
filters
facets
fields
expand
archive
token
cases(): convenience constructor for a GDCQuery for cases
files(): convenience constructor for a GDCQuery for files
projects(): convenience constructor for a GDCQuery for projects
annotations(): convenience constructor for a GDCQuery for annotations
ssms(): convenience constructor for a GDCQuery for ssms
ssm_occurrences(): convenience constructor for a GDCQuery for ssm_occurrences
cnvs(): convenience constructor for a GDCQuery for cnvs
cnv_occurrences(): convenience constructor for a GDCQuery for cnv_occurrences
genes(): convenience constructor for a GDCQuery for genes
qcases = query('cases')
# equivalent to:
qcases = cases()
Read DNAcopy results into GRanges object
readDNAcopy(fname, ...)
fname |
The path to a DNAcopy-like file. |
... |
passed to |
a GRanges object
fname = system.file(package='GenomicDataCommons', 'extdata/dnacopy.tsv.gz')
dnac = readDNAcopy(fname)
class(dnac)
length(dnac)
The htseq package is used extensively to count reads relative to regions (see http://www-huber.embl.de/HTSeq/doc/counting.html). The output of htseq-count is a simple two-column table that includes features in column 1 and counts in column 2. This function simply reads in the data from one such file and assigns column names.
readHTSeqFile(fname, samplename = "sample", ...)
fname |
character(1), the path of the htseq-count file. |
samplename |
character(1), the name of the sample. This will become the name of the second column on the resulting |
... |
passed to |
a two-column data frame
fname = system.file(package='GenomicDataCommons', 'extdata/example.htseq.counts.gz')
dat = readHTSeqFile(fname)
head(dat)
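The "read a two-column table and assign column names" step can be sketched with base R alone. The file contents are made up, the reader read_counts_sketch is hypothetical, and the first column name "feature" is an assumption rather than something taken from the package source.

```r
# Write a tiny made-up htseq-count style file: feature <tab> count, no header
tmp <- tempfile(fileext = ".counts")
writeLines(c("geneA\t10", "geneB\t0", "geneC\t25"), tmp)

# Hypothetical reader: the column names "feature" and samplename are
# assumptions about the layout, not taken from the package source.
read_counts_sketch <- function(fname, samplename = "sample") {
  dat <- read.table(fname, sep = "\t", header = FALSE,
                    stringsAsFactors = FALSE)
  names(dat) <- c("feature", samplename)
  dat
}

counts <- read_counts_sketch(tmp, samplename = "s1")
counts
```

Naming the count column after the sample makes it straightforward to merge several such tables by feature later on.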
Fetch GDCQuery metadata from GDC
response(x, ...)
## S3 method for class 'GDCQuery'
response(x, from = 0, size = 10, ..., response_handler = jsonlite::fromJSON)
response_all(x, ...)
x |
a |
... |
passed to httr (good for passing config info, etc.) |
from |
integer index from which to start returning data |
size |
number of records to return |
response_handler |
a function that processes JSON (as text) and returns an R object. Default is |
A GDCResponse object which is a list with the following members:
results
query
aggregations
pages
# basic class stuff
gCases = cases()
resp = response(gCases)
class(resp)
names(resp)
# And results from query
resp$results[[1]]
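The from and size arguments suggest offset-based paging. The arithmetic below is a sketch under that assumption: the total record count is a made-up number, and from is treated as zero-based because its default is 0.

```r
# Hypothetical paging arithmetic; the total is a made-up number.
total <- 45
page_size <- 10

# Zero-based offsets to pass as `from`, one per page
from_values <- seq(0, total - 1, by = page_size)

# Number of records expected on each page (the last page may be short)
page_sizes <- pmin(page_size, total - from_values)

pages <- data.frame(from = from_values, size = page_sizes)
pages
```

Each row of pages corresponds to one call of the form response(x, from = ..., size = ...); response_all() and results_all() exist for the common case of fetching everything at once.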
results
results(x, ...)
## S3 method for class 'GDCQuery'
results(x, ...)
## S3 method for class 'GDCResponse'
results(x, ...)
x |
a |
... |
passed on to |
A (typically nested) list of GDC records
results(GDCQuery)
:
results(GDCResponse)
:
qcases = cases() |> results()
length(qcases)
results_all
results_all(x)
## S3 method for class 'GDCQuery'
results_all(x)
## S3 method for class 'GDCResponse'
results_all(x)
x |
a |
A (typically nested) list of GDC records
results_all(GDCQuery)
:
results_all(GDCResponse)
:
# details of all available projects
projResults = projects() |> results_all()
length(projResults)
count(projects())
S3 generic to set GDCQuery fields
select(x, fields)
## S3 method for class 'GDCQuery'
select(x, fields)
x |
the objects on which to set fields |
fields |
a character vector specifying the fields |
A GDCQuery object, with the fields member altered.
select(GDCQuery): set fields on a GDCQuery object
gProj = projects()
gProj$fields
head(available_fields(gProj))
default_fields(gProj)
gProj |> select(default_fields(gProj)[1:2]) |> response() |> str(max.level=2)
This function returns a BAM file representing reads overlapping regions specified either as chromosomal regions or as gencode gene symbols.
slicing(
  uuid,
  regions,
  symbols,
  destination = file.path(tempdir(), paste0(uuid, ".bam")),
  overwrite = FALSE,
  progress = interactive(),
  token = gdc_token()
)
uuid |
character(1) identifying the BAM file resource |
regions |
character() vector describing chromosomal regions, e.g., |
symbols |
character() vector of gencode gene symbols, e.g., |
destination |
character(1) default |
overwrite |
logical(1) default FALSE can destination be overwritten? |
progress |
logical(1) default |
token |
character(1) security token allowing access to restricted data. Almost all BAM data is restricted, so a token is usually required. See https://docs.gdc.cancer.gov/Data/Data_Security/Data_Security/#authentication-tokens. |
This function uses the Genomic Data Commons "slicing" API to get portions of a BAM file specified either using "regions" or using HGNC gene symbols.
character(1) destination to the downloaded BAM file
## Not run:
slicing("df80679e-c4d3-487b-934c-fcc782e5d46e",
        regions="chr17:75000000-76000000",
        token=gdc_token())

# Get 10 BAM files.
bamfiles = files() |>
    filter(data_format=='BAM') |>
    results(size=10) |>
    ids()

# Current alignments at the GDC are to GRCh38
library('TxDb.Hsapiens.UCSC.hg38.knownGene')
all_genes = genes(TxDb.Hsapiens.UCSC.hg38.knownGene)

first3genes = all_genes[1:3]
# remove strand info
strand(first3genes) = '*'

# We can get our regions easily now
as.character(first3genes)

# Use parallel downloads to speed processing
library(BiocParallel)
register(MulticoreParam())

fnames = bplapply(bamfiles, slicing, overwrite = TRUE,
                  regions=as.character(first3genes))

# 10 BAM files
fnames

library(GenomicAlignments)
lapply(unlist(fnames), readGAlignments)
## End(Not run)
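The regions argument takes strings shaped like the one in the example above ("chr17:75000000-76000000"). As a minimal sketch, a small base R helper can assemble such strings from plain vectors. The name make_regions() is hypothetical and not part of GenomicDataCommons; it only formats text and does not contact the GDC.

```r
# Hypothetical helper (not part of GenomicDataCommons): build
# "chrom:start-end" strings suitable for the `regions` argument.
make_regions <- function(chrom, start, end) {
    stopifnot(
        is.character(chrom),
        is.numeric(start), is.numeric(end),
        length(chrom) == length(start),
        length(start) == length(end),
        all(start <= end)
    )
    # "%.0f" prints whole numbers without scientific notation,
    # so 75000000 does not become "7.5e+07".
    sprintf("%s:%.0f-%.0f", chrom, start, end)
}

regions <- make_regions("chr17", 75000000, 76000000)
regions
```

The resulting character vector could then be passed as regions= in a call to slicing(), which still needs a valid token for restricted BAM files.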
Query the GDC for current status
status(version = NULL)
version | (optional) character(1) version of GDC |
List describing current status.
status()
The GDC maintains a special tool, the GDC Data Transfer Tool (https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Getting_Started/), that enables high-performance, potentially parallel, and resumable downloads. The Data Transfer Tool is an external program that must be downloaded separately. Due to recent changes in the GDC API, the transfer function now validates the version of the 'gdc-client' to ensure reliable downloads.
transfer(uuids, args = character(), token = NULL, overwrite = FALSE)

gdc_client_version_validate(valid_version = .GDC_COMPATIBLE_VERSION)

transfer_help()
uuids | character() vector of GDC file UUIDs |
args | character() vector specifying command-line arguments to be passed to |
token | character(1) containing security token allowing access to restricted data. See https://gdc-docs.nci.nih.gov/API/Users_Guide/Authentication_and_Authorization/. Note that the GDC transfer tool requires a file for data transfer. Therefore, this token will be written to a temporary file (with appropriate permissions set). |
overwrite | logical(1) default FALSE indicating whether existing files with identical names should be overwritten. |
valid_version | character(1) The last known version that works with the current data release, against which the installed 'gdc-client' is validated; not typically changed by the end user. |
character(1) directory path to which the files were downloaded.
gdc_client_version_validate(): If you are using the 'client' option, your 'gdc-client' should be up-to-date (>= 1.3.0).
transfer_help():
## Not run:
uuids = files() |>
    filter(access == "open") |>
    results() |>
    ids()
file_paths <- transfer(uuids)
file_paths
names(file_paths)
# and with authentication
# REQUIRES gdc_token
# destination <- transfer(uuids,token=gdc_token())
## End(Not run)
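The note above asks for a 'gdc-client' that is at least version 1.3.0. As a rough sketch of what such a minimum-version check involves, base R's utils::compareVersion() compares dotted version strings. This is not the package's actual implementation: client_is_recent() is a hypothetical name, and the version strings passed to it are illustrative values only.

```r
# Hypothetical helper (not part of GenomicDataCommons):
# TRUE when `installed` is at least `minimum`.
client_is_recent <- function(installed, minimum = "1.3.0") {
    stopifnot(is.character(installed), length(installed) == 1L)
    # compareVersion() returns -1, 0, or 1
    utils::compareVersion(installed, minimum) >= 0
}

client_is_recent("1.6.1")   # TRUE
client_is_recent("1.2.0")   # FALSE
```

Comparing with compareVersion() rather than plain string comparison matters once a component reaches two digits: "1.10.0" is newer than "1.3.0", although it sorts first as text.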
The manifest method creates a data.frame that represents the data for a manifest file needed by the GDC Data Transfer Tool. While the file format is nothing special, this is a simple helper function to write a manifest data.frame to disk. It returns the path to which the file is written, so it can be used "in-line" in a call to transfer.
write_manifest(manifest, destfile = tempfile())
manifest | A data.frame with five columns, typically created by a call to |
destfile | The filename for saving the manifest. |
character(1) the destination file name.
mf = files() |> manifest(size=10)
write_manifest(mf)
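To make the "write a data.frame to disk and return its path" idea concrete without contacting the GDC, here is a minimal base R sketch. It rests on stated assumptions: the column names (id, filename, md5, size, state) and the tab-delimited layout are assumptions about the manifest format, the rows are made-up placeholders, and write_manifest_sketch() is a hypothetical stand-in rather than the package's own code.

```r
# Made-up placeholder rows; the column names are an assumption.
mf <- data.frame(
    id       = c("uuid-1", "uuid-2"),
    filename = c("a.bam", "b.bam"),
    md5      = c("md5-1", "md5-2"),
    size     = c(100L, 200L),
    state    = c("released", "released")
)

# Hypothetical stand-in for write_manifest(): write a tab-delimited
# file with a header row and return the path that was written.
write_manifest_sketch <- function(manifest, destfile = tempfile()) {
    stopifnot(is.data.frame(manifest))
    write.table(manifest, file = destfile, sep = "\t",
                quote = FALSE, row.names = FALSE, col.names = TRUE)
    destfile
}

path <- write_manifest_sketch(mf)
readLines(path, n = 1)   # header row
```

Returning the path is what allows the "in-line" use described above, where the result of the write step is handed directly to the next call.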