From https://support.bioconductor.org/p/9138939/.
I made a small change to the filtering expression approach based on
changes to lazy evaluation best practices. There is now no need to
include the ~ in the filter expression. So:
library(GenomicDataCommons)
# Build a query for TCGA-COAD RNA-Seq aligned-read BAM files
q = files() |>
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM')
And get a count of the results:
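The call that produced the number below is not shown here. A minimal sketch, assuming the count() function from GenomicDataCommons and a working network connection to the GDC API (the variable name n_files is only for illustration):

```r
library(GenomicDataCommons)

# Same query as above, rebuilt so this snippet runs on its own.
q = files() |>
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM')

# count() asks the GDC API how many records match the query.
# The namespace prefix avoids picking up dplyr::count() if dplyr is attached.
n_files = GenomicDataCommons::count(q)
n_files
```

The number returned depends on the current GDC data release, so it may differ from the value shown below.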
## [1] 1188
And the manifest.
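The manifest call is also not shown. A likely form, assuming the manifest() function from GenomicDataCommons and the same query as above (file_manifest is a name chosen here for illustration):

```r
library(GenomicDataCommons)

# Same query as above, rebuilt so this snippet runs on its own.
q = files() |>
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM')

# manifest() retrieves one row of file metadata per matching file
# (requires network access to the GDC API).
file_manifest = manifest(q)
file_manifest
```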
## # A tibble: 1,188 × 26
## id data_format access file_name wgs_coverage submitter_id data_category
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 b5b0324… BAM contr… 51b2184b… Not Applica… 55d8a305-46… Sequencing R…
## 2 fb0ea22… BAM contr… 6fdaf546… Not Applica… 2c005691-0d… Sequencing R…
## 3 87da2a2… BAM contr… ed61dd33… Not Applica… 4fb57f6f-0e… Sequencing R…
## 4 79126fe… BAM contr… 6fdaf546… Not Applica… 1c69757e-91… Sequencing R…
## 5 c91e5d6… BAM contr… 05167d53… Not Applica… b6d76daf-7b… Sequencing R…
## 6 8cd0db1… BAM contr… ee24b470… Not Applica… c21601db-c9… Sequencing R…
## 7 6f47823… BAM contr… b83a1cf6… Not Applica… 07d697d8-1a… Sequencing R…
## 8 6d0b8cc… BAM contr… da8c5a43… Not Applica… c99eab1e-c5… Sequencing R…
## 9 085a55a… BAM contr… 8939877c… Not Applica… 7365a3e3-c0… Sequencing R…
## 10 fa292a9… BAM contr… 48128da3… Not Applica… 1b0da154-3e… Sequencing R…
## # ℹ 1,178 more rows
## # ℹ 19 more variables: acl_1 <chr>, type <chr>, platform <chr>,
## # file_size <dbl>, created_datetime <chr>, md5sum <chr>,
## # updated_datetime <chr>, file_id <chr>, data_type <chr>, state <chr>,
## # experimental_strategy <chr>, proportion_reads_mapped <dbl>,
## # proportion_base_mismatch <dbl>, pairs_on_diff_chr <int>, total_reads <int>,
## # proportion_reads_duplicated <int>, average_base_quality <int>, …
Your question about race and ethnicity is a good one.
And we can grep for race or ethnic to get potential matching fields to look at.
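The code for this step is not shown. A sketch of one way to do it, assuming available_fields() is used to list the fields of the files endpoint (all_fields is an illustrative name, and the call needs network access):

```r
library(GenomicDataCommons)

# All field names that can be used with the files endpoint.
all_fields = available_fields('files')

# Keep field names containing "race" or "ethnic". This is a plain substring
# match, so unrelated names such as "...contraceptive..." or "...tracer..."
# also match because they contain "race".
matching_fields = grep('race|ethnic', all_fields, value = TRUE)
matching_fields
```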
## [1] "cases.demographic.ethnicity"
## [2] "cases.demographic.race"
## [3] "cases.follow_ups.hormonal_contraceptive_type"
## [4] "cases.follow_ups.hormonal_contraceptive_use"
## [5] "cases.follow_ups.other_clinical_attributes.hormonal_contraceptive_type"
## [6] "cases.follow_ups.other_clinical_attributes.hormonal_contraceptive_use"
## [7] "cases.follow_ups.scan_tracer_used"
Now, we can check available values for each field to determine how to complete our filter expressions.
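The two listings below were presumably produced by calls like these, assuming available_values() with the files endpoint and the two demographic fields found above (the result names are illustrative):

```r
library(GenomicDataCommons)

# Values observed for each field across the files endpoint
# (requires network access to the GDC API).
ethnicity_values = available_values('files', 'cases.demographic.ethnicity')
race_values = available_values('files', 'cases.demographic.race')

ethnicity_values
race_values
```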
## [1] "not hispanic or latino" "not reported" "hispanic or latino"
## [4] "unknown" "_missing"
## [1] "white"
## [2] "not reported"
## [3] "black or african american"
## [4] "asian"
## [5] "unknown"
## [6] "other"
## [7] "american indian or alaska native"
## [8] "native hawaiian or other pacific islander"
## [9] "not allowed to collect"
## [10] "_missing"
We can complete our filter expression now to limit to white race only.
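The completed query is not shown. A minimal sketch, assuming the race condition is simply added to the original expression with & (q_white_only is a name chosen here for illustration). The full expression is written out rather than chaining a second filter() onto q, so the result does not depend on how repeated filter() calls are combined:

```r
library(GenomicDataCommons)

# Original file criteria plus the race restriction, in one expression.
q_white_only = files() |>
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM' &
      cases.demographic.race == 'white')

# Count and manifest for the restricted query (both need network access).
n_white = GenomicDataCommons::count(q_white_only)
n_white
manifest(q_white_only)
```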
## [1] 695
## # A tibble: 695 × 26
## id data_format access file_name wgs_coverage submitter_id data_category
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 fb0ea22… BAM contr… 6fdaf546… Not Applica… 2c005691-0d… Sequencing R…
## 2 79126fe… BAM contr… 6fdaf546… Not Applica… 1c69757e-91… Sequencing R…
## 3 8cd0db1… BAM contr… ee24b470… Not Applica… c21601db-c9… Sequencing R…
## 4 fa292a9… BAM contr… 48128da3… Not Applica… 1b0da154-3e… Sequencing R…
## 5 48aab61… BAM contr… 7f9dadec… Not Applica… 4f46c37e-a7… Sequencing R…
## 6 d0a01de… BAM contr… 57c4a274… Not Applica… 7a9907c9-63… Sequencing R…
## 7 c296200… BAM contr… 4c7083b3… Not Applica… faecf96e-9e… Sequencing R…
## 8 85535fa… BAM contr… f40a9044… Not Applica… 3131b3d3-2c… Sequencing R…
## 9 5598442… BAM contr… ec054d40… Not Applica… 8f7601b9-cf… Sequencing R…
## 10 3f50c43… BAM contr… 7fc1ca50… Not Applica… b935c54c-6d… Sequencing R…
## # ℹ 685 more rows
## # ℹ 19 more variables: acl_1 <chr>, type <chr>, platform <chr>,
## # file_size <dbl>, created_datetime <chr>, md5sum <chr>,
## # updated_datetime <chr>, file_id <chr>, data_type <chr>, state <chr>,
## # experimental_strategy <chr>, proportion_reads_mapped <dbl>,
## # proportion_base_mismatch <dbl>, pairs_on_diff_chr <int>, total_reads <int>,
## # proportion_reads_duplicated <int>, average_base_quality <int>, …
I would like to get the number of cases added (created; any logical datetime would suffice here) to the TCGA project by experiment type. I attempted to get this data via the GenomicDataCommons package, but I believe it is giving me the number of files for a given experiment type rather than the number of cases. How can I get the number of cases for which there is RNA-Seq data?
library(GenomicDataCommons)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:GenomicDataCommons':
##
## count, filter, select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Start from cases() so that counts refer to cases, not files
cases() |>
  GenomicDataCommons::filter(
    ~ project.program.name=='TCGA' & files.experimental_strategy=='RNA-Seq'
  ) |>
  # Bucket the matching cases by file creation timestamp
  facet(c("files.created_datetime")) |>
  aggregations() |>
  # Flatten the single-facet result into a doc_count/key table
  unname() |>
  unlist(recursive = FALSE) |>
  as_tibble() |>
  dplyr::arrange(dplyr::desc(key))
## # A tibble: 200 × 2
## doc_count key
## <int> <chr>
## 1 164 2024-06-14t14:27:00.916424-05:00
## 2 412 2024-06-14t13:28:10.644120-05:00
## 3 151 2023-03-09t00:35:51.387873-06:00
## 4 79 2023-02-19t04:41:11.008116-06:00
## 5 458 2023-02-19t04:36:10.605050-06:00
## 6 80 2023-02-19t04:28:49.400023-06:00
## 7 178 2023-02-19t04:23:49.092629-06:00
## 8 516 2023-02-19t04:18:49.453628-06:00
## 9 179 2023-02-19t04:13:47.877168-06:00
## 10 290 2023-02-19t04:08:47.478925-06:00
## # ℹ 190 more rows
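The table above has one row per distinct files.created_datetime value. For the narrower question of how many TCGA cases have RNA-Seq data at all, a minimal sketch is to count the filtered cases query directly. This assumes the same two filter fields used above and needs network access to the GDC API; n_cases is an illustrative name.

```r
library(GenomicDataCommons)

# The query starts from cases(), so count() reports the number of
# case records matching the filter, not the number of files.
n_cases = cases() |>
  GenomicDataCommons::filter(
    project.program.name == 'TCGA' &
      files.experimental_strategy == 'RNA-Seq') |>
  GenomicDataCommons::count()
n_cases
```

The namespace prefixes matter here because dplyr, if attached, masks both filter() and count().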