Title: | The GCP R Client for the AnVIL |
---|---|
Description: | The package provides a set of functions to interact with the Google Cloud Platform (GCP) services on the AnVIL platform. The package is designed to work with the AnVIL package. User-level interaction with this package should be minimal. |
Authors: | Marcel Ramos [aut, cre] , Nitesh Turaga [aut], Martin Morgan [aut] |
Maintainer: | Marcel Ramos <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.1.1 |
Built: | 2024-10-30 17:14:42 UTC |
Source: | https://github.com/Bioconductor/AnVILGCP |
avtable_import_status()
queries for the status of an
'asynchronous' table import.
avfiles_ls()
returns the paths of files in the
workspace bucket. avfiles_backup()
copies files from the
compute node file system to the workspace bucket.
avfiles_restore()
copies files from the workspace bucket to
the compute node file system. avfiles_rm()
removes files or
directories from the workspace bucket.
avruntimes()
returns a tibble containing information
about runtimes (notebooks or RStudio instances, for example)
that the current user has access to.
avruntime()
returns a tibble with the runtimes
associated with a particular google project and account number;
usually there is a single runtime satisfying these criteria,
and it is the runtime active in AnVIL.
avdisks()
returns a tibble containing information about persistent disks associated with the current user.
avtable_paged(
    table,
    n = Inf,
    page = 1L,
    pageSize = 1000L,
    sortField = "name",
    sortDirection = c("asc", "desc"),
    filterTerms = character(),
    filterOperator = c("and", "or"),
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    na = c("", "NA")
)

avtable_import_status(
    job_status,
    namespace = avworkspace_namespace(),
    name = avworkspace_name()
)

avfiles_ls(
    path = "",
    full_names = FALSE,
    recursive = FALSE,
    namespace = avworkspace_namespace(),
    name = avworkspace_name()
)

avfiles_backup(
    source,
    destination = "",
    recursive = FALSE,
    parallel = TRUE,
    namespace = avworkspace_namespace(),
    name = avworkspace_name()
)

avfiles_restore(
    source,
    destination = ".",
    recursive = FALSE,
    parallel = TRUE,
    namespace = avworkspace_namespace(),
    name = avworkspace_name()
)

avfiles_rm(
    source,
    recursive = FALSE,
    parallel = TRUE,
    namespace = avworkspace_namespace(),
    name = avworkspace_name()
)

avruntimes()

avruntime(project = gcloud_project(), account = gcloud_account())

avdisks()
table |
character(1) table name as returned by, e.g., |
n |
numeric(1) maximum number of rows to return |
page |
integer(1) first page of iteration |
pageSize |
integer(1) number of records per page. Generally, larger page sizes are more efficient. |
sortField |
character(1) field used to sort records when determining page order. Default is the entity field. |
sortDirection |
character(1) direction to sort entities
( |
filterTerms |
character(1) string literal used to select rows with an exact (substring) match in a column. |
filterOperator |
character(1) operator to use when multiple
terms in |
namespace |
|
name |
|
na |
in |
job_status |
tibble() of job identifiers, returned by
|
path |
For |
full_names |
logical(1) return names relative to |
recursive |
logical(1) list files recursively? |
source |
character() file paths. for |
destination |
character(1) a google bucket
( |
parallel |
logical(1) backup files using parallel transfer?
See |
project |
|
account |
|
avfiles_backup()
can be used to back up individual files
or entire directories, recursively. When recursive = FALSE
,
files are backed up to the bucket with names approximately
paste0(destination, "/", basename(source))
. When recursive = TRUE
and source is a directory 'path/to/foo/', files are backed up to bucket names that include the directory name, approximately
paste0(destination, "/", dir(basename(source), full.names = TRUE))
. Naming conventions are described in detail in
gsutil_help("cp")
.
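The naming rules described above can be explored locally without contacting a bucket. The following is a minimal sketch, not the package's implementation: the bucket gs://example-bucket and the temporary directory layout are hypothetical, and the names actually produced are governed by gsutil cp.

```r
## Hypothetical destination bucket; no cloud access is needed for this sketch
destination <- "gs://example-bucket"

## Build a small local directory 'foo' containing two files
root <- tempfile("backup-demo-")
source_dir <- file.path(root, "foo")
dir.create(source_dir, recursive = TRUE)
invisible(file.create(file.path(source_dir, c("a.txt", "b.txt"))))

## recursive = FALSE: each file lands directly under the destination
source_files <- dir(source_dir, full.names = TRUE)
flat_names <- paste0(destination, "/", basename(source_files))

## recursive = TRUE: the directory name becomes part of the bucket path
recursive_names <- paste0(
    destination, "/", basename(source_dir), "/", dir(source_dir)
)

print(flat_names)
print(recursive_names)
```

Comparing the two vectors shows why backing up a directory recursively keeps files from different directories apart, whereas a non-recursive backup of files with the same basename would collide.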
avfiles_restore()
behaves in a manner analogous to
avfiles_backup()
, copying files from the workspace bucket to
the compute node file system.
avtable_paged()
: a tibble of data corresponding to the
AnVIL table table
in the specified workspace.
avfiles_ls()
returns a character vector of files in the
workspace bucket.
avfiles_backup()
returns, invisibly, the status code of the
avcopy()
command used to back up the files.
avfiles_rm()
on success, returns a list of the return
codes of avremove()
, invisibly.
avruntimes()
returns a tibble with columns
id: integer() runtime identifier.
googleProject: character() billing account.
tool: character() e.g., "Jupyter", "RStudio".
status: character() e.g., "Stopped", "Running".
creator: character() AnVIL account, typically "[email protected]".
createdDate: character() creation date.
destroyedDate: character() destruction date, or NA.
dateAccessed: character() date of (first?) access.
runtimeName: character().
clusterServiceAccount: character() service ('pet') account for this runtime.
masterMachineType: character() (it is unclear which 'tool' populates which of the machineType columns).
workerMachineType: character().
machineType: character().
persistentDiskId: integer() identifier of persistent disk (see
avdisks()
), or NA
.
avruntime()
returns a tibble with the same structure as
the return value of avruntimes()
.
avdisks()
returns a tibble with columns
id: character() disk identifier.
googleProject: character() billing account.
status: e.g., "Ready".
size: integer() in GB.
diskType: character().
blockSize: integer().
creator: character() AnVIL account, typically "[email protected]".
createdDate: character() creation date.
destroyedDate: character() destruction date, or NA.
dateAccessed: character() date of (first?) access.
zone: character() e.g., "us-central1-a".
name: character().
library(AnVILBase)
if (has_avworkspace(platform = gcp()))
    avfiles_ls()

library(AnVILBase)
if (has_avworkspace(platform = gcp()) && interactive()) {
    ## backup all files in the current directory
    ## default buckets are gs://<bucket-id>/<file-names>
    avfiles_backup(dir())
    ## backup working directory, recursively
    ## default buckets are gs://<bucket-id>/<basename(getwd())>/...
    avfiles_backup(getwd(), recursive = TRUE)
}

if (has_avworkspace(platform = gcp()))
    ## from within AnVIL
    avruntimes()

if (has_avworkspace(strict = TRUE, platform = gcp()))
    ## from within AnVIL
    avdisks()
avdata()
returns key-value tables representing the
information visualized under the DATA tab, 'REFERENCE DATA' and
'OTHER DATA' items. avdata_import()
updates (modifies or
creates new, but does not delete) rows in 'REFERENCE DATA' or
'OTHER DATA' tables.
avdata(namespace = avworkspace_namespace(), name = avworkspace_name())

avdata_import(
    .data,
    namespace = avworkspace_namespace(),
    name = avworkspace_name()
)
namespace |
|
name |
|
.data |
A tibble or data.frame for import as an AnVIL table. |
avdata()
returns a tibble with five columns: "type"
represents the origin of the data from the 'REFERENCE' or
'OTHER' data menus; "table"
is the table name in the
REFERENCE
menu, or 'workspace' for the table in the 'OTHER'
menu. The remaining three columns are the key used to access the
data element, the value label associated with the data element,
and the value (e.g., google bucket) of the element.
avdata_import()
returns, invisibly, the subset of the
input table used to update the AnVIL tables.
library(AnVILBase)
if (has_avworkspace(strict = TRUE, platform = gcp())) {
    ## from within AnVIL
    data <- avdata()
    data
    if (interactive())
        avdata_import(data)
}
avnotebooks()
returns the names of the notebooks
associated with the current workspace.
## S4 method for signature 'gcp'
avnotebooks(
    local = FALSE,
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    ...,
    platform = cloud_platform()
)

## S4 method for signature 'gcp'
avnotebooks_localize(
    destination,
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    dry = TRUE,
    ...,
    platform = cloud_platform()
)

## S4 method for signature 'gcp'
avnotebooks_delocalize(
    source,
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    dry = TRUE,
    ...,
    platform = cloud_platform()
)
local |
= |
namespace |
|
name |
|
... |
Additional arguments passed to lower level functions (not used). |
platform |
|
destination |
missing or character(1) file path to the local
file system directory for synchronization. The default location
is |
dry |
|
source |
missing or character(1) file path to the local file
system directory for synchronization. The default location is
|
avnotebooks()
returns a character vector of buckets /
files located in the workspace 'Files/notebooks' bucket path,
or on the local file system.
avnotebooks_localize()
returns the exit status of
gsutil_rsync()
.
avnotebooks_delocalize()
returns the exit status of
gsutil_rsync()
.
avnotebooks(gcp)
: List notebooks in the workspace
avnotebooks_localize(gcp)
: Synchronizes the content of the workspace
bucket to the local file system.
avnotebooks_delocalize(gcp)
: Synchronizes the content of the notebook
location of the local file system to the workspace bucket.
library(AnVILBase)
if (has_avworkspace(strict = TRUE, platform = gcp())) {
    avnotebooks()
    avnotebooks_localize()  # dry run
    try(avnotebooks_delocalize())  # dry run, fails if no local resource
}
Tables can be visualized under the DATA tab, TABLES
item. avtable()
returns an AnVIL table. avtable_paged()
retrieves an AnVIL table by requesting the table in 'chunks',
and may be appropriate for large tables. avtable_import()
imports a data.frame to an AnVIL table. avtable_import_set()
imports set membership (i.e., a subset of an existing table)
information to an AnVIL table. avtable_delete_values()
removes rows from an AnVIL table.
## S4 method for signature 'gcp'
avtables(
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    ...,
    platform = cloud_platform()
)

## S4 method for signature 'gcp'
avtable(
    table,
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    na = c("", "NA"),
    ...,
    platform = cloud_platform()
)

## S4 method for signature 'gcp'
avtable_import(
    .data,
    entity = names(.data)[[1L]],
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    delete_empty_values = FALSE,
    na = "NA",
    n = Inf,
    page = 1L,
    pageSize = NULL,
    ...,
    platform = cloud_platform()
)

## S4 method for signature 'gcp'
avtable_import_set(
    .data,
    origin,
    set = names(.data)[[1]],
    member = names(.data)[[2]],
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    delete_empty_values = FALSE,
    na = "NA",
    n = Inf,
    page = 1L,
    pageSize = NULL,
    ...,
    platform = cloud_platform()
)

## S4 method for signature 'gcp'
avtable_delete(
    table,
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    ...,
    platform = cloud_platform()
)

## S4 method for signature 'gcp'
avtable_delete_values(
    table,
    values,
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    ...,
    platform = cloud_platform()
)
namespace |
|
name |
|
... |
Additional arguments passed to lower level functions (not used). |
platform |
|
table |
character(1) table name as returned by, e.g., |
na |
in |
.data |
A tibble or data.frame for import as an AnVIL table. |
entity |
|
delete_empty_values |
logical(1) when |
n |
numeric(1) maximum number of rows to return |
page |
integer(1) first page of iteration |
pageSize |
integer(1) number of records per page. Generally, larger page sizes are more efficient. |
origin |
character(1) name of the entity (table) used to create the set e.g "sample", "participant", etc. |
set |
|
member |
|
values |
vector of values in the entity (key) column of
|
Treatment of missing values in avtable()
,
avtable_paged()
and avtable_import()
is handled by the
na
parameter.
avtable()
may sometimes result in a curl error 'Error in
curl::curl_fetch_memory' or an 'Internal Server Error (HTTP
500)'. This may be due to a server time-out when trying to read
a large (more than 50,000 rows?) table; using avtable_paged()
may address this problem.
For avtable()
and avtable_paged()
, the default na = c("", "NA")
treats empty cells or cells containing "NA" in a Terra
data table as NA_character_
in R. Use na = character()
to
indicate no missing values, na = "NA"
to retain the
distinction between ""
and NA_character_
.
For avtable_import()
, the default na = "NA"
records
NA_character_
in R as the character string "NA"
in an AnVIL
data table.
The default setting (na = "NA"
in avtable_import()
,
na = c("", "NA")
in avtable()
) is appropriate to
'round-trip' data from R to AnVIL and back when character vectors
contain only NA_character_
. Use na = "NA"
in both functions to
round-trip data containing both NA_character_
and the empty string "". Use
a distinct string, e.g., na = "__MISSING_VALUE__"
, for both
arguments if the data contains a string "NA"
as well as
NA_character_
.
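The round-trip behaviour of the na argument can be reasoned about with two small stand-in functions. This is a sketch under stated assumptions: to_anvil() and from_anvil() are hypothetical helpers that only mimic the documented semantics (NA_character_ written as a string on import, listed strings read back as NA_character_); they are not package functions and do not contact AnVIL.

```r
## Hypothetical stand-in for the export side: NA_character_ becomes `na`
to_anvil <- function(x, na = "NA") {
    x[is.na(x)] <- na
    x
}

## Hypothetical stand-in for the read side: strings listed in `na` become NA
from_anvil <- function(x, na = c("", "NA")) {
    x[x %in% na] <- NA_character_
    x
}

## Defaults round-trip a vector whose only special value is NA_character_
x <- c("a", NA_character_)
default_trip <- from_anvil(to_anvil(x))

## na = "NA" on both sides keeps "" distinct from NA_character_
y <- c("a", "", NA_character_)
empty_trip <- from_anvil(to_anvil(y, na = "NA"), na = "NA")

## A literal "NA" string needs a distinct sentinel on both sides
z <- c("a", "NA", NA_character_)
sentinel <- "__MISSING_VALUE__"
sentinel_trip <- from_anvil(to_anvil(z, na = sentinel), na = sentinel)

## With the defaults, the literal "NA" is lost
lossy_trip <- from_anvil(to_anvil(z))

identical(default_trip, x)
identical(empty_trip, y)
identical(sentinel_trip, z)
identical(lossy_trip, z)
```

The last comparison is the one that differs from the others: with the defaults, the literal string "NA" cannot be told apart from a missing value once it has been written to the table.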
avtable_import()
tries to work around limitations in
.data
size in the AnVIL platform, using pageSize
(number of
rows) to import so that approximately 1500000 elements (rows x
columns) are uploaded per chunk. For large .data
, a progress
bar summarizes progress on the import. Individual chunks may
nonetheless fail to upload, with common reasons being an
internal server error (HTTP error code 500) or transient
authorization failure (HTTP 401). In these and other cases
avtable_import()
reports the failed page(s) as warnings. The
user can attempt to import these individually using the page
argument. If many pages fail to import, a strategy might be to
provide an explicit pageSize
less than the automatically
determined size.
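To get a feel for the chunking described above, the page size can be estimated from the number of columns. This is only a sketch: suggest_page_size() and count_pages() are hypothetical helpers, and the exact rule used inside avtable_import() is assumed rather than taken from the package source.

```r
## Target from the text above: roughly 1,500,000 elements (rows x columns)
elements_per_chunk <- 1500000

## Hypothetical helper: rows per page that keep a chunk near the target
suggest_page_size <- function(.data, target = elements_per_chunk) {
    max(1L, as.integer(floor(target / ncol(.data))))
}

## Hypothetical helper: number of pages needed for a given page size
count_pages <- function(.data, pageSize = suggest_page_size(.data)) {
    as.integer(ceiling(nrow(.data) / pageSize))
}

auto_size <- suggest_page_size(mtcars)              # mtcars has 11 columns
auto_pages <- count_pages(mtcars)                   # small table, single page
small_pages <- count_pages(mtcars, pageSize = 10L)  # explicit, smaller pages

print(c(auto_size = auto_size, auto_pages = auto_pages, small_pages = small_pages))
```

Knowing the page count helps when retrying: a failed page reported in a warning can be re-imported on its own by passing the same pageSize together with the page argument.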
avtable_import_set()
creates new rows in a table
<origin>_set
. One row will be created for each distinct value
in the column identified by set
. Each row entry has a
corresponding column <origin>
linking to one or more rows in
the <origin>
table, as given in the member
column. The
operation is somewhat like split(member, set)
.
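The split(member, set) analogy can be tried locally with the mtcars data used in the examples on this page. The sketch below only illustrates the grouping; it does not create any AnVIL table, and the column names model_id and cyl are taken from the example code rather than required by the package.

```r
## Local table with a key column (model_id) and a grouping column (cyl)
tbl <- data.frame(
    model_id = gsub(" ", "-", rownames(mtcars)),
    cyl = mtcars$cyl
)

## One list element per distinct value of the 'set' column; each element
## holds the 'member' values that would be linked from that set row
sets <- split(tbl$model_id, tbl$cyl)

names(sets)
lengths(sets)
```

Each list element corresponds to one row of the would-be model_set table, and its contents correspond to the rows of the model table that the set row links to.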
avtables()
: A tibble with columns identifying the table,
the number of records, and the column names.
avtable()
: a tibble of data corresponding to the AnVIL
table table
in the specified workspace.
avtable_import_set()
returns a character(1)
name of the
imported AnVIL tibble.
avtable_delete()
returns TRUE
if the table is successfully
deleted.
avtable_delete_values()
returns a tibble
representing
deleted entities, invisibly.
avtables(gcp)
: avtables()
describes tables available in a
workspace
avtable(gcp)
: avtable()
retrieves a table from an AnVIL
workspace
avtable_import(gcp)
: upload a table to the DATA tab
avtable_import_set(gcp)
:
avtable_delete(gcp)
: Delete a table from the AnVIL workspace.
avtable_delete_values(gcp)
:
if (interactive()) {
    avtables("waldronlab-terra", "Tumor_Only_CNV")
    avtable("participant", "waldronlab-terra", "Tumor_Only_CNV")

    library(dplyr)
    ## mtcars dataset
    mtcars_tbl <-
        mtcars |>
        as_tibble(rownames = "model_id") |>
        mutate(model_id = gsub(" ", "-", model_id))

    avworkspace("waldronlab-terra/mramos-wlab-gcp-0")

    avstatus <- avtable_import(mtcars_tbl)
    avtable_import_status(avstatus)

    set_status <-
        avtable("model") |>
        avtable_import_set("model", "cyl", "model_id")
    avtable_import_status(set_status)

    ## won't be able to delete a row that is referenced in another table
    avtable_delete_values("model", "Mazda-RX4")

    ## delete the set
    avtable_delete("model_set")

    ## then delete the row
    avtable_delete_values("model", "Mazda-RX4")

    ## recreate the set (if needed)
    avtable("model") |>
        avtable_import_set("model", "cyl", "model_id")
}

library(AnVILBase)
if (has_avworkspace(platform = gcp()) && interactive()) {
    ## editable copy of '1000G-high-coverage-2019' workspace
    avworkspace("bioconductor-rpci-anvil/1000G-high-coverage-2019")
    sample <-
        avtable("sample") %>%                               # existing table
        mutate(set = sample(head(LETTERS), nrow(.), TRUE))  # arbitrary groups
    sample %>%                                   # new 'participant_set' table
        avtable_import_set("participant", "set", "participant")
    sample %>%                                   # new 'sample_set' table
        avtable_import_set("sample", "set", "name")
}
Functions on this help page facilitate getting,
updating, and setting workflow configuration parameters. See
?avworkflow
for additional relevant functionality.
avworkflow_namespace()
and avworkflow_name()
are
utility functions to record the workflow namespace and name
required when working with workflow
configurations. avworkflow()
provides a convenient way to
provide workflow namespace and name in a single command,
namespace/name
.
avworkflow_configuration_get()
returns a list structure
describing an existing workflow configuration.
avworkflow_configuration_inputs()
returns a
data.frame template for the inputs defined in a workflow
configuration. This template can be used to provide custom
inputs for a configuration.
avworkflow_configuration_outputs()
returns a
data.frame template for the outputs defined in a workflow
configuration. This template can be used to provide custom
outputs for a configuration.
avworkflow_configuration_update()
returns a list structure
describing a workflow configuration with updated inputs and / or outputs.
avworkflow_configuration_set()
updates an existing
configuration in Terra / AnVIL, e.g., changing inputs to the
workflow.
avworkflow_configuration_template()
returns a
template for defining workflow configurations. This template
can be used as a starting point for providing a custom
configuration.
avworkflow_namespace(workflow_namespace = NULL)

avworkflow_name(workflow_name = NULL)

avworkflow(workflow = NULL)

avworkflow_configuration_get(
    workflow_namespace = avworkflow_namespace(),
    workflow_name = avworkflow_name(),
    namespace = avworkspace_namespace(),
    name = avworkspace_name()
)

avworkflow_configuration_inputs(config)

avworkflow_configuration_outputs(config)

avworkflow_configuration_update(
    config,
    inputs = avworkflow_configuration_inputs(config),
    outputs = avworkflow_configuration_outputs(config)
)

avworkflow_configuration_set(
    config,
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    dry = TRUE
)

avworkflow_configuration_template()

## S3 method for class 'avworkflow_configuration'
print(x, ...)
workflow_namespace |
character(1) AnVIL workflow namespace, as
returned by, e.g., the |
workflow_name |
character(1) AnVIL workflow name, as returned
by, e.g., the |
workflow |
character(1) representing the combined workflow
namespace and name, as |
namespace |
|
name |
|
config |
a named list describing the full configuration, e.g.,
created from editing the return value of
|
inputs |
the new inputs to be updated in the workflow configuration. If none are specified, the inputs from the original configuration will be used and no changes will be made. |
outputs |
the new outputs to be updated in the workflow configuration. If none are specified, the outputs from the original configuration will be used and no changes will be made. |
dry |
logical(1) when |
x |
Object of class |
... |
additional arguments to |
The exact format of the configuration is important.
One common problem is that a scalar character vector "bar"
is
interpreted as a json 'array' ["bar"]
rather than a json string
"bar"
. Enclose the string with jsonlite::unbox("bar")
in the
configuration list if the length 1 character vector in R is to be
interpreted as a json string.
A second problem is that an unquoted unboxed character string
unbox("foo")
is required by AnVIL to be quoted. This is reported
as a warning() about invalid inputs or outputs, and the solution is
to provide a quoted string unbox('"foo"')
.
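Both problems can be seen by serializing a small list with jsonlite. This is a minimal sketch: the list element name input is hypothetical, and the exact serialization call used internally by the package is an assumption; only jsonlite::toJSON() and jsonlite::unbox() are used.

```r
library(jsonlite)

## A length 1 character vector is serialized as a json array
array_json <- toJSON(list(input = "bar"))

## unbox() marks the value as a scalar, giving a json string
string_json <- toJSON(list(input = unbox("bar")))

## A value that AnVIL requires to be quoted carries its own quotes
quoted_json <- toJSON(list(input = unbox('"foo"')))

print(array_json)
print(string_json)
print(quoted_json)
```

In the third case the inner quotes are part of the value itself, which is why they appear escaped in the serialized json.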
avworkflow_namespace()
, and avworkflow_name()
return
character(1)
identifiers. avworkflow()
returns the
character(1) concatenated namespace and name. The value
returned by avworkflow_name()
will be percent-encoded (e.g.,
spaces " "
replaced by "%20"
).
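Percent-encoding of this kind can be reproduced with base R. The sketch below uses utils::URLencode() on a hypothetical workflow name; whether the package uses this function or another encoder internally is an assumption.

```r
## Hypothetical workflow name containing a space
workflow_name <- "my workflow"

## Spaces are replaced by "%20"; URLdecode() reverses the encoding
encoded <- utils::URLencode(workflow_name)
decoded <- utils::URLdecode(encoded)

print(encoded)
print(decoded)
```

Decoding is useful when the encoded value needs to be compared with the workflow name as displayed in the workspace.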
avworkflow_configuration_get()
returns a list structure
describing the configuration. See
avworkflow_configuration_template()
for the structure of a
typical workflow.
avworkflow_configuration_inputs()
returns a data.frame
providing a template for the configuration inputs, with the
following columns:
inputType
name
optional
attribute
The only column of interest to the user is the attribute
column; this is the column that should be changed for
customization.
avworkflow_configuration_outputs()
returns a data.frame
providing a template for the configuration outputs, with the
following columns:
name
outputType
attribute
The only column of interest to the user is the attribute
column; this is the column that should be changed for
customization.
avworkflow_configuration_update()
returns a list structure
describing the updated configuration.
avworkflow_configuration_set()
returns an object
describing the updated configuration. The return value includes
invalid or unused elements of the config
input. Invalid or
unused elements of config
are also reported as a warning.
avworkflow_configuration_template()
returns a list
providing a template for configuration lists, with the
following structure:
namespace character(1) configuration namespace.
name character(1) configuration name.
rootEntityType character(1) or missing. the name of the table
(from avtables()
) containing the entities referenced in
inputs, etc., by the keyword 'this.'
prerequisites named list (possibly empty) of prerequisites.
inputs named list (possibly empty) of inputs. Form of input
depends on method, and might include, e.g., a reference to a
field in a table referenced by avtables()
or a character string
defining an input constant.
outputs named list (possibly empty) of outputs.
methodConfigVersion integer(1) identifier for the method configuration.
methodRepoMethod named list describing the method, with
character(1) elements described in the return value for avworkflows()
.
methodUri
sourceRepo
methodPath
methodVersion. The REST specification indicates that this has
type integer
, but the documentation indicates either
integer
or string
.
deleted logical(1) of uncertain purpose.
See the help page ?avworkflow
for discovering, running,
stopping, and retrieving outputs from workflows.
if (has_avworkspace(platform = gcp())) {
    ## set the namespace and name as appropriate
    avworkspace("bioconductor-rpci-anvil/Bioconductor-Workflow-DESeq2")

    ## discover available workflows in the workspace
    avworkflows()

    ## record the workflow of interest
    avworkflow("bioconductor-rpci-anvil/AnVILBulkRNASeq")

    ## what workflows are available?
    available_workflows <- avworkflows()

    ## retrieve the current configuration
    config <- avworkflow_configuration_get()
    config

    ## what are the inputs and outputs?
    inputs <- avworkflow_configuration_inputs(config)
    inputs

    outputs <- avworkflow_configuration_outputs(config)
    outputs

    ## update inputs or outputs, e.g., this input can be anything...
    inputs <-
        inputs |>
        dplyr::mutate(attribute = ifelse(
            name == "salmon.transcriptome_index_name",
            '"new_index_name"',
            attribute
        ))
    new_config <- avworkflow_configuration_update(config, inputs)
    new_config

    ## set the new configuration in AnVIL; use dry = FALSE to actually
    ## update the configuration
    avworkflow_configuration_set(new_config)
}

## avworkflow_configuration_template() is a utility function that may
## help understanding what the inputs and outputs should be
avworkflow_configuration_template() |>
    str()

avworkflow_configuration_template()
Methods for working with AnVIL workflow execution.
avworkflow_jobs()
returns a tibble summarizing submitted workflow jobs for
a namespace and name.
## S4 method for signature 'gcp'
avworkflow_jobs(
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    ...,
    platform = cloud_platform()
)
namespace |
|
name |
|
... |
Additional arguments passed to lower level functions (not used). |
platform |
|
avworkflow_jobs()
returns a tibble
, sorted by
submissionDate
, with columns
submissionId character() job identifier from the workflow runner.
submitter character() AnVIL user id of individual submitting the job.
submissionDate POSIXct() date (in local time zone) of job submission.
status character() job status, with values 'Accepted' 'Evaluating' 'Submitting' 'Submitted' 'Aborting' 'Aborted' 'Done'
succeeded integer() number of workflows succeeding.
failed integer() number of workflows failing.
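A table with these columns is easy to summarize with base R. The sketch below uses a hypothetical data.frame standing in for the return value of avworkflow_jobs(); the submission identifiers and counts are invented for illustration and only a subset of the documented columns is included.

```r
## Hypothetical stand-in for part of the avworkflow_jobs() return value
jobs <- data.frame(
    submissionId = c("id-1", "id-2", "id-3"),
    status = c("Done", "Aborted", "Done"),
    succeeded = c(2L, 0L, 1L),
    failed = c(0L, 0L, 1L),
    stringsAsFactors = FALSE
)

## Completed submissions that still contain failing workflows
done_with_failures <- jobs[jobs$status == "Done" & jobs$failed > 0L, ]

## Number of submissions per status
status_counts <- table(jobs$status)

print(done_with_failures)
print(status_counts)
```

A status of 'Done' does not by itself mean that every workflow in the submission succeeded, so checking the failed column is a reasonable first step before retrieving outputs.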
avworkflow_jobs(gcp)
: List workflow jobs in the workspace
library(AnVILBase)
if (has_avworkspace(strict = TRUE, platform = gcp()))
    ## from within AnVIL
    avworkflow_jobs()
avworkflows()
returns a tibble summarizing available
workflows.
avworkflow_files()
returns a tibble containing
information and file paths to workflow outputs.
avworkflow_localize()
creates or synchronizes a
local copy of files with files stored in the workspace bucket
and produced by the workflow.
avworkflow_run()
runs the workflow of the configuration.
avworkflow_stop()
stops the most recently submitted workflow
job from running.
avworkflow_info()
returns a tibble containing workflow
information, including workflowName, status, start and end time,
inputs and outputs.
avworkflows(namespace = avworkspace_namespace(), name = avworkspace_name())

avworkflow_files(
    submissionId = NULL,
    workflowId = NULL,
    bucket,
    namespace = avworkspace_namespace(),
    name = avworkspace_name()
)

avworkflow_localize(
    submissionId = NULL,
    workflowId = NULL,
    destination = NULL,
    type = c("control", "output", "all"),
    bucket = avstorage(),
    dry = TRUE
)

avworkflow_run(
    config,
    entityName,
    entityType = config$rootEntityType,
    deleteIntermediateOutputFiles = FALSE,
    useCallCache = TRUE,
    useReferenceDisks = FALSE,
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    dry = TRUE
)

avworkflow_stop(
    submissionId = NULL,
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    dry = TRUE
)

avworkflow_info(
    submissionId = NULL,
    namespace = avworkspace_namespace(),
    name = avworkspace_name()
)
namespace |
|
name |
|
submissionId |
a character() of workflow submission ids, or a
tibble with column |
workflowId |
a character(1) of internal identifier associated with one workflow in the submission, or NULL / missing. |
bucket |
character(1) DEFUNCT - name of the google bucket in
which the workflow products are available, as |
destination |
character(1) file path to the location where
files will be synchronized. For directories in the current
working directory, be sure to prepend with |
type |
character(1) copy |
dry |
logical(1) when |
config |
a |
entityName |
character(1) or NULL name of the set of samples to be used when running the workflow. NULL indicates that no sample set will be used. |
entityType |
character(1) or NULL type of root entity used for the workflow. NULL means that no root entity will be used. |
deleteIntermediateOutputFiles |
logical(1) whether or not to delete intermediate output files when the workflow completes. |
useCallCache |
logical(1) whether or not to read from cache for this submission. |
useReferenceDisks |
logical(1) whether or not to use pre-built
disks for common genome references. Default: |
For avworkflow_files()
, the submissionId
is the
identifier associated with the submission of one (or more)
workflows, and is present in the return value of
avworkflow_jobs()
; the example illustrates how the first row
of avworkflow_jobs()
(i.e., the most recently completed
workflow) can be used as input to avworkflow_files()
. When
submissionId
is not provided, the return value is for the
most recently submitted workflow of the namespace and name of
avworkspace()
.
avworkflow_localize()
. type = "control"
files
summarize workflow progress; they can be numerous but are
frequently small and quickly synchronized. type = "output"
files are the output products of the workflow stored in the
workspace bucket. Depending on the workflow, outputs may be
large, e.g., aligned reads in bam files. See avcopy()
to
copy individual files from the bucket to the local drive.
avworkflow_localize()
treats submissionId=
in the same way as
avworkflow_files()
: when missing, files from the most recent
workflow job are candidates for localization.
avworkflows()
returns a tibble. Each workflow is in a
'namespace' and has a 'name', as illustrated in the
example. Columns are
name: workflow name.
namespace: workflow namespace (often the same as the workspace namespace).
rootEntityType: name of the avtable()
used to retrieve inputs.
methodRepoMethod.methodUri: source of the method, e.g., a dockstore URI.
methodRepoMethod.sourceRepo: source repository, e.g., dockstore.
methodRepoMethod.methodPath: path to method, e.g., a dockstore method might reference a github repository.
methodRepoMethod.methodVersion: the version of the method, e.g., 'main' branch of a github repository.
avworkflow_files()
returns a tibble with columns
file: character() 'base name' of the file in the bucket.
workflow: character() name of the workflow the file is associated with.
task: character() name of the task in the workflow that generated the file.
path: character() full path to the file in the google bucket.
submissionId: character() internal identifier associated with the submission the files belong to.
workflowId: character() internal identifier associated with each workflow (e.g., row of an avtable() used as input) in the submission.
submissionRoot: character() path in the workspace bucket to the root of files created by this submission.
namespace: character() AnVIL workspace namespace (billing account) associated with the submissionId.
name: character(1) AnVIL workspace name associated with the submissionId.
avworkflow_localize()
prints a message indicating the
number of files that are (if dry = FALSE
) or would be
localized. If no files require localization (i.e., local files
are not older than the bucket files), then no files are
localized. avworkflow_localize()
returns a tibble of file
name and bucket path of files to be synchronized.
avworkflow_run()
returns config
, invisibly.
avworkflow_stop()
returns (invisibly) TRUE
on
successfully requesting that the workflow stop, FALSE
if the
workflow is already aborting, aborted, or done.
avworkflow_info()
returns a tibble with columns:
submissionId, workflowId, workflowName, status, start, end,
inputs and outputs.
library(AnVILBase)
library(dplyr)   # provides %>%, select(), and pull() used below

if (has_avworkspace(strict = TRUE, platform = gcp()))
    ## from within AnVIL
    avworkflows() %>% select(namespace, name)

if (has_avworkspace(strict = TRUE, platform = gcp())) {
    ## e.g., from within AnVIL
    jobs <- avworkflow_jobs()
    if (nrow(jobs)) {
        jobs |>
            ## select most recent workflow
            head(1) |>
            ## find paths to output and log files on the bucket
            avworkflow_files()
    }
}

if (has_avworkspace(strict = TRUE, platform = gcp()))
    avworkflow_localize(dry = TRUE)

if (has_avworkspace(strict = TRUE, platform = gcp()) && interactive()) {
    entityName <- avtable("participant_set") |>
        pull(participant_set_id) |>
        head(1)
    ## new_config is the updated configuration from the
    ## avworkflow_configuration_update() example
    avworkflow_run(new_config, entityName)
}

if (has_avworkspace(strict = TRUE, platform = gcp()) && interactive()) {
    avworkflow_stop()
}

if (has_avworkspace(strict = TRUE, platform = gcp()))
    avworkflow_info()
avworkspace_namespace()
and avworkspace_name()
are utility
functions to retrieve workspace namespace and name from environment
variables or interfaces usually available in AnVIL notebooks or RStudio
sessions. avworkspace()
provides a convenient way to specify workspace
namespace and name in a single command. avworkspace_clone()
clones
(copies) an existing workspace, possibly into a new namespace (billing
account).
## S4 method for signature 'gcp'
avworkspaces(..., platform = cloud_platform())

## S4 method for signature 'gcp'
avworkspace_namespace(
    namespace = NULL,
    warn = TRUE,
    ...,
    platform = cloud_platform()
)

## S4 method for signature 'gcp'
avworkspace_name(name = NULL, warn = TRUE, ..., platform = cloud_platform())

## S4 method for signature 'gcp'
avworkspace(workspace = NULL, ..., platform = cloud_platform())

## S4 method for signature 'gcp'
avworkspace_clone(
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    to_namespace = namespace,
    to_name,
    storage_region = "US",
    bucket_location,
    ...,
    platform = cloud_platform()
)
... |
additional arguments passed as-is to the |
platform |
|
namespace |
|
warn |
logical(1) when |
name |
|
workspace |
when present, a |
to_namespace |
character(1) workspace (billing account) in which to make the clone. |
to_name |
character(1) name of the cloned workspace. |
storage_region |
character(1) region (NO multi-region, except the default) in which bucket attached to the workspace should be created. |
bucket_location |
character(1) DEFUNCT; use |
avworkspace_namespace()
is the billing account. If the
namespace=
argument is not provided, try gcloud_project()
,
and if that fails try Sys.getenv("WORKSPACE_NAMESPACE")
.
avworkspace_name()
is the name of the workspace as it appears in
https://app.terra.bio/#workspaces. If not provided,
avworkspace_name()
tries to use Sys.getenv("WORKSPACE_NAME")
.
Namespace and name values are cached across sessions, so explicitly
providing avworkspace_name*()
is required at most once per
session. Revert to system settings with arguments NA
.
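The lookup order described above for the namespace (explicit argument, then the gcloud project, then the `WORKSPACE_NAMESPACE` environment variable) can be sketched in base R. This is only an illustration of the documented order, not the package's implementation: `resolve_namespace()` and its `project_lookup` argument are hypothetical names, with `project_lookup` standing in for `gcloud_project()`.

```r
## Hypothetical helper mimicking the documented lookup order for the namespace.
## `project_lookup` stands in for gcloud_project(); by default it fails, as it
## would on a machine without a configured gcloud.
resolve_namespace <- function(
    namespace = NULL,
    project_lookup = function() stop("gcloud not available")
) {
    ## 1. an explicitly provided namespace wins
    if (!is.null(namespace))
        return(namespace)
    ## 2. otherwise try the gcloud project
    value <- tryCatch(project_lookup(), error = function(e) NULL)
    ## 3. fall back to the environment variable
    if (is.null(value) || !nzchar(value))
        value <- Sys.getenv("WORKSPACE_NAMESPACE")
    value
}

Sys.setenv(WORKSPACE_NAMESPACE = "env-namespace")  # made-up value

resolve_namespace("explicit-namespace")
resolve_namespace(project_lookup = function() "my-project")
resolve_namespace()
```

The three calls exercise the three branches in turn: the explicit argument, the stand-in gcloud project, and the environment variable.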
avworkspace_namespace()
, and avworkspace_name()
return
character(1)
identifiers. avworkspace()
returns the
character(1) concatenated namespace and name. The value
returned by avworkspace_name()
will be percent-encoded (e.g.,
spaces " "
replaced by "%20"
).
avworkspace_clone()
returns the namespace and name, in
the format namespace/name
, of the cloned workspace.
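To make the return formats concrete, the following sketch shows the `namespace/name` form and what percent-encoding does to a name containing a space. The namespace and workspace name are invented, and `utils::URLencode()` is used here only as a convenient stand-in; the encoder used internally by the package may differ.

```r
namespace <- "my-billing-project"   # hypothetical namespace
name <- "My Workspace"              # hypothetical workspace name with a space

## the 'namespace/name' format
workspace <- paste(namespace, name, sep = "/")
workspace

## percent-encoding replaces the space with "%20"
encoded_name <- utils::URLencode(name)
encoded_name

## decoding recovers the name as displayed in the workspace listing
utils::URLdecode(encoded_name)
```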
avworkspaces(gcp)
: list workspaces in the current project as a
tibble
avworkspace_namespace(gcp)
: Get or set the namespace of the current
workspace
avworkspace_name(gcp)
: Get or set the name of the current workspace
avworkspace(gcp)
: Get the current workspace namespace and name
combination
avworkspace_clone(gcp)
: Clone the current workspace
if (has_avworkspace(platform = gcp())) {
    avworkspaces()
    avworkspace_namespace()
    avworkspace_name()
    avworkspace()
}
drs_hub()
resolves zero or more DRS URLs to their Google
bucket location using the DRS Hub API endpoint.
drs_hub(source = character())
source |
|
drs_hub()
returns a tbl with the following columns:
drs: character() DRS URIs.
bucket: character() Google cloud bucket.
name: character() object name in bucket.
size: numeric() object size in bytes.
timeCreated: character() object creation time.
timeUpdated: character() object update time.
fileName: character() local file name.
accessUrl: character() signed URL for object access.
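The `bucket` and `name` columns together identify an object in Google Cloud Storage. As a small sketch, and assuming the usual `gs://bucket/object` URI convention, they could be combined as below. The data frame is a stand-in with placeholder values, not output from `drs_hub()`.

```r
## Stand-in for part of a drs_hub() result; all values are placeholders.
resolved <- data.frame(
    drs = "drs://example.org/object-1",
    bucket = "example-bucket",
    name = "path/to/object.txt",
    stringsAsFactors = FALSE
)

## build a gs:// URI from the bucket and the object name
gs_uri <- paste0("gs://", resolved$bucket, "/", resolved$name)
gs_uri
```

A URI of this form is what functions such as `avcopy()` take as `source`.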
drs_hub()
uses the DRS Hub API endpoint to resolve a single or multiple DRS
URLs to their Google bucket location. The DRS Hub API endpoint requires a
gcloud_access_token()
. The DRS Hub API service is hosted at
https://drshub.dsde-prod.broadinstitute.org.
if (gcloud_exists() && interactive()) {
    drs_urls <- c(
        "drs://drs.anv0:v2_b3b815c7-b012-37b8-9866-1cb44b597924",
        "drs://drs.anv0:v2_2823eac3-77ae-35e4-b674-13dfab629dc5",
        "drs://drs.anv0:v2_c6077800-4562-30e3-a0ff-aa03a7e0e24f"
    )
    drs_hub(drs_urls)
}
These functions invoke the gcloud
command line
utility. See gsutil for details on how gcloud
is
located.
gcloud_exists()
tests whether the gcloud()
command
can be found on this system. After finding the binary location,
it runs gcloud version
to identify potentially misconfigured
installations. See 'Details' section of gsutil
for where the
application is searched.
gcloud_account()
: report the current gcloud account
via gcloud config get-value account
.
gcloud_project()
: report the current gcloud project
via gcloud config get-value project
.
gcloud_help()
: queries gcloud
for help for a
command or sub-command via gcloud help ...
.
gcloud_cmd()
allows arbitrary gcloud
command
execution via gcloud ...
. Use pre-defined functions in
preference to this.
gcloud_storage()
allows arbitrary gcloud storage
command
execution via gcloud storage ...
. Typically used for bucket management
commands such as rm
and cp
.
gcloud_storage_buckets()
provides an interface to the
gcloud storage buckets
command. This command can be used to create a new
bucket via gcloud storage buckets create ...
.
gcloud_exists()

gcloud_account(account = NULL)

gcloud_project(project = NULL)

gcloud_help(...)

gcloud_cmd(cmd, ...)

gcloud_storage(cmd, ...)

gcloud_storage_buckets(bucket_cmd = "create", bucket, ...)
account |
character(1) Google account (e.g., |
project |
character(1) billing project name. |
... |
Additional arguments appended to |
cmd |
|
bucket_cmd |
|
bucket |
|
gcloud_exists()
returns TRUE
when the gcloud
application can be found, FALSE otherwise.
gcloud_account()
returns a character(1)
vector
containing the active gcloud account, typically a gmail email
address.
gcloud_project()
returns a character(1)
vector
containing the active gcloud project.
gcloud_help()
returns an unquoted character()
vector
representing the text of the help manual page returned by
gcloud help ...
.
gcloud_cmd()
returns a character()
vector representing
the text of the output of gcloud cmd ...
gcloud_exists()

if (has_avworkspace(platform = gcp()))
    gcloud_account()

if (has_avworkspace(platform = gcp()))
    gcloud_help()
gcloud_access_token()
generates a token for the given service
account. The token is cached for the duration of its validity. The token is
refreshed when it expires. The token is obtained using the gcloud
command
line utility for the given gcloud_account()
. The function is mainly used
internally by API service functions, e.g., AnVIL::Terra().
gcloud_access_token(service)
service |
character(1) The name of the service, e.g. "terra", for which to obtain an access token. |
gcloud_access_token()
returns a simple token string to be used with
the given service.
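The caching behavior described above (reuse the token while it is valid, refresh once it expires) can be illustrated with a closure. This is a sketch under stated assumptions rather than the package's code: `make_token_cache()`, `fetch`, and `lifetime_secs` are hypothetical names, and `fetch` stands in for whatever call obtains a token from the gcloud command line utility.

```r
## Hypothetical token cache; `fetch` stands in for a call to the gcloud CLI.
make_token_cache <- function(fetch, lifetime_secs = 3600) {
    token <- NULL
    expires <- as.POSIXct(0, origin = "1970-01-01", tz = "UTC")
    function(now = Sys.time()) {
        ## fetch a new token when none is cached or the cached one has expired
        if (is.null(token) || now >= expires) {
            token <<- fetch()
            expires <<- now + lifetime_secs
        }
        token
    }
}

## count how often the (fake) token source is consulted
n_fetches <- 0L
get_token <- make_token_cache(
    fetch = function() {
        n_fetches <<- n_fetches + 1L
        paste0("token-", n_fetches)
    },
    lifetime_secs = 60
)

t0 <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC")
get_token(now = t0)        # no cached token yet, so one is fetched
get_token(now = t0 + 30)   # still valid, served from the cache
get_token(now = t0 + 61)   # expired, fetched again
n_fetches
```

Passing `now` explicitly keeps the example deterministic; a real cache would rely on the current time.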
if (has_avworkspace(platform = gcp()))
    gcloud_access_token("rawls") |>
        httr2::obfuscate()
avcopy()
: copy contents of source
to destination
. At
least one of source
or destination
must be a Google Cloud bucket;
source
can be a character vector with length greater than 1. Use
gsutil_help("cp")
for gsutil
help.
avlist()
: List contents of a google cloud bucket or, if source
is
missing, all Cloud Storage buckets under your default project ID.
avremove()
: remove contents of a Google Cloud Bucket.
avbackup()
,avrestore()
: synchronize a source and a destination. If the
destination is on the local file system, it must be a directory or not yet
exist (in which case a directory will be created).
avstorage()
returns the workspace bucket, i.e., the google bucket
associated with a workspace. Bucket content can be visualized under the
'DATA' tab, 'Files' item.
avworkspaces()
: returns a tibble with columns including the name, last
modification time, namespace, and owner status.
avtable_import()
: returns a tibble()
containing the page number, 'from'
and 'to' rows included in the page, job identifier, initial status of the
uploaded 'chunks', and any (error) messages generated during status check.
Use avtable_import_status()
to query current status.
## S4 method for signature 'gcp'
avcopy(
    source,
    destination,
    ...,
    recursive = FALSE,
    parallel = TRUE,
    platform = cloud_platform()
)

## S4 method for signature 'gcp'
avlist(
    source = character(),
    recursive = FALSE,
    ...,
    platform = cloud_platform()
)

## S4 method for signature 'gcp'
avremove(
    source,
    recursive = FALSE,
    force = FALSE,
    parallel = TRUE,
    ...,
    platform = cloud_platform()
)

## S4 method for signature 'gcp'
avbackup(
    source,
    destination,
    recursive = FALSE,
    exclude = NULL,
    dry = TRUE,
    delete = FALSE,
    parallel = TRUE,
    ...,
    platform = cloud_platform()
)

## S4 method for signature 'gcp'
avrestore(
    source,
    destination,
    recursive = FALSE,
    exclude = NULL,
    dry = TRUE,
    delete = FALSE,
    parallel = TRUE,
    ...,
    platform = cloud_platform()
)

## S4 method for signature 'gcp'
avstorage(
    namespace = avworkspace_namespace(),
    name = avworkspace_name(),
    ...,
    platform = cloud_platform()
)
source |
|
destination |
|
... |
additional arguments passed as-is to the |
recursive |
|
parallel |
|
platform |
|
force |
|
exclude |
|
dry |
|
delete |
|
namespace |
|
name |
|
avbackup(): To make "gs://mybucket/data" match the contents of the local
directory "data" you could do:
avbackup("data", "gs://mybucket/data", delete = TRUE)
To make the local directory "data" the same as the contents of gs://mybucket/data:
avrestore("gs://mybucket/data", "data", delete = TRUE)
If destination
is a local path and does not exist, it will be
created.
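What `delete = TRUE` means for a synchronization can be pictured with plain set operations on file names. The sketch below is conceptual only: the file names are invented, and it ignores files that exist on both sides but differ in content, which a real synchronization would also copy.

```r
source_files <- c("a.txt", "b.txt")            # hypothetical contents of "data"
destination_files <- c("b.txt", "stale.txt")   # hypothetical contents of the bucket

## files present only in the source are copied to the destination
to_copy <- setdiff(source_files, destination_files)
to_copy

## files present only in the destination are removed, but only when
## delete = TRUE; with delete = FALSE they are left in place
to_delete <- setdiff(destination_files, source_files)
to_delete
```

Both `avbackup()` and `avrestore()` default to `dry = TRUE` in the usage above, so a call would presumably need `dry = FALSE` before anything is actually copied or deleted.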
avcopy()
: exit status of avcopy()
, invisibly.
avlist()
: character()
listing of source
content.
avremove()
: exit status of gsutil rm
, invisibly.
avbackup()
: exit status of gsutil rsync
, invisibly.
avrestore()
: exit status of gsutil rsync
, invisibly.
avstorage()
returns a character(1)
bucket identifier prefixed with
gs://
avcopy(gcp)
: copy contents of source
to destination
with
gsutil
avlist(gcp)
: list contents of source
with gsutil
avremove(gcp)
: remove contents of source
with gsutil
avbackup(gcp)
: backup contents of source
with gsutil
avrestore(gcp)
: restore contents of source
with gsutil
avstorage(gcp)
: get the storage bucket location
src <-
    "gs://genomics-public-data/1000-genomes/other/sample_info/sample_info.csv"

if (has_avworkspace(platform = gcp())) {
    avcopy(src, tempdir())
    ## internal gsutil_*() commands work with spaces in source or destination
    destination <- file.path(tempdir(), "foo bar")
    avcopy(src, destination)
    file.exists(destination)
}

if (has_avworkspace(strict = TRUE, platform = gcp()))
    ## From within AnVIL...
    bucket <- avstorage() # discover bucket

if (has_avworkspace(strict = TRUE, platform = gcp()) && interactive()) {
    path <- file.path(bucket, "mtcars.tab")
    avlist(dirname(path)) # no 'mtcars.tab'...
    write.table(mtcars, gsutil_pipe(path, "w")) # write to bucket
    gsutil_stat(path) # yep, there!
    read.table(gsutil_pipe(path, "r")) # read from bucket
}
These functions invoke the gsutil
command line
utility. See the "Details:" section if you have gsutil
installed but the package cannot find it.
gsutil_requesterpays()
: does the google bucket
require that the requester pay for access?
gsutil_exists()
: check if the bucket or object
exists.
gsutil_stat()
: print, as a side effect, the status
of a bucket, directory, or file.
gsutil_rsync()
: synchronize a source and a
destination. If the destination is on the local file system, it
must be a directory or not yet exist (in which case a directory
will be created).
gsutil_cat()
: concatenate bucket objects to standard output
gsutil_help()
: print 'man' page for the gsutil
command or subcommand. Note that only commands documented on this
R help page are supported.
gsutil_pipe()
: create a pipe to read from or write
to a google bucket object.
gsutil_requesterpays(source)

gsutil_exists(source)

gsutil_stat(source)

gsutil_rsync(
    source,
    destination,
    ...,
    exclude = NULL,
    dry = TRUE,
    delete = FALSE,
    recursive = FALSE,
    parallel = TRUE
)

gsutil_cat(source, ..., header = FALSE, range = integer())

gsutil_help(cmd = character(0))

gsutil_pipe(source, open = "r", ...)
source |
|
destination |
|
... |
additional arguments passed as-is to the |
exclude |
|
dry |
|
delete |
|
recursive |
|
parallel |
|
header |
|
range |
(optional) |
cmd |
|
open |
|
The gsutil
system command is required. The search for
gsutil
starts with environment variable GCLOUD_SDK_PATH
providing a path to a directory containing a bin
directory
containing gsutil
, gcloud
, etc. The path variable is
searched for first as an option()
and then system
variable. If no option or global variable is found,
Sys.which()
is tried. If that fails, gsutil
is searched for
on defined paths. On Windows, the search tries to find
Google\\Cloud SDK\\google-cloud-sdk\\bin\\gsutil.cmd
in the
LOCAL APP DATA
, Program Files
, and Program Files (x86)
directories. On linux / macOS, the search continues with
~/google-cloud-sdk
.
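The search order described above can be sketched as a small base R function. This is an approximation for illustration, not the package's implementation: `find_tool()` is a hypothetical name, the Windows-specific locations are omitted, and the demonstration uses an empty stand-in file in a temporary directory rather than a real SDK.

```r
## Hypothetical sketch of the documented lookup order (Windows paths omitted).
find_tool <- function(tool = "gsutil") {
    ## 1. GCLOUD_SDK_PATH as an option, then as an environment variable
    sdk <- getOption("GCLOUD_SDK_PATH", Sys.getenv("GCLOUD_SDK_PATH"))
    if (nzchar(sdk)) {
        candidate <- file.path(sdk, "bin", tool)
        if (file.exists(candidate))
            return(candidate)
    }
    ## 2. the system search path
    on_path <- unname(Sys.which(tool))
    if (nzchar(on_path))
        return(on_path)
    ## 3. a conventional install location on linux / macOS
    candidate <- file.path(path.expand("~/google-cloud-sdk"), "bin", tool)
    if (file.exists(candidate)) candidate else ""
}

## Demonstration with a fake SDK directory containing an empty 'gsutil' file
sdk <- file.path(tempdir(), "fake-sdk")
dir.create(file.path(sdk, "bin"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(sdk, "bin", "gsutil"))
Sys.setenv(GCLOUD_SDK_PATH = sdk)

find_tool("gsutil")
```

Setting `GCLOUD_SDK_PATH` in this way is the documented remedy when gsutil is installed but the package cannot find it.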
gsutil_rsync(): To make "gs://mybucket/data" match the contents of the local
directory "data" you could do:
gsutil_rsync("data", "gs://mybucket/data", delete = TRUE)
To make the local directory "data" the same as the contents of gs://mybucket/data:
gsutil_rsync("gs://mybucket/data", "data", delete = TRUE)
If destination
is a local path and does not exist, it will be
created.
gsutil_requesterpays()
: named logical()
vector TRUE
when requester-pays is enabled.
gsutil_exists()
: logical(1) TRUE if bucket or object exists.
gsutil_stat()
: tibble()
summarizing status of each
bucket member.
gsutil_rsync()
: exit status of gsutil_rsync()
, invisibly.
gsutil_cat()
returns the content as a character vector.
gsutil_help()
: character()
help text for subcommand cmd
.
gsutil_pipe()
: an unopened R pipe()
; the mode is
not specified, and the pipe must be used in the
appropriate context (e.g., a pipe created with open = "r"
as input to read.csv()
).
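The idea of an unopened pipe can be seen with base R's `pipe()` alone. In this sketch a `printf` shell command stands in for the gsutil process (this assumes a POSIX shell and is unrelated to how `gsutil_pipe()` builds its command); the consuming function opens the connection, reads from it, and closes it.

```r
## An unopened connection: the command does not run until a reader opens it.
## 'printf' is a stand-in for the external process (POSIX shell assumed).
con <- pipe("printf 'x,y\\n1,2\\n3,4\\n'")

## read.csv() opens the pipe for reading, consumes the output, and closes it
df <- read.csv(con)
df
dim(df)
```

A pipe intended for writing would instead be handed to a function such as `write.table()`.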
src <-
    "gs://genomics-public-data/1000-genomes/other/sample_info/sample_info.csv"

if (has_avworkspace(platform = gcp()))
    gsutil_requesterpays(src) # FALSE -- no cost download

if (has_avworkspace(platform = gcp())) {
    gsutil_exists(src)
    gsutil_stat(src)
    avlist(dirname(src))
}

if (has_avworkspace(platform = gcp()))
    gsutil_help("ls")

if (has_avworkspace(platform = gcp())) {
    df <- read.csv(gsutil_pipe(src), 5L)
    class(df)
    dim(df)
    head(df)
}
has_avworkspace()
checks that the AnVIL environment is set up
to work with GCP. If strict = TRUE
, it also checks that the workspace
name is set.
## S4 method for signature 'gcp'
has_avworkspace(strict = FALSE, ..., platform = cloud_platform())
strict |
|
... |
Arguments passed to the methods. |
platform |
A Platform derived class indicating the AnVIL environment,
currently, |
logical(1)
TRUE
if the AnVIL environment is set up properly to
interact with GCP, otherwise FALSE
.
has_avworkspace(gcp)
: Check if the AnVIL environment is set up
has_avworkspace(platform = gcp())
localize()
: recursively synchronizes files from a
Google storage bucket (source
) to the local file system
(destination
). This command acts recursively on the source
directory, and does not delete files in destination
that are
not in source.
delocalize()
: synchronize files from a local file
system (source
) to a Google storage bucket
(destination
). This command acts recursively on the source
directory, and does not delete files in destination
that are
not in source
.
localize(source, destination, dry = TRUE)

delocalize(source, destination, unlink = FALSE, dry = TRUE)
source |
|
destination |
|
dry |
|
unlink |
|
localize()
: exit status of function gsutil_rsync()
.
delocalize()
: exit status of function gsutil_rsync()