The Global Alliance for Genomics and Health (GA4GH) was formed to help accelerate the potential of genomic medicine to advance human health. It brings together over 400 leading institutions working in healthcare, research, disease advocacy, life science, and information technology. The Data Working Group of the GA4GH developed data model schemas and application program interfaces (APIs) for genomic data. These APIs are specifically designed to allow sharing of genomics data in a standardized manner and without having to exchange complete experiments. They developed a reference implementation for these APIs providing a web server for hosting genomic data.
The Bioconductor project provides tools for the analysis and comprehension of high-throughput genomic data. It uses the R statistical programming language, and is open source and open development. The Bioconductor provides stable representations for genomics data such as biological sequences and genomic variants.
We developed GA4GHclient, a Bioconductor package that provides an easy way to access public data servers through GA4GH APIs. The requested output data are converted into tidy data and presented as data frames. Our package provides methods for converting genomic variant data into VCF data allowing the creation of complex data analysis using public genomic data and well validated Bioconductor packages. This package also provides an interactive web application for interacting with genomics data through GA4GH APIs.
Public data servers that use GA4GH Genomics API:
The table below shows all available methods in GA4GHclient package. These methods are based on the official GA4GH schemas. Let us know if some method is missing by opening an issue at https://github.com/labbcb/GA4GHclient/issues.
Method | Description |
---|---|
searchDatasets | Search for datasets |
searchReferenceSets | Search for reference sets (reference genomes) |
searchReferences | Search for references (genome sequences, e.g. chromosomes) |
listReferenceBases | Get the sequence bases of a reference genome by genomic range |
searchVariantSets | Search for for variant sets (VCF files) |
searchVariants | Search for variants by genomic ranges (lines of VCF files) |
searchCallSets | Search for call sets (sample columns of VCF files) |
searchVariantAnnotationSets | Search for variant annotation sets (annotated VCF files) |
searchVariantAnnotations | Search for annotated variants by genomic range |
searchFeatureSets | Search for feature sets (genomic features, e.g. GFF files) |
searchFeatures | Search for features (lines of genomic feature files) |
searchReadGroupSets | Search for read group sets (sequence alignement, e.g BAM files) |
searchReads | Search for reads by genomic range (bases of aligned sequences) |
searchBiosamples | Search for biosamples |
searchIndividuals | Search for individuals |
searchRnaQuantificationSets | Search for RNA quantification sets |
searchRnaQuantifications | Search for RNA quantifications |
searchExpressionLevels | Search for expression levels |
searchPhenotypeAssociationSets | Search for phenotype association sets |
searchPhenotypeAssociations | Search for phenotype associations |
getDataset | Get a dataset by its ID |
getReferenceSet | Get a reference set by its ID |
getReference | Get a reference by its ID |
getVariantSet | Get a variant set by its ID |
getVariant | Get a variant by its ID with all call sets for this variant |
getCallSet | Get a call set by its ID |
getVariantAnnotationSet | Get a variant annotation set by its ID |
getVariantAnnotation | Get an annotated variant by its ID |
getFeatureSet | Get a feature set by its ID |
getFeature | Get a feature by its ID |
getReadGroupSet | Get a read group set by its ID |
getBiosample | Get a biosample by its ID |
getIndividual | Get an individual by its ID |
getRnaQuantificationSet | Get an RNA quantification set by its ID |
getRnaQuantification | Get an RNA quantification by its ID |
getExpressionLevel | Get an expression level by its ID |
First, load required packages for this vignette.
library(GA4GHclient)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(VariantAnnotation)
For this example we will use Hosting Thousand Genomes Project server. It contains data from the 1000 Genomes project. Set the URL of GA4GH API data server. This variable will be used by all request functions.
host <- "http://1kgenomes.ga4gh.org/"
Get basic information about data server. The dataset may contain many variant sets. A variant set represents the header of an VCF file. A collection of variants represent lines of an VCF file. The nrow
attribute define the amount of entries we want to get from the server. For the dataset and variant set we want only the first entry.
The searchVariants
function requests genomic variant data from the server. It uses genomic intervals to retrieve these varants. As R programming language, the genomic indice is 1-based. If you want to see what reference names (e.g. chromosomes) are available run the searchReferences
function.
datasets <- searchDatasets(host, nrows = 1)
datasetId <- datasets$id
variantSets <- searchVariantSets(host, datasetId, nrows = 1)
variantSetId <- variantSets$id
variants <- searchVariants(host, variantSetId, referenceName = "1",
start = 15000, end = 16000, nrows = 10)
variants
## class: CollapsedVCF
## dim: 10 0
## rowRanges(vcf):
## GRanges with 3 metadata columns: ID, REF, ALT
## info(vcf):
## DataFrame with 14 columns: EUR_AF, SAS_AF, AC, AA, AF, AFR_AF, AMR_AF...
## info(header(vcf)):
## Number Type Description
## EUR_AF A Float Allele frequency in the EUR populations...
## SAS_AF A Float Allele frequency in the SAS populations...
## AC A Integer Total number of alternate alleles in ca...
## AA 1 String Ancestral Allele. Format: AA|REF|ALT|In...
## AF A Float Estimated allele frequency in the range...
## AFR_AF A Float Allele frequency in the AFR populations...
## AMR_AF A Float Allele frequency in the AMR populations...
## AN 1 Integer Total number of alleles in called genot...
## VT . String indicates what type of variant the line...
## EAS_AF A Float Allele frequency in the EAS populations...
## NS 1 Integer Number of samples with data
## EX_TARGET 0 Flag indicates whether a variant is within t...
## DP 1 Integer Total read depth; only low coverage dat...
## MULTI_ALLELIC 0 Flag indicates whether a site is multi-allelic
## geno(vcf):
## SimpleList of length 0:
Almost search functions will return an DataFrame
object from S4Vector package. The searchVariants
and getVariant
functions will return an VCF
object with header from VariantAnnotation package. The getVariantSet
function will return an VCFHeader
object. These three functions will return DataFrame
object when asVCF
or asVCFHeader
parameters be FALSE
.
variants.df <- searchVariants(host, variantSetId, referenceName = "1",
start = 15000, end = 16000, nrows = 10, asVCF = FALSE)
DT::datatable(as.data.frame(variants.df), options = list(scrollX = TRUE))
Due to internet connection, there is a limit of amount for response data imposed by the server (or the responseSize
parameter). By default the function will make requests until get all response data. Increasing the value of the responseSize
parameter will reduce the number os requests to server.
Bioconductor has annotation packages for many genomes. For example, the TxDb.Hsapiens.UCSC.hg19.knownGene pakage exposes an annotation databases generated from UCSC by exposing these as TxDb objects. Before using annotation packages, it is very important to check what version of reference genome was used by the GA4GH API-based data server. In other words, what reference genome was used to align sequencing reads and call for variants. This information can be accessed via searchReferenceSets
function.
referenceSets <- searchReferenceSets(host, nrows = 1)
referenceSets$name
In this case, the version of reference genome is NCBI37, which is compatible with TxDb.Hsapiens.UCSC.hg19.knownGene.
Get name of all available genes.
head(keys(org.Hs.eg.db, keytype = "SYMBOL"))
## [1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "NATP"
Get genomic location of all genes.
head(genes(TxDb.Hsapiens.UCSC.hg19.knownGene))
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1 chr19 [ 58858172, 58874214] - | 1
## 10 chr8 [ 18248755, 18258723] + | 10
## 100 chr20 [ 43248163, 43280376] - | 100
## 1000 chr18 [ 25530930, 25757445] - | 1000
## 10000 chr1 [243651535, 244006886] - | 10000
## 100008586 chrX [ 49217763, 49233491] + | 100008586
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Use case: search for variants located in SCN1A gene.
df <- select(org.Hs.eg.db, keys = "SCN1A", columns = c("SYMBOL", "ENTREZID"), keytype = "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
gr <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene, filter=list(gene_id=df$ENTREZID))
seqlevelsStyle(gr) <- "NCBI"
variants.scn1a <- searchVariantsByGRanges(host, variantSetId, gr, asVCF = FALSE)
DT::datatable(as.data.frame(variants.scn1a[[1]]), options = list(scrollX = TRUE))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html
Use case: locate annotation data for a group of variants located at SCN1A gene.
variants.scn1a.gr <- makeGRangesFromDataFrame(variants.scn1a[[1]],
seqnames.field = "referenceName")
seqlevelsStyle(variants.scn1a.gr) <- "UCSC"
locateVariants(variants.scn1a.gr, TxDb.Hsapiens.UCSC.hg19.knownGene,
CodingVariants())
## 'select()' returned many:1 mapping between keys and columns
## GRanges object with 308 ranges and 9 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART
## <Rle> <IRanges> <Rle> | <factor> <integer>
## 1 chr2 [166847834, 166847834] - | coding 5867
## 2 chr2 [166847834, 166847834] - | coding 5918
## 3 chr2 [166847834, 166847834] - | coding 5951
## 4 chr2 [166847921, 166847921] - | coding 5780
## 5 chr2 [166847921, 166847921] - | coding 5831
## ... ... ... ... . ... ...
## 304 chr2 [166930064, 166930064] - | coding 68
## 305 chr2 [166930064, 166930064] - | coding 68
## 306 chr2 [166930078, 166930078] - | coding 54
## 307 chr2 [166930078, 166930078] - | coding 54
## 308 chr2 [166930078, 166930078] - | coding 54
## LOCEND QUERYID TXID CDSID GENEID
## <integer> <integer> <character> <IntegerList> <character>
## 1 5867 51 12221 37217 6323
## 2 5918 51 12222 37217 6323
## 3 5951 51 12223 37217 6323
## 4 5780 52 12221 37217 6323
## 5 5831 52 12222 37217 6323
## ... ... ... ... ... ...
## 304 68 2159 12222 37244 6323
## 305 68 2159 12223 37244 6323
## 306 54 2160 12221 37244 6323
## 307 54 2160 12222 37244 6323
## 308 54 2160 12223 37244 6323
## PRECEDEID FOLLOWID
## <CharacterList> <CharacterList>
## 1
## 2
## 3
## 4
## 5
## ... ... ...
## 304
## 305
## 306
## 307
## 308
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The searchVariants
and getVariant
functions return VCF
objects. These VCF
objects contains the VCF header. This is an example of how to get variants with call data. We can get calls running the searchCallSets
function. After convertion, it can be written into disk as VCF file using writeVcf
.
callSetIds <- searchCallSets(host, variantSetId, nrows = 5)$id
vcf <- searchVariants(host, variantSetId, referenceName = "1",
start = 15000, end = 16000, callSetIds = callSetIds, nrows = 10)
vcf
## class: CollapsedVCF
## dim: 10 5
## rowRanges(vcf):
## GRanges with 3 metadata columns: ID, REF, ALT
## info(vcf):
## DataFrame with 14 columns: EUR_AF, SAS_AF, AC, AA, AF, AFR_AF, AMR_AF...
## info(header(vcf)):
## Number Type Description
## EUR_AF A Float Allele frequency in the EUR populations...
## SAS_AF A Float Allele frequency in the SAS populations...
## AC A Integer Total number of alternate alleles in ca...
## AA 1 String Ancestral Allele. Format: AA|REF|ALT|In...
## AF A Float Estimated allele frequency in the range...
## AFR_AF A Float Allele frequency in the AFR populations...
## AMR_AF A Float Allele frequency in the AMR populations...
## AN 1 Integer Total number of alleles in called genot...
## VT . String indicates what type of variant the line...
## EAS_AF A Float Allele frequency in the EAS populations...
## NS 1 Integer Number of samples with data
## EX_TARGET 0 Flag indicates whether a variant is within t...
## DP 1 Integer Total read depth; only low coverage dat...
## MULTI_ALLELIC 0 Flag indicates whether a site is multi-allelic
## geno(vcf):
## SimpleList of length 1: GT
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
header(vcf)
## class: VCFHeader
## samples(0):
## meta(0):
## fixed(0):
## info(27): CIEND CIPOS ... EX_TARGET MULTI_ALLELIC
## geno(1): GT
The vcf
object works as expected. We can get the genotype data for example. More information about working with VCF data see vignette("VariantAnnotation")
.
geno(vcf)$GT
## HG00096 HG00097 HG00099 HG00100 HG00101
## [1,] "0|0" "0|0" "0|1" "0|0" "0|0"
## [2,] "0|0" "0|0" "0|1" "0|0" "0|0"
## [3,] "0|0" "0|0" "0|1" "0|0" "0|0"
## [4,] "0|0" "0|0" "0|1" "0|0" "0|0"
## [5,] "0|0" "0|0" "1|0" "0|0" "0|0"
## [6,] "0|0" "0|0" "0|0" "0|0" "1|2"
## [7,] "0|0" "0|0" "0|0" "0|0" "2|2"
## [8,] "0|0" "0|0" "0|0" "0|0" "2|2"
## [9,] "0|0" "0|0" "0|0" "0|0" "1|2"
## [10,] "0|0" "0|0" "0|0" "0|0" "1|2"
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] VariantAnnotation_1.22.0
## [2] Rsamtools_1.28.0
## [3] Biostrings_2.44.0
## [4] XVector_0.16.0
## [5] SummarizedExperiment_1.6.0
## [6] DelayedArray_0.2.0
## [7] matrixStats_0.52.2
## [8] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [9] GenomicFeatures_1.28.0
## [10] GenomicRanges_1.28.0
## [11] GenomeInfoDb_1.12.0
## [12] org.Hs.eg.db_3.4.1
## [13] AnnotationDbi_1.38.0
## [14] IRanges_2.10.0
## [15] Biobase_2.36.0
## [16] GA4GHclient_1.0.0
## [17] S4Vectors_0.14.0
## [18] BiocGenerics_0.22.0
## [19] BiocStyle_2.4.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.10 compiler_3.4.0
## [3] highr_0.6 bitops_1.0-6
## [5] tools_3.4.0 zlibbioc_1.22.0
## [7] biomaRt_2.32.0 digest_0.6.12
## [9] jsonlite_1.4 tibble_1.3.0
## [11] BSgenome_1.44.0 evaluate_0.10
## [13] RSQLite_1.1-2 memoise_1.1.0
## [15] lattice_0.20-35 Matrix_1.2-9
## [17] DBI_0.6-1 yaml_2.1.14
## [19] GenomeInfoDbData_0.99.0 httr_1.2.1
## [21] dplyr_0.5.0 rtracklayer_1.36.0
## [23] stringr_1.2.0 knitr_1.15.1
## [25] htmlwidgets_0.8 DT_0.2
## [27] rprojroot_1.2 grid_3.4.0
## [29] R6_2.2.0 XML_3.98-1.6
## [31] BiocParallel_1.10.0 rmarkdown_1.4
## [33] purrr_0.2.2 magrittr_1.5
## [35] GenomicAlignments_1.12.0 backports_1.0.5
## [37] htmltools_0.3.5 assertthat_0.2.0
## [39] stringi_1.1.5 RCurl_1.95-4.8