Google Genomics implements the GA4GH variants API and this R package can retrieve data from that implementation. For more detail, see https://cloud.google.com/genomics/v1beta2/reference/variants
library(GoogleGenomics)
# This vignette is authenticated on package load from the env variable GOOGLE_API_KEY.
# When running interactively, call the authenticate method.
# ?authenticate
By default, this function retrieves variants for a small genomic region from the 1,000 Genomes phase 1 variants.
variants <- getVariants()
## Fetching variants page.
## Variants are now available.
length(variants)
## [1] 4
We can see that 4 individual variants were returned and that the JSON response was deserialized into an R list object:
class(variants)
## [1] "list"
mode(variants)
## [1] "list"
The top level field names are:
names(variants[[1]])
## [1] "variantSetId" "id" "names" "created"
## [5] "referenceName" "start" "end" "referenceBases"
## [9] "alternateBases" "quality" "filter" "info"
## [13] "calls"
The variant contains nested calls:
length(variants[[1]]$calls)
## [1] 1092
With top level call field names:
names(variants[[1]]$calls[[1]])
## [1] "callSetId" "callSetName" "genotype"
## [4] "phaseset" "genotypeLikelihood" "info"
And examining a call for a particular variant:
variants[[1]]$referenceName
## [1] "22"
variants[[1]]$start
## [1] "16051452"
variants[[1]]$alternateBases
## [[1]]
## [1] "C"
variants[[1]]$calls[[1]]
## $callSetId
## [1] "10473108253681171589-0"
##
## $callSetName
## [1] "HG00261"
##
## $genotype
## $genotype[[1]]
## [1] 0
##
## $genotype[[2]]
## [1] 0
##
##
## $phaseset
## [1] "*"
##
## $genotypeLikelihood
## $genotypeLikelihood[[1]]
## [1] -0.1
##
## $genotypeLikelihood[[2]]
## [1] -0.67
##
## $genotypeLikelihood[[3]]
## [1] -3.3
##
##
## $info
## $info$DS
## $info$DS[[1]]
## [1] "0.000"
This is good, but this data becomes much more useful when it is converted to Bioconductor data types. For example, we can convert variants in this list representation to VRanges from VariantAnnotation:
variantsToVRanges(variants)
## VRanges object with 4 ranges and 2 metadata columns:
## seqnames ranges strand ref
## <Rle> <IRanges> <Rle> <character>
## rs143503259 chr22 [16051453, 16051453] * A
## rs192339082 chr22 [16051477, 16051477] * C
## rs79725552 chr22 [16051480, 16051480] * T
## rs141578542 chr22 [16051497, 16051497] * A
## alt totalDepth refDepth
## <characterOrRle> <integerOrRle> <integerOrRle>
## rs143503259 C <NA> <NA>
## rs192339082 A <NA> <NA>
## rs79725552 C <NA> <NA>
## rs141578542 G <NA> <NA>
## altDepth sampleNames softFilterMatrix | QUAL
## <integerOrRle> <factorOrRle> <matrix> | <numeric>
## rs143503259 <NA> <NA> | 100
## rs192339082 <NA> <NA> | 100
## rs79725552 <NA> <NA> | 100
## rs141578542 <NA> <NA> | 100
## FILTER
## <character>
## rs143503259 PASS
## rs192339082 PASS
## rs79725552 PASS
## rs141578542 PASS
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
## hardFilters: NULL
Next let’s use package VariantAnnotation to annotate some specific 1,000 Genomes Phase 1 variants.
library(VariantAnnotation)
Note that the parameters start
and end
are expressed in 0-based coordinates per the GA4GH specification but the Bioconductor data type converters in GoogleGenomics, by default, transform the returned data to 1-based coordinates.
granges <- getVariants(datasetId="10473108253681171589",
chromosome="22",
start=50300077,
end=50303000,
converter=variantsToGRanges)
## Fetching variants page.
## Continuing variant query with the nextPageToken: CIWK_hcQ_Nr-8ZGlk78i
## Fetching variants page.
## Continuing variant query with the nextPageToken: CIaM_hcQvM_un46SmcJ0
## Fetching variants page.
## Continuing variant query with the nextPageToken: CJGN_hcQiYna3cSwse_RAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CMKQ_hcQr9OUpKu4n_xz
## Fetching variants page.
## Continuing variant query with the nextPageToken: CMKS_hcQn-nIneaph7WJAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CKOU_hcQ2eCCwsmBysdd
## Fetching variants page.
## Continuing variant query with the nextPageToken: CLWV_hcQmo-n-tX08uPyAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CMqX_hcQ__3H4KKV9tUR
## Fetching variants page.
## Continuing variant query with the nextPageToken: CKuZ_hcQ9pHXwa742bKgAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CLqa_hcQ-6zjo_Dc86gy
## Fetching variants page.
## Continuing variant query with the nextPageToken: CLSb_hcQ3vrDo5XJwsB_
## Fetching variants page.
## Continuing variant query with the nextPageToken: CKOd_hcQ6IPgk-PVjskK
## Fetching variants page.
## Continuing variant query with the nextPageToken: CPGe_hcQ-8Saw6Ojg8yXAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CPGf_hcQuKDdodiCvLUm
## Fetching variants page.
## Variants are now available.
Now locate the protein coding variants in each:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
granges_locations <- locateVariants(granges, txdb, CodingVariants())
## 'select()' returned many:1 mapping between keys and columns
granges_locations
## GRanges object with 7 ranges and 9 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
## 1 chr22 [50301422, 50301422] - | coding 939 939
## 2 chr22 [50301476, 50301476] - | coding 885 885
## 3 chr22 [50301488, 50301488] - | coding 873 873
## 4 chr22 [50301494, 50301494] - | coding 867 867
## 5 chr22 [50301584, 50301584] - | coding 777 777
## 6 chr22 [50302962, 50302962] - | coding 698 698
## 7 chr22 [50302995, 50302995] - | coding 665 665
## QUERYID TXID CDSID GENEID PRECEDEID
## <integer> <character> <IntegerList> <character> <CharacterList>
## 1 24 75253 218562 79087
## 2 25 75253 218562 79087
## 3 26 75253 218562 79087
## 4 27 75253 218562 79087
## 5 28 75253 218562 79087
## 6 57 75253 218563 79087
## 7 58 75253 218563 79087
## FOLLOWID
## <CharacterList>
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
And predict the effect of the protein coding variants:
library(BSgenome.Hsapiens.UCSC.hg19)
granges_coding <- predictCoding(rep(granges, elementNROWS(granges$ALT)),
txdb,
seqSource=Hsapiens,
varAllele=unlist(granges$ALT, use.names=FALSE))
granges_coding
## GRanges object with 7 ranges and 16 metadata columns:
## seqnames ranges strand | REF
## <Rle> <IRanges> <Rle> | <DNAStringSet>
## rs114335781 chr22 [50301422, 50301422] - | G
## rs8135963 chr22 [50301476, 50301476] - | T
## rs200080075 chr22 [50301488, 50301488] - | C
## rs147801200 chr22 [50301494, 50301494] - | G
## rs138060012 chr22 [50301584, 50301584] - | C
## rs114264124 chr22 [50302962, 50302962] - | C
## rs149209714 chr22 [50302995, 50302995] - | C
## ALT QUAL FILTER varAllele
## <DNAStringSetList> <numeric> <character> <DNAStringSet>
## rs114335781 A 100 PASS T
## rs8135963 C 100 PASS G
## rs200080075 T 100 PASS A
## rs147801200 A 100 PASS T
## rs138060012 T 100 PASS A
## rs114264124 T 100 PASS A
## rs149209714 G 100 PASS C
## CDSLOC PROTEINLOC QUERYID TXID CDSID
## <IRanges> <IntegerList> <integer> <character> <IntegerList>
## rs114335781 [939, 939] 313 24 75253 218562
## rs8135963 [885, 885] 295 25 75253 218562
## rs200080075 [873, 873] 291 26 75253 218562
## rs147801200 [867, 867] 289 27 75253 218562
## rs138060012 [777, 777] 259 28 75253 218562
## rs114264124 [698, 698] 233 57 75253 218563
## rs149209714 [665, 665] 222 58 75253 218563
## GENEID CONSEQUENCE REFCODON VARCODON
## <character> <factor> <DNAStringSet> <DNAStringSet>
## rs114335781 79087 synonymous ATC ATT
## rs8135963 79087 synonymous GCA GCG
## rs200080075 79087 synonymous CCG CCA
## rs147801200 79087 synonymous CAC CAT
## rs138060012 79087 synonymous CCG CCA
## rs114264124 79087 nonsynonymous CGG CAG
## rs149209714 79087 nonsynonymous GGA GCA
## REFAA VARAA
## <AAStringSet> <AAStringSet>
## rs114335781 I I
## rs8135963 A A
## rs200080075 P P
## rs147801200 H H
## rs138060012 P P
## rs114264124 R Q
## rs149209714 G A
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Finally, add gene information:
library(org.Hs.eg.db)
annots <- select(org.Hs.eg.db,
keys=granges_coding$GENEID,
keytype="ENTREZID",
columns=c("SYMBOL", "GENENAME","ENSEMBL"))
## 'select()' returned many:1 mapping between keys and columns
cbind(elementMetadata(granges_coding), annots)
## DataFrame with 7 rows and 20 columns
## REF ALT QUAL FILTER varAllele
## <DNAStringSet> <DNAStringSetList> <numeric> <character> <DNAStringSet>
## 1 G A 100 PASS T
## 2 T C 100 PASS G
## 3 C T 100 PASS A
## 4 G A 100 PASS T
## 5 C T 100 PASS A
## 6 C T 100 PASS A
## 7 C G 100 PASS C
## CDSLOC PROTEINLOC QUERYID TXID CDSID GENEID
## <IRanges> <IntegerList> <integer> <character> <IntegerList> <character>
## 1 [939, 939] 313 24 75253 218562 79087
## 2 [885, 885] 295 25 75253 218562 79087
## 3 [873, 873] 291 26 75253 218562 79087
## 4 [867, 867] 289 27 75253 218562 79087
## 5 [777, 777] 259 28 75253 218562 79087
## 6 [698, 698] 233 57 75253 218563 79087
## 7 [665, 665] 222 58 75253 218563 79087
## CONSEQUENCE REFCODON VARCODON REFAA VARAA
## <factor> <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
## 1 synonymous ATC ATT I I
## 2 synonymous GCA GCG A A
## 3 synonymous CCG CCA P P
## 4 synonymous CAC CAT H H
## 5 synonymous CCG CCA P P
## 6 nonsynonymous CGG CAG R Q
## 7 nonsynonymous GGA GCA G A
## ENTREZID SYMBOL GENENAME
## <character> <character> <character>
## 1 79087 ALG12 ALG12, alpha-1,6-mannosyltransferase
## 2 79087 ALG12 ALG12, alpha-1,6-mannosyltransferase
## 3 79087 ALG12 ALG12, alpha-1,6-mannosyltransferase
## 4 79087 ALG12 ALG12, alpha-1,6-mannosyltransferase
## 5 79087 ALG12 ALG12, alpha-1,6-mannosyltransferase
## 6 79087 ALG12 ALG12, alpha-1,6-mannosyltransferase
## 7 79087 ALG12 ALG12, alpha-1,6-mannosyltransferase
## ENSEMBL
## <character>
## 1 ENSG00000182858
## 2 ENSG00000182858
## 3 ENSG00000182858
## 4 ENSG00000182858
## 5 ENSG00000182858
## 6 ENSG00000182858
## 7 ENSG00000182858
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] org.Hs.eg.db_3.3.0
## [2] BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [3] BSgenome_1.40.0
## [4] rtracklayer_1.32.0
## [5] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [6] GenomicFeatures_1.24.2
## [7] AnnotationDbi_1.34.2
## [8] GoogleGenomics_1.4.2
## [9] VariantAnnotation_1.18.1
## [10] GenomicAlignments_1.8.0
## [11] Rsamtools_1.24.0
## [12] Biostrings_2.40.0
## [13] XVector_0.12.0
## [14] SummarizedExperiment_1.2.2
## [15] Biobase_2.32.0
## [16] GenomicRanges_1.24.0
## [17] GenomeInfoDb_1.8.2
## [18] IRanges_2.6.0
## [19] S4Vectors_0.10.0
## [20] BiocGenerics_0.18.0
## [21] knitr_1.13
## [22] BiocStyle_2.0.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.5 formatR_1.4 bitops_1.0-6
## [4] tools_3.3.0 zlibbioc_1.18.0 biomaRt_2.28.0
## [7] digest_0.6.9 jsonlite_0.9.20 evaluate_0.9
## [10] RSQLite_1.0.0 DBI_0.4-1 curl_0.9.7
## [13] yaml_2.1.13 stringr_1.0.0 httr_1.1.0
## [16] R6_2.1.2 XML_3.98-1.4 BiocParallel_1.6.2
## [19] rmarkdown_0.9.6 magrittr_1.5 htmltools_0.3.5
## [22] stringi_1.0-1 RCurl_1.95-4.8 rjson_0.2.15