1 User instruction

This package includes 2 vignettes. This One is about the class definitions, basic methods and operations of VariantExperiment that are defined or directly inherited from SummarizedExperiment. For users who want to apply VariantExperiment directly to their analysis high-throughput genotyping or DNA-seq data sets, please refer to the other vignette.

2 Introduction

As the sequencing data (DNA-seq, RNA-seq, single cell RNA-seq…) gets increasingly larger-profile, the memory space in R has been an obstable for fast and efficient data processing, because most available R or Bioconductor packages are developed based on in-memory data manipulation. SingleCellExperiment has achieved efficient on-disk saving/reading of the large-scale count data as HDF5Array objects. However, there was still no such light-weight containers available for high-throughput DNA-seq / genotyping data.

We have developed VariantExperiment, a Bioconductor package for saving the VCF/GDS format data into RangedSummarizedExperiment object. The high-throughput genetic/genomic data are saved in GDSArray objects, a direct extension of DelayedArray with GDS back-end. In addition to the light-weight Assay data, We have also realized the on-disk saving of annotation data for both features/samples (corresponding to rowData/colData) by implementing the DelayedDataFrame data structure. The on-disk representation of both assay data and annotation data realizes on-disk reading and processing and saves R memory space significantly. The interface of RangedSummarizedExperiment data format enables easy and common manipulations for high-throughput genetic/genomic data with common SummarizedExperiment metaphor in R and Bioconductor.

3 Installation

  1. Download the package from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("VariantExperiment")

Or install the development version of the package from Github.

BiocManager::install(“Bioconductor/VariantExperiment”) 
  1. Load the package into R session.
library(VariantExperiment)

4 Background

4.1 GDSArray

GDSArray is a Bioconductor package that represents GDS files as objects derived from the DelayedArray package and DelayedArray class. It converts GDS nodes into a DelayedArray-derived data structure. The rich common methods and data operations defined on GDSArray makes it more R-user-friendly than working with the GDS file directly. The array data from GDS files are always returned with the first dimension being features (genes/variants/snps) and the second dimension being samples. This feature is consistent with the assay data saved in SummarizedExperiment, and makes the GDSArray package more interoperable with other established Bioconductor data infrastructure.

The GDSArray() constructor takes 2 arguments: the file path and the GDS node name inside the GDS file.

gds <- SeqArray::seqExampleFileName("gds")
GDSArray(gds, "genotype/data")
## <1348 x 90 x 2> array of class GDSArray and type "integer":
## ,,1
##      NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
##    1       3       3       0       3   .       0       0       0       0
##    2       3       3       0       3   .       0       0       0       0
##  ...       .       .       .       .   .       .       .       .       .
## 1347       0       0       0       0   .       0       0       0       0
## 1348       3       3       0       3   .       3       3       3       3
## 
## ,,2
##      NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
##    1       3       3       0       3   .       0       0       0       0
##    2       3       3       0       3   .       0       0       0       0
##  ...       .       .       .       .   .       .       .       .       .
## 1347       0       0       0       0   .       0       0       0       0
## 1348       3       3       1       3   .       3       3       3       3
GDSArray(gds, "sample.id")
## <90> array of class GDSArray and type "character":
##   NA06984   NA06985   NA06986   .         NA12891   NA12892 
## "NA06984" "NA06985" "NA06986"         . "NA12891" "NA12892"

More details about GDS or GDSArray format can be found in the vignettes of the gdsfmt, SNPRelate, SeqArray, GDSArray and DelayedArray packages.

4.2 DelayedDataFrame

DelayedDataFrame is a Bioconductor package that implements delayed operations on DataFrame objects using standard DataFrame metaphor. Each column of data inside DelayedDataFrame is represented as 1-dimensional GDSArray with on-disk GDS file. Methods like show,validity check, [, [[ subsetting, rbind, cbind are implemented for DelayedDataFrame. The DelayedDataFrame stays lazy until an explicit realization call like DataFrame() constructor or as.list() incurred. More details about DelayedDataFrame data structure could be found in the vignette of DelayedDataFrame package.

5 VariantExperiment class

5.1 VariantExperiment class

VariantExperiment class is defined to extend RangedSummarizedExperiment. The difference would be that the assay data are saved as GDSArray, and the annotation data are saved by default as DelayedDataFrame (with option to save as ordinary DataFrame), both of which are representing the data on-disk with GDS back-end. There are coercion methods defined for VCF and GDS files into VariantExperiment objects (Check the method vignette for more details). Here we take an example for illustration.

ve <- makeVariantExperimentFromGDS(gds)
ve
## class: VariantExperiment 
## dim: 1348 90 
## metadata(0):
## assays(3): genotype/data phase/data annotation/format/DP/data
## rownames(1348): 1 2 ... 1347 1348
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(1): family

5.2 slot accessors

assay data are in GDSArray format, and could be retrieve by the assays()/assay() function.

assays(ve)
## List of length 3
## names(3): genotype/data phase/data annotation/format/DP/data
assay(ve, 1)
## <1348 x 90 x 2> array of class GDSArray and type "integer":
## ,,1
##      NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
##    1       3       3       0       3   .       0       0       0       0
##    2       3       3       0       3   .       0       0       0       0
##  ...       .       .       .       .   .       .       .       .       .
## 1347       0       0       0       0   .       0       0       0       0
## 1348       3       3       0       3   .       3       3       3       3
## 
## ,,2
##      NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
##    1       3       3       0       3   .       0       0       0       0
##    2       3       3       0       3   .       0       0       0       0
##  ...       .       .       .       .   .       .       .       .       .
## 1347       0       0       0       0   .       0       0       0       0
## 1348       3       3       1       3   .       3       3       3       3

The rowData() of the VariantExperiment is saved in DelayedDataFrame format. We can use rowRanges() / rowData() to retrieve the feature-related annotation file, with/without a GenomicRange format.

rowRanges(ve)
## GRanges object with 1348 ranges and 14 metadata columns:
##        seqnames    ranges strand |          ID            ALT            REF
##           <Rle> <IRanges>  <Rle> |  <GDSArray> <DelayedArray> <DelayedArray>
##      1        1   1105366      * | rs111751804              C              T
##      2        1   1105411      * | rs114390380              A              G
##      3        1   1110294      * |   rs1320571              A              G
##    ...      ...       ...    ... .         ...            ...            ...
##   1346       22  43691009      * |   rs8135982              T              C
##   1347       22  43691073      * | rs116581756              A              G
##   1348       22  48958933      * |   rs5771206              G              A
##              QUAL     FILTER    info_AA    info_AC    info_AN    info_DP
##        <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray>
##      1        NaN       PASS          T          4        114       3251
##      2        NaN       PASS          G          1        106       2676
##      3        NaN       PASS          A          6        154       7610
##    ...        ...        ...        ...        ...        ...        ...
##   1346        NaN       PASS          C         11        142        823
##   1347        NaN       PASS          G          1        152       1257
##   1348        NaN       PASS          G          1          6         48
##          info_HM2   info_HM3    info_OR     info_GP    info_BN
##        <GDSArray> <GDSArray> <GDSArray>  <GDSArray> <GDSArray>
##      1          0          0              1:1115503        132
##      2          0          0              1:1115548        132
##      3          1          1              1:1120431         88
##    ...        ...        ...        ...         ...        ...
##   1346          0          0            22:45312345        116
##   1347          0          0            22:45312409        132
##   1348          0          0            22:50616806        114
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
rowData(ve)
## DelayedDataFrame with 1348 rows and 14 columns
##               ID            ALT            REF       QUAL     FILTER
##       <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray>
## 1    rs111751804              C              T        NaN       PASS
## 2    rs114390380              A              G        NaN       PASS
## 3      rs1320571              A              G        NaN       PASS
## ...          ...            ...            ...        ...        ...
## 1346   rs8135982              T              C        NaN       PASS
## 1347 rs116581756              A              G        NaN       PASS
## 1348   rs5771206              G              A        NaN       PASS
##         info_AA    info_AC    info_AN    info_DP   info_HM2   info_HM3
##      <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray>
## 1             T          4        114       3251          0          0
## 2             G          1        106       2676          0          0
## 3             A          6        154       7610          1          1
## ...         ...        ...        ...        ...        ...        ...
## 1346          C         11        142        823          0          0
## 1347          G          1        152       1257          0          0
## 1348          G          1          6         48          0          0
##         info_OR     info_GP    info_BN
##      <GDSArray>  <GDSArray> <GDSArray>
## 1                 1:1115503        132
## 2                 1:1115548        132
## 3                 1:1120431         88
## ...         ...         ...        ...
## 1346            22:45312345        116
## 1347            22:45312409        132
## 1348            22:50616806        114

sample-related annotation is in DelayedDataFrame format, and could be retrieved by colData().

colData(ve)
## DelayedDataFrame with 90 rows and 1 column
##             family
##         <GDSArray>
## NA06984       1328
## NA06985           
## NA06986      13291
## ...            ...
## NA12890       1463
## NA12891           
## NA12892

The gdsfile() will retrieve the gds file path associated with the VariantExperiment object.

gdsfile(ve)
## [1] "/home/biocbuild/bbs-3.10-bioc/R/library/SeqArray/extdata/CEU_Exon.gds"

Some other getter function like metadata() will return any metadata that we have saved inside the VariantExperiment object.

metadata(ve)
## list()

6 Subsetting methods

The VariantExperiment object supports subsetting methods with [, $ and subsetByOverlap etc. as in SummarizedExperiment.

6.1 two-dimensional subsetting

ve[1:10, 1:5]
## class: VariantExperiment 
## dim: 10 5 
## metadata(0):
## assays(3): genotype/data phase/data annotation/format/DP/data
## rownames(10): 1 2 ... 9 10
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(5): NA06984 NA06985 NA06986 NA06989 NA06994
## colData names(1): family

6.2 $ subsetting

The $ subsetting could be operated directly on colData() columns, for easy sample extraction. Note that the colData/rowData are in the DelayedDataFrame format, with each column saved as GDSArray. So when doing subsetting, we need to use as.logical() to convert the 1-dimensional GDSArray into ordinary vector.

colData(ve)
## DelayedDataFrame with 90 rows and 1 column
##             family
##         <GDSArray>
## NA06984       1328
## NA06985           
## NA06986      13291
## ...            ...
## NA12890       1463
## NA12891           
## NA12892
ve[, as.logical(ve$family == "1328")]
## class: VariantExperiment 
## dim: 1348 2 
## metadata(0):
## assays(3): genotype/data phase/data annotation/format/DP/data
## rownames(1348): 1 2 ... 1347 1348
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(2): NA06984 NA06989
## colData names(1): family

subsetting by rowData() columns.

rowData(ve)
## DelayedDataFrame with 1348 rows and 14 columns
##               ID            ALT            REF       QUAL     FILTER
##       <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray>
## 1    rs111751804              C              T        NaN       PASS
## 2    rs114390380              A              G        NaN       PASS
## 3      rs1320571              A              G        NaN       PASS
## ...          ...            ...            ...        ...        ...
## 1346   rs8135982              T              C        NaN       PASS
## 1347 rs116581756              A              G        NaN       PASS
## 1348   rs5771206              G              A        NaN       PASS
##         info_AA    info_AC    info_AN    info_DP   info_HM2   info_HM3
##      <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray>
## 1             T          4        114       3251          0          0
## 2             G          1        106       2676          0          0
## 3             A          6        154       7610          1          1
## ...         ...        ...        ...        ...        ...        ...
## 1346          C         11        142        823          0          0
## 1347          G          1        152       1257          0          0
## 1348          G          1          6         48          0          0
##         info_OR     info_GP    info_BN
##      <GDSArray>  <GDSArray> <GDSArray>
## 1                 1:1115503        132
## 2                 1:1115548        132
## 3                 1:1120431         88
## ...         ...         ...        ...
## 1346            22:45312345        116
## 1347            22:45312409        132
## 1348            22:50616806        114
ve[as.logical(rowData(ve)$REF == "T"),]
## class: VariantExperiment 
## dim: 214 90 
## metadata(0):
## assays(3): genotype/data phase/data annotation/format/DP/data
## rownames(214): 1 4 ... 1320 1328
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(1): family

6.3 Range-based operations

VariantExperiment objects support all of the findOverlaps() methods and associated functions. This includes subsetByOverlaps(), which makes it easy to subset a VariantExperiment object by an interval.

ve1 <- subsetByOverlaps(ve, GRanges("22:1-48958933"))
ve1
## class: VariantExperiment 
## dim: 23 90 
## metadata(0):
## assays(3): genotype/data phase/data annotation/format/DP/data
## rownames(23): 1326 1327 ... 1347 1348
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(1): family

7 Save / load VariantExperiment object

Note that after the subsetting by [, $ or Ranged-based operations, and you feel satisfied with the data for downstream analysis, you need to save that VariantExperiment object to synchronize the gds file (on-disk) associated with the subset of data (in-memory representation) before any statistical analysis. Otherwise, an error will be returned.

For example, after we subset the ve by GRanges("22:1-48958933"), and we want to calculate the hwe based on the 23 variants, an error will be generated indicating that we need to sync the on-disk and in-memory representations.

hwe(ve1)
## Error in .saveGDSMaybe(gdsfile) : use
##   'saveVariantExperiment()' to synchronize on-disk and
##   in-memory representations

7.1 save VariantExperiment object

Use the function saveVariantExperiment to synchronize the on-disk and in-memory representation. This function writes the processed data as ve.gds, and save the R object (which lazily represent the backend data set) as ve.rds under the specified directory. It finally returns a new VariantExperiment object into current R session generated from the newly saved data.

a <- tempfile()
ve2 <- saveVariantExperiment(ve1, dir=a, replace=TRUE)

7.2 load VariantExperiment object

You can alternatively use loadVariantExperiment to load the synchronized data into R session, by providing only the file directory. It reads the VariantExperiment object saved as ve.rds, as lazy representation of the backend ve.gds file under the specific directory.

ve3 <- loadVariantExperiment(dir=a)
gdsfile(ve3)
## [1] "/tmp/Rtmp79zWzD/file5f7c3bbfc70c/ve.gds"
all.equal(ve2, ve3)
## [1] TRUE

Now we are all set for any downstream analysis as needed.

head(hwe(ve2))
##   variant.id nAA nAa naa     afreq         p            f
## 1       1326  88   1   0 0.9943820 1.0000000 -0.005649718
## 2       1327  41  35  13 0.6573034 0.2421774  0.127084209
## 3       1328  39  27   8 0.7094595 0.3951190  0.114950166
## 4       1329  89   1   0 0.9944444 1.0000000 -0.005586592
## 5       1330  69   3   0 0.9791667 1.0000000 -0.021276596
## 6       1331  69   5   0 0.9662162 1.0000000 -0.034965035

8 Session Info

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] VariantExperiment_1.0.0     DelayedDataFrame_1.2.0     
##  [3] GDSArray_1.6.0              gdsfmt_1.22.0              
##  [5] SummarizedExperiment_1.16.0 DelayedArray_0.12.0        
##  [7] BiocParallel_1.20.0         matrixStats_0.55.0         
##  [9] Biobase_2.46.0              GenomicRanges_1.38.0       
## [11] GenomeInfoDb_1.22.0         IRanges_2.20.0             
## [13] S4Vectors_0.24.0            BiocGenerics_0.32.0        
## [15] BiocStyle_2.14.0           
## 
## loaded via a namespace (and not attached):
##  [1] mitml_0.3-7            Rcpp_1.0.2             lattice_0.20-38       
##  [4] tidyr_1.0.0            Biostrings_2.54.0      assertthat_0.2.1      
##  [7] zeallot_0.1.0          digest_0.6.22          pan_1.6               
## [10] R6_2.4.0               jomo_2.6-10            backports_1.1.5       
## [13] evaluate_0.14          pillar_1.4.2           zlibbioc_1.32.0       
## [16] rlang_0.4.1            minqa_1.2.4            nloptr_1.2.1          
## [19] rpart_4.1-15           Matrix_1.2-17          rmarkdown_1.16        
## [22] splines_3.6.1          lme4_1.1-21            stringr_1.4.0         
## [25] GWASExactHW_1.01       RCurl_1.95-4.12        broom_0.5.2           
## [28] compiler_3.6.1         xfun_0.10              pkgconfig_2.0.3       
## [31] mgcv_1.8-30            htmltools_0.4.0        nnet_7.3-12           
## [34] tidyselect_0.2.5       tibble_2.1.3           GenomeInfoDbData_1.2.2
## [37] bookdown_0.14          crayon_1.3.4           dplyr_0.8.3           
## [40] MASS_7.3-51.4          bitops_1.0-6           grid_3.6.1            
## [43] nlme_3.1-141           lifecycle_0.1.0        magrittr_1.5          
## [46] SeqVarTools_1.24.0     stringi_1.4.3          XVector_0.26.0        
## [49] SNPRelate_1.20.0       mice_3.6.0             generics_0.0.2        
## [52] vctrs_0.2.0            boot_1.3-23            tools_3.6.1           
## [55] glue_1.3.1             purrr_0.3.3            survival_2.44-1.1     
## [58] yaml_2.2.0             SeqArray_1.26.0        BiocManager_1.30.9    
## [61] logistf_1.23           knitr_1.25