VariantExperiment 1.0.0
This package includes 2 vignettes. This One is about the class
definitions, basic methods and operations of VariantExperiment
that
are defined or directly inherited from SummarizedExperiment
. For
users who want to apply VariantExperiment
directly to their analysis
high-throughput genotyping or DNA-seq data sets, please refer to the
other vignette.
As the sequencing data (DNA-seq, RNA-seq, single cell RNA-seq…) gets increasingly larger-profile, the memory space in R has been an obstable for fast and efficient data processing, because most available R or Bioconductor packages are developed based on in-memory data manipulation. SingleCellExperiment has achieved efficient on-disk saving/reading of the large-scale count data as HDF5Array objects. However, there was still no such light-weight containers available for high-throughput DNA-seq / genotyping data.
We have developed VariantExperiment, a Bioconductor package for
saving the VCF/GDS
format data into RangedSummarizedExperiment
object. The high-throughput genetic/genomic data are saved in
GDSArray objects, a direct extension of DelayedArray with GDS
back-end. In addition to the light-weight Assay
data, We have also
realized the on-disk saving of annotation data for both
features/samples (corresponding to rowData/colData
) by implementing
the DelayedDataFrame data structure. The on-disk representation
of both assay data and annotation data realizes on-disk reading and
processing and saves R memory space significantly. The interface of
RangedSummarizedExperiment
data format enables easy and common
manipulations for high-throughput genetic/genomic data with common
SummarizedExperiment metaphor in R and Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("VariantExperiment")
Or install the development version of the package from Github.
BiocManager::install(“Bioconductor/VariantExperiment”)
library(VariantExperiment)
GDSArray is a Bioconductor package that represents GDS
files as
objects derived from the DelayedArray package and DelayedArray
class. It converts GDS
nodes into a DelayedArray
-derived data
structure. The rich common methods and data operations defined on
GDSArray
makes it more R-user-friendly than working with the GDS
file directly. The array data from GDS files are always returned with
the first dimension being features (genes/variants/snps) and the
second dimension being samples
. This feature is consistent with the
assay data saved in SummarizedExperiment
, and makes the GDSArray
package more interoperable with other established Bioconductor data
infrastructure.
The GDSArray()
constructor takes 2 arguments: the file path and the
GDS node name inside the GDS file.
gds <- SeqArray::seqExampleFileName("gds")
GDSArray(gds, "genotype/data")
## <1348 x 90 x 2> array of class GDSArray and type "integer":
## ,,1
## NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
## 1 3 3 0 3 . 0 0 0 0
## 2 3 3 0 3 . 0 0 0 0
## ... . . . . . . . . .
## 1347 0 0 0 0 . 0 0 0 0
## 1348 3 3 0 3 . 3 3 3 3
##
## ,,2
## NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
## 1 3 3 0 3 . 0 0 0 0
## 2 3 3 0 3 . 0 0 0 0
## ... . . . . . . . . .
## 1347 0 0 0 0 . 0 0 0 0
## 1348 3 3 1 3 . 3 3 3 3
GDSArray(gds, "sample.id")
## <90> array of class GDSArray and type "character":
## NA06984 NA06985 NA06986 . NA12891 NA12892
## "NA06984" "NA06985" "NA06986" . "NA12891" "NA12892"
More details about GDS
or GDSArray
format can be found in the
vignettes of the gdsfmt, SNPRelate, SeqArray, GDSArray
and DelayedArray packages.
DelayedDataFrame is a Bioconductor package that implements
delayed operations on DataFrame
objects using standard DataFrame
metaphor. Each column of data inside DelayedDataFrame
is represented
as 1-dimensional GDSArray
with on-disk GDS file. Methods like
show
,validity check
, [
, [[
subsetting, rbind
, cbind
are implemented for
DelayedDataFrame
. The DelayedDataFrame
stays lazy until an
explicit realization call like DataFrame()
constructor or
as.list()
incurred. More details about DelayedDataFrame data
structure could be found in the vignette of DelayedDataFrame
package.
VariantExperiment
classVariantExperiment
classVariantExperiment
class is defined to extend
RangedSummarizedExperiment
. The difference would be that the assay
data are saved as GDSArray
, and the annotation data are saved by
default as DelayedDataFrame
(with option to save as ordinary
DataFrame
), both of which are representing the data on-disk with GDS
back-end. There are coercion methods defined for VCF
and GDS
files into
VariantExperiment
objects (Check the method vignette for more details).
Here we take an example for illustration.
ve <- makeVariantExperimentFromGDS(gds)
ve
## class: VariantExperiment
## dim: 1348 90
## metadata(0):
## assays(3): genotype/data phase/data annotation/format/DP/data
## rownames(1348): 1 2 ... 1347 1348
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(1): family
assay data are in GDSArray
format, and could be retrieve by the
assays()/assay()
function.
assays(ve)
## List of length 3
## names(3): genotype/data phase/data annotation/format/DP/data
assay(ve, 1)
## <1348 x 90 x 2> array of class GDSArray and type "integer":
## ,,1
## NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
## 1 3 3 0 3 . 0 0 0 0
## 2 3 3 0 3 . 0 0 0 0
## ... . . . . . . . . .
## 1347 0 0 0 0 . 0 0 0 0
## 1348 3 3 0 3 . 3 3 3 3
##
## ,,2
## NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
## 1 3 3 0 3 . 0 0 0 0
## 2 3 3 0 3 . 0 0 0 0
## ... . . . . . . . . .
## 1347 0 0 0 0 . 0 0 0 0
## 1348 3 3 1 3 . 3 3 3 3
The rowData()
of the VariantExperiment
is saved in
DelayedDataFrame
format. We can use rowRanges()
/ rowData()
to
retrieve the feature-related annotation file, with/without a
GenomicRange format.
rowRanges(ve)
## GRanges object with 1348 ranges and 14 metadata columns:
## seqnames ranges strand | ID ALT REF
## <Rle> <IRanges> <Rle> | <GDSArray> <DelayedArray> <DelayedArray>
## 1 1 1105366 * | rs111751804 C T
## 2 1 1105411 * | rs114390380 A G
## 3 1 1110294 * | rs1320571 A G
## ... ... ... ... . ... ... ...
## 1346 22 43691009 * | rs8135982 T C
## 1347 22 43691073 * | rs116581756 A G
## 1348 22 48958933 * | rs5771206 G A
## QUAL FILTER info_AA info_AC info_AN info_DP
## <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray>
## 1 NaN PASS T 4 114 3251
## 2 NaN PASS G 1 106 2676
## 3 NaN PASS A 6 154 7610
## ... ... ... ... ... ... ...
## 1346 NaN PASS C 11 142 823
## 1347 NaN PASS G 1 152 1257
## 1348 NaN PASS G 1 6 48
## info_HM2 info_HM3 info_OR info_GP info_BN
## <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray>
## 1 0 0 1:1115503 132
## 2 0 0 1:1115548 132
## 3 1 1 1:1120431 88
## ... ... ... ... ... ...
## 1346 0 0 22:45312345 116
## 1347 0 0 22:45312409 132
## 1348 0 0 22:50616806 114
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
rowData(ve)
## DelayedDataFrame with 1348 rows and 14 columns
## ID ALT REF QUAL FILTER
## <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray>
## 1 rs111751804 C T NaN PASS
## 2 rs114390380 A G NaN PASS
## 3 rs1320571 A G NaN PASS
## ... ... ... ... ... ...
## 1346 rs8135982 T C NaN PASS
## 1347 rs116581756 A G NaN PASS
## 1348 rs5771206 G A NaN PASS
## info_AA info_AC info_AN info_DP info_HM2 info_HM3
## <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray>
## 1 T 4 114 3251 0 0
## 2 G 1 106 2676 0 0
## 3 A 6 154 7610 1 1
## ... ... ... ... ... ... ...
## 1346 C 11 142 823 0 0
## 1347 G 1 152 1257 0 0
## 1348 G 1 6 48 0 0
## info_OR info_GP info_BN
## <GDSArray> <GDSArray> <GDSArray>
## 1 1:1115503 132
## 2 1:1115548 132
## 3 1:1120431 88
## ... ... ... ...
## 1346 22:45312345 116
## 1347 22:45312409 132
## 1348 22:50616806 114
sample-related annotation is in DelayedDataFrame
format, and could
be retrieved by colData()
.
colData(ve)
## DelayedDataFrame with 90 rows and 1 column
## family
## <GDSArray>
## NA06984 1328
## NA06985
## NA06986 13291
## ... ...
## NA12890 1463
## NA12891
## NA12892
The gdsfile()
will retrieve the gds file path associated with the
VariantExperiment
object.
gdsfile(ve)
## [1] "/home/biocbuild/bbs-3.10-bioc/R/library/SeqArray/extdata/CEU_Exon.gds"
Some other getter function like metadata()
will return any metadata
that we have saved inside the VariantExperiment
object.
metadata(ve)
## list()
The VariantExperiment
object supports subsetting methods with [
,
$
and subsetByOverlap
etc. as in SummarizedExperiment
.
ve[1:10, 1:5]
## class: VariantExperiment
## dim: 10 5
## metadata(0):
## assays(3): genotype/data phase/data annotation/format/DP/data
## rownames(10): 1 2 ... 9 10
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(5): NA06984 NA06985 NA06986 NA06989 NA06994
## colData names(1): family
$
subsettingThe $
subsetting could be operated directly on colData()
columns,
for easy sample extraction. Note that the colData/rowData
are in the
DelayedDataFrame
format, with each column saved as GDSArray
. So
when doing subsetting, we need to use as.logical()
to convert the
1-dimensional GDSArray
into ordinary vector.
colData(ve)
## DelayedDataFrame with 90 rows and 1 column
## family
## <GDSArray>
## NA06984 1328
## NA06985
## NA06986 13291
## ... ...
## NA12890 1463
## NA12891
## NA12892
ve[, as.logical(ve$family == "1328")]
## class: VariantExperiment
## dim: 1348 2
## metadata(0):
## assays(3): genotype/data phase/data annotation/format/DP/data
## rownames(1348): 1 2 ... 1347 1348
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(2): NA06984 NA06989
## colData names(1): family
subsetting by rowData()
columns.
rowData(ve)
## DelayedDataFrame with 1348 rows and 14 columns
## ID ALT REF QUAL FILTER
## <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray>
## 1 rs111751804 C T NaN PASS
## 2 rs114390380 A G NaN PASS
## 3 rs1320571 A G NaN PASS
## ... ... ... ... ... ...
## 1346 rs8135982 T C NaN PASS
## 1347 rs116581756 A G NaN PASS
## 1348 rs5771206 G A NaN PASS
## info_AA info_AC info_AN info_DP info_HM2 info_HM3
## <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray>
## 1 T 4 114 3251 0 0
## 2 G 1 106 2676 0 0
## 3 A 6 154 7610 1 1
## ... ... ... ... ... ... ...
## 1346 C 11 142 823 0 0
## 1347 G 1 152 1257 0 0
## 1348 G 1 6 48 0 0
## info_OR info_GP info_BN
## <GDSArray> <GDSArray> <GDSArray>
## 1 1:1115503 132
## 2 1:1115548 132
## 3 1:1120431 88
## ... ... ... ...
## 1346 22:45312345 116
## 1347 22:45312409 132
## 1348 22:50616806 114
ve[as.logical(rowData(ve)$REF == "T"),]
## class: VariantExperiment
## dim: 214 90
## metadata(0):
## assays(3): genotype/data phase/data annotation/format/DP/data
## rownames(214): 1 4 ... 1320 1328
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(1): family
VariantExperiment
objects support all of the findOverlaps()
methods and associated functions. This includes subsetByOverlaps()
,
which makes it easy to subset a VariantExperiment
object by an
interval.
ve1 <- subsetByOverlaps(ve, GRanges("22:1-48958933"))
ve1
## class: VariantExperiment
## dim: 23 90
## metadata(0):
## assays(3): genotype/data phase/data annotation/format/DP/data
## rownames(23): 1326 1327 ... 1347 1348
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(1): family
VariantExperiment
objectNote that after the subsetting by [
, $
or Ranged-based operations
, and you feel satisfied with the data for downstream
analysis, you need to save that VariantExperiment
object to
synchronize the gds file (on-disk) associated with the subset of data
(in-memory representation) before any statistical analysis. Otherwise,
an error will be returned.
For example, after we subset the ve
by GRanges("22:1-48958933")
,
and we want to calculate the hwe based on the 23 variants, an error
will be generated indicating that we need to sync the on-disk and
in-memory representations.
hwe(ve1)
## Error in .saveGDSMaybe(gdsfile) : use
## 'saveVariantExperiment()' to synchronize on-disk and
## in-memory representations
VariantExperiment
objectUse the function saveVariantExperiment
to synchronize the on-disk
and in-memory representation. This function writes the processed data
as ve.gds
, and save the R object (which lazily represent the
backend data set) as ve.rds
under the specified directory. It
finally returns a new VariantExperiment
object into current R
session generated from the newly saved data.
a <- tempfile()
ve2 <- saveVariantExperiment(ve1, dir=a, replace=TRUE)
VariantExperiment
objectYou can alternatively use loadVariantExperiment
to load the
synchronized data into R session, by providing only the file
directory. It reads the VariantExperiment
object saved as ve.rds
, as lazy
representation of the backend ve.gds
file under the specific
directory.
ve3 <- loadVariantExperiment(dir=a)
gdsfile(ve3)
## [1] "/tmp/Rtmp79zWzD/file5f7c3bbfc70c/ve.gds"
all.equal(ve2, ve3)
## [1] TRUE
Now we are all set for any downstream analysis as needed.
head(hwe(ve2))
## variant.id nAA nAa naa afreq p f
## 1 1326 88 1 0 0.9943820 1.0000000 -0.005649718
## 2 1327 41 35 13 0.6573034 0.2421774 0.127084209
## 3 1328 39 27 8 0.7094595 0.3951190 0.114950166
## 4 1329 89 1 0 0.9944444 1.0000000 -0.005586592
## 5 1330 69 3 0 0.9791667 1.0000000 -0.021276596
## 6 1331 69 5 0 0.9662162 1.0000000 -0.034965035
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] VariantExperiment_1.0.0 DelayedDataFrame_1.2.0
## [3] GDSArray_1.6.0 gdsfmt_1.22.0
## [5] SummarizedExperiment_1.16.0 DelayedArray_0.12.0
## [7] BiocParallel_1.20.0 matrixStats_0.55.0
## [9] Biobase_2.46.0 GenomicRanges_1.38.0
## [11] GenomeInfoDb_1.22.0 IRanges_2.20.0
## [13] S4Vectors_0.24.0 BiocGenerics_0.32.0
## [15] BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] mitml_0.3-7 Rcpp_1.0.2 lattice_0.20-38
## [4] tidyr_1.0.0 Biostrings_2.54.0 assertthat_0.2.1
## [7] zeallot_0.1.0 digest_0.6.22 pan_1.6
## [10] R6_2.4.0 jomo_2.6-10 backports_1.1.5
## [13] evaluate_0.14 pillar_1.4.2 zlibbioc_1.32.0
## [16] rlang_0.4.1 minqa_1.2.4 nloptr_1.2.1
## [19] rpart_4.1-15 Matrix_1.2-17 rmarkdown_1.16
## [22] splines_3.6.1 lme4_1.1-21 stringr_1.4.0
## [25] GWASExactHW_1.01 RCurl_1.95-4.12 broom_0.5.2
## [28] compiler_3.6.1 xfun_0.10 pkgconfig_2.0.3
## [31] mgcv_1.8-30 htmltools_0.4.0 nnet_7.3-12
## [34] tidyselect_0.2.5 tibble_2.1.3 GenomeInfoDbData_1.2.2
## [37] bookdown_0.14 crayon_1.3.4 dplyr_0.8.3
## [40] MASS_7.3-51.4 bitops_1.0-6 grid_3.6.1
## [43] nlme_3.1-141 lifecycle_0.1.0 magrittr_1.5
## [46] SeqVarTools_1.24.0 stringi_1.4.3 XVector_0.26.0
## [49] SNPRelate_1.20.0 mice_3.6.0 generics_0.0.2
## [52] vctrs_0.2.0 boot_1.3-23 tools_3.6.1
## [55] glue_1.3.1 purrr_0.3.3 survival_2.44-1.1
## [58] yaml_2.2.0 SeqArray_1.26.0 BiocManager_1.30.9
## [61] logistf_1.23 knitr_1.25