waddR package
waddR is an R package that provides a 2-Wasserstein distance based statistical test for detecting and describing differential distributions in one-dimensional data. Functions for Wasserstein distance calculation, differential distribution testing, and a specialized test for differential expression in scRNA-seq data are provided.
The package waddR provides three sets of utilities to cover distinct use cases, each described in a separate vignette:
Fast and accurate calculation of the 2-Wasserstein distance
Two-sample test to check for differences between two distributions
Detect differential gene expression distributions in scRNAseq data
These are bundled into the same package because they are internally dependent: the procedure for detecting differential distributions in single-cell data is a refinement of the general two-sample test, which itself uses the 2-Wasserstein distance to compare two distributions.
The 2-Wasserstein distance is a metric to describe the distance between two distributions, representing two different conditions A and B. This package specifically considers the squared 2-Wasserstein distance d := W^2, which offers a decomposition into location, size, and shape terms.
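For reference, the decomposition into location, size, and shape terms can be written as below. The symbol names are chosen here for illustration: \mu and \sigma denote the means and standard deviations of the two distributions, and \rho^{A,B} denotes the Pearson correlation of their quantiles.

```latex
d := W^2
  = \underbrace{(\mu_A - \mu_B)^2}_{\text{location}}
  + \underbrace{(\sigma_A - \sigma_B)^2}_{\text{size}}
  + \underbrace{2\,\sigma_A \sigma_B \left(1 - \rho^{A,B}\right)}_{\text{shape}}
```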
The package waddR offers three functions to calculate the 2-Wasserstein distance, all of which are implemented in C++ and exported to R with Rcpp for better performance. The function wasserstein_metric is a C++ reimplementation of the function wasserstein1d from the package transport and offers the most exact results. The functions squared_wass_approx and squared_wass_decomp compute approximations of the squared 2-Wasserstein distance, with squared_wass_decomp also returning the decomposition terms for location, size, and shape. See ?wasserstein_metric, ?squared_wass_approx, and ?squared_wass_decomp.
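To make the idea of a quantile-based approximation and its decomposition concrete, here is a minimal base-R sketch. It is not the implementation used by waddR: the function name squared_wass_sketch, the n_quantiles parameter, and the choice of midpoint probabilities are assumptions made for illustration, and the package functions may compute the individual terms differently.

```r
# Illustrative base-R sketch; NOT the waddR implementation.
# Approximates the squared 2-Wasserstein distance between two samples by
# comparing their empirical quantile functions on a regular grid.
squared_wass_sketch <- function(x, y, n_quantiles = 1000) {
  # Midpoint probabilities avoid the extreme quantiles at 0 and 1.
  probs <- (seq_len(n_quantiles) - 0.5) / n_quantiles
  qx <- quantile(x, probs = probs, names = FALSE)
  qy <- quantile(y, probs = probs, names = FALSE)

  # Direct approximation: mean squared difference of the quantiles.
  distance <- mean((qx - qy)^2)

  # Decomposition using means, (population) standard deviations, and the
  # Pearson correlation of the two quantile vectors.
  pop_sd <- function(v) sqrt(mean((v - mean(v))^2))
  location <- (mean(qx) - mean(qy))^2
  size <- (pop_sd(qx) - pop_sd(qy))^2
  shape <- 2 * pop_sd(qx) * pop_sd(qy) * (1 - cor(qx, qy))

  list(distance = distance, location = location, size = size, shape = shape)
}

# Example: two simulated samples that differ in location and size.
set.seed(42)
a <- rnorm(500, mean = 0, sd = 1)
b <- rnorm(500, mean = 1, sd = 2)
res <- squared_wass_sketch(a, b)
```

In this sketch the three terms add up to the direct approximation, because all of them are computed from the same quantile vectors. With waddR installed, the corresponding package call would be squared_wass_decomp(a, b).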
This package provides two testing procedures using the 2-Wasserstein distance to test whether two distributions F_A and F_B, given in the form of samples, are different by specifically testing the null hypothesis H0: F_A = F_B against the alternative hypothesis H1: F_A != F_B.
The first, semi-parametric (SP) procedure uses a permutation-based test combined with a generalized Pareto distribution approximation to estimate small p-values accurately.
The second procedure (ASY) uses a test based on asymptotic theory, which is valid only if the samples can be assumed to come from continuous distributions.
See ?wasserstein.test for more details.
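The permutation part of the semi-parametric procedure can be illustrated with a plain permutation test in base R. This is a simplified sketch, not the package's implementation: it omits the generalized Pareto approximation entirely, and the names squared_wass_stat and permutation_pvalue as well as the default of 999 permutations are assumptions made for illustration.

```r
# Illustrative permutation test; NOT the waddR implementation.
# The test statistic is a quantile-based approximation of the squared
# 2-Wasserstein distance between the two samples.
squared_wass_stat <- function(x, y, n_quantiles = 1000) {
  probs <- (seq_len(n_quantiles) - 0.5) / n_quantiles
  qx <- quantile(x, probs = probs, names = FALSE)
  qy <- quantile(y, probs = probs, names = FALSE)
  mean((qx - qy)^2)
}

permutation_pvalue <- function(x, y, n_perm = 999) {
  observed <- squared_wass_stat(x, y)
  pooled <- c(x, y)
  n_x <- length(x)
  # Under H0: F_A = F_B the group labels are exchangeable, so the pooled
  # values are repeatedly split at random into two groups of the original sizes.
  perm_stats <- replicate(n_perm, {
    idx <- sample.int(length(pooled), n_x)
    squared_wass_stat(pooled[idx], pooled[-idx])
  })
  # Add-one correction so the estimate is never exactly zero.
  (1 + sum(perm_stats >= observed)) / (n_perm + 1)
}

# Example: two simulated samples with a location shift.
set.seed(1)
x <- rnorm(100)
y <- rnorm(100, mean = 1)
p_shifted <- permutation_pvalue(x, y, n_perm = 499)
```

The smallest p-value such a test can report is 1 / (n_perm + 1), which is why an approximation of the tail of the permutation distribution is useful when small p-values need to be estimated accurately.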
The package also provides a semi-parametric testing procedure based on the 2-Wasserstein distance that is specifically tailored to identify differential distributions in single-cell RNA-sequencing (scRNA-seq) data. In particular, a two-stage (TS) approach has been implemented that takes account of the specific nature of scRNA-seq data by separately testing for differential proportions of zero gene expression (using a logistic regression model) and differences in non-zero gene expression (using the semi-parametric 2-Wasserstein distance-based test) between two conditions.
See the documentation of the single cell procedure ?wasserstein.sc and the test for zero expression levels ?testZeroes for more details.
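The two-stage idea can be sketched in base R for a single simulated gene. This is an assumption-laden illustration rather than the package's method: it uses an ordinary glm with a likelihood-ratio test, whereas the model fitted by testZeroes may differ, and the names zero_proportion_pvalue and split_nonzero as well as the simulated data are hypothetical.

```r
# Illustrative two-stage sketch; NOT the waddR implementation.
# Stage 1: test for differential proportions of zero expression with a
# logistic regression and a likelihood-ratio test against the null model.
zero_proportion_pvalue <- function(expr, condition) {
  is_expressed <- as.integer(expr > 0)
  full <- glm(is_expressed ~ condition, family = binomial())
  null <- glm(is_expressed ~ 1, family = binomial())
  anova(null, full, test = "Chisq")[2, "Pr(>Chi)"]
}

# Stage 2 input: keep only the non-zero expression values, split by
# condition, so they can be passed on to a two-sample distribution test.
split_nonzero <- function(expr, condition) {
  keep <- expr > 0
  split(expr[keep], condition[keep])
}

# Hypothetical gene: 20% zeros in condition A, 60% zeros in condition B,
# log-normal expression otherwise.
set.seed(1)
condition <- factor(rep(c("A", "B"), each = 200))
is_zero <- rbinom(400, size = 1, prob = rep(c(0.2, 0.6), each = 200)) == 1
expr <- ifelse(is_zero, 0, rlnorm(400))

p_zero <- zero_proportion_pvalue(expr, condition)
nonzero <- split_nonzero(expr, condition)
```

Separating the two stages keeps the excess of zeros typical for scRNA-seq data from dominating the comparison of the non-zero expression distributions.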
To install waddR from Bioconductor, use BiocManager with the following commands:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("waddR")
Using BiocManager, the package can also be installed directly from GitHub.
The package waddR can then be loaded in R:
library(waddR)
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] waddR_1.0.1
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.16.1 tidyselect_1.0.0
#> [3] xfun_0.12 purrr_0.3.3
#> [5] splines_3.6.3 lattice_0.20-40
#> [7] vctrs_0.2.4 htmltools_0.4.0
#> [9] stats4_3.6.3 BiocFileCache_1.10.2
#> [11] yaml_2.2.1 blob_1.2.1
#> [13] rlang_0.4.5 nloptr_1.2.2.1
#> [15] pillar_1.4.3 glue_1.3.2
#> [17] DBI_1.1.0 BiocParallel_1.20.1
#> [19] rappdirs_0.3.1 SingleCellExperiment_1.8.0
#> [21] BiocGenerics_0.32.0 bit64_0.9-7
#> [23] dbplyr_1.4.2 matrixStats_0.56.0
#> [25] GenomeInfoDbData_1.2.2 stringr_1.4.0
#> [27] zlibbioc_1.32.0 coda_0.19-3
#> [29] memoise_1.1.0 evaluate_0.14
#> [31] Biobase_2.46.0 knitr_1.28
#> [33] IRanges_2.20.2 GenomeInfoDb_1.22.0
#> [35] parallel_3.6.3 curl_4.3
#> [37] Rcpp_1.0.4 arm_1.10-1
#> [39] DelayedArray_0.12.2 S4Vectors_0.24.3
#> [41] XVector_0.26.0 abind_1.4-5
#> [43] bit_1.1-15.2 lme4_1.1-21
#> [45] digest_0.6.25 stringi_1.4.6
#> [47] dplyr_0.8.5 GenomicRanges_1.38.0
#> [49] grid_3.6.3 tools_3.6.3
#> [51] bitops_1.0-6 magrittr_1.5
#> [53] RCurl_1.98-1.1 tibble_2.1.3
#> [55] RSQLite_2.2.0 crayon_1.3.4
#> [57] pkgconfig_2.0.3 MASS_7.3-51.5
#> [59] Matrix_1.2-18 minqa_1.2.4
#> [61] assertthat_0.2.1 rmarkdown_2.1
#> [63] httr_1.4.1 boot_1.3-24
#> [65] R6_2.4.1 nlme_3.1-145
#> [67] compiler_3.6.3