MOFA 1.2.0
MOFA is a factor analysis model that provides a general framework for the integration of multi-omic data sets in a completely unsupervised fashion. Intuitively, MOFA can be viewed as a versatile and statistically rigorous generalization of principal component analysis (PCA) to multi-omics data. Given several data matrices with measurements of multiple ‘omics data types on the same or on overlapping sets of samples, MOFA infers an interpretable low-dimensional data representation in terms of (hidden) factors. These learnt factors represent the driving sources of variation across data modalities, thus facilitating the identification of cellular states or disease subgroups.
Once trained, the model output can be used for a range of downstream analyses, including the visualisation of samples in factor space, the automatic annotation of factors using (gene set) enrichment analysis, the identification of outliers (e.g. due to sample swaps) and the imputation of missing values.
For more details on the methods you can read our paper at http://msb.embopress.org/cgi/doi/10.15252/msb.20178124.
MOFA
can be installed from Bioconductor using
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("MOFA", version = "3.9")
The development version of the package can be installed from github using
devtools::install_github("bioFAM/MOFA", build_opts = c("--no-resave-data"))
As the package depends on the Pyhton package mofapy
, this needs to be installed alongside with the R package. For this you can use the command pip install mofapy
.
For illustration of MOFA and gene set enrichment analyses we make use of data contained in the accompanying data package MOFAdata
. This can be installed from Bioconductor using
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("MOFAdata", version = "3.9")
The development version of the package can be installed from github using
devtools::install_github("bioFAM/MOFAdata", build_opts = c("--no-resave-data"))
Before running MOFA, you need to make sure that reticulate
is pointing to the correct python binary or conda environment.
This can become tricky when you have multiple conda environments and versions of Python installed:
library(reticulate)
# Using a specific python binary
use_python("/home/user/python", required = TRUE)
# Using a conda enviroment called "r-reticulate"
use_condaenv("r-reticulate", required = TRUE)
# Using a virtual environment called "r-reticulate"
use_virtualenv("r-reticulate", required = TRUE)
See also our FAQ section and read more about the reticulate package and how it integrates Python and R.
To load the package and the accompanying data package you can use:
library(MOFA)
library(MOFAdata)
The workflow of MOFA consists of two steps:
(1) Fitting step: train the model with the multi-omics data to disentangle the heterogeneity into a small number of latent factors.
(2) Downstream analysis: once the factors are inferred they need to be characterised as technical or biological sources of variation by looking at the corresponding weights, doing (gene set) enrichment analysis, plotting the factors, correlating factors with known covariates, etc. Also, one can do imputation of missing values and prediction of clinical outcomes using the latent factors.
First you need to create the MOFA object with your input data, and subsequently you need to train the model. Everything is explained in the vignettes.
If everything is successful, you should observe an output analogous to the following:
###########################################################
### __ __ ___ _____ _ ###
### | \/ |/ _ \| ___/ \ ###
### | |\/| | | | | |_ / _ \ ###
### | | | | |_| | _/ ___ \ ###
### |_| |_|\___/|_|/_/ \_\ ###
### ###
###########################################################
##################
## Loading data ##
##################
Loaded /Users/ricard/MOFA/MOFA/test/data/500_0.txt with dim (100,500)...
Loaded /Users/ricard/MOFA/MOFA/test/data/500_1.txt with dim (100,500)...
Loaded /Users/ricard/MOFA/MOFA/test/data/500_2.txt with dim (100,500)...
#############################################
## Running trial number 1 with seed 642034 ##
#############################################
Trial 1, Iteration 1: time=0.08 ELBO=-345954.96, Factors=10, Covariates=1
Trial 1, Iteration 2: time=0.10 ELBO=-283729.31, deltaELBO=62225.6421, Factors=10
Trial 1, Iteration 3: time=0.10 ELBO=-257427.42, deltaELBO=26301.8893, Factors=10
...
Trial 1, Iteration 100: time=0.07 ELBO=-221171.01, deltaELBO=0.0998, Factors=10
Converged!
There are two important quantities to keep track of:
MOFA disentangles the heterogeneity of a high-dimensional multi-omics data set into a set of latent factors that capture global sources of variation. Importantly, these factors can have different activity patterns in different omics. For example, a batch effect might be affecting the RNA data but not the Methylation data.
Once the heterogeneity of the data set is reduced into a set of factors, you need to understand what are they, and whether they capture technical or biological sources of variability.
We have built a semi-automated pipeline based on our experience annotating factors:
(1) Visualisation of the samples in the factor space: similarly to what is done in Principal Component Analysis, it is useful to plot the factors against each other and color the samples using known covariates such as batch, sex, clinical information, etc.
(2) Inspection of top weighted features: for example, if a factor is associated to the sex of the individual, the mRNA data will have very high loadings for genes located in the X and Y chromosomes.
(3) Feature set enrichment analysis: particularly when having large amounts of features, the inspection of loadings is challenging, and doing gene ontology enrichment analysis can be useful.
Please refer to the vignettes or the paper for details on the different analysis.
The latent factors can be used for several purposes, such as:
(1) Dimensionality reduction: similar to PCA, dimensionality reduction visualisations can be obtained by plotting the Factors against each other.
(2) Imputation: Factors can be used to predict missing values, including entire missing assays.
(3) Predicting clinical response: if the factors capture phenotypical information, they can capture clinical covariates of interest.
(4) Regressing out technical effects: if a factor is capturing an undesired technical effect, its effect can be regressed out from your original data matrix.
Please refer to the vignettes or the paper for details on the different analysis.
We illustrate the use of MOFA on various example data sets:
vignette("MOFA_example_simulated")
vignette("MOFA_example_CLL")
vignette("MOFA_example_scMT")
(Q) How do I normalise the data?
Always try to remove any technical source of variability before fitting the model.
For example, for count-based data such as RNA-seq or ATAC-seq we recommend size factor normalisation + variance stabilisation. For microarray DNA methylation data, make sure that samples have no differences in the average intensity.
If this is not done correctly, the model will learn a very strong Factor 1 that will capture this variability, and more subtle sources of variation will be harder to identify.
We have implemented a function called regressCovariates
that allows the user to regress out a covariate using a simple linear models. See the documentation and the CLL vignette for examples.
(Q) I get the following error when installing the R package:
ERROR: dependencies 'pcaMethods', 'MultiAssayExperiment' are not available for package 'MOFA'
These two packages are available from Bioconductor, not CRAN. You can install them from R as follows:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("MultiAssayExperiment", "pcaMethods"))
(Q) I get one of the following errors when running MOFA:
AttributeError: 'module' object has no attribute 'core.entry_point
Error in py_module_import(module, convert = convert) :
ModuleNotFoundError: No module named 'mofapy'
First thing: restart R and try again. If the error still holds, this means that either:
which python
in the terminal. You can test if the mofa packaged is installed by running INSIDE python: import mofapy
.Once everything is figured out, specify the following at the beginning of your R script:
library(reticulate)
use_python("YOUR_PYTHON_PATH", required=TRUE) # fill in YOUR_PYTHON_PATH
You can also use use_conda
instead of use_python
if you work with conda environments. Read more about the reticulate package and how it integrates Python and R
(Q) I hate R, can I do MOFA only with Python?
Nope. You can use Python to train the model, see this template script. However, we currently do not provide downstream analysis functions in Python. We strongly recommend that you use our MOFA R package for this.
(Q) How many factors should I use?
Similar to Principal Component Analysis and other latent variable models, this is a hard question to answer. It depends on the data set and the aim of the analysis. As a general rule, the bigger the data set, the higher the number of factors that you will retrieve, and the less the variance that will be explained per factor.
If you want to get an overview on the major sources of variability then use a small number of factors (K<=15). If you want to capture small sources of variability, for example to do imputation or eQTL mapping, then go for a large number of factors (K>25).
(Q) How many samples do I need?
At least more than 15. Otherwise the model will not generate meaningful results.
(Q) Can MOFA automatically learn the number of factors?
Yes, but the user needs to specify a minimum value of % variance explained. Then, MOFA will actively remove factors (during training) that explain less than the specified amount of variance.
If you have no idea on what to expect, it is better to start with a fixed number of factors and set the % variance threshold to 0.
(Q) Can I put known covariates in the model?
Combining known covariates with latent factors is technically possible, but we extensively tested this functionality and it was not yielding good results. The reason is that covariates are usually discrete labels that do not reflect the underlying molecular biology. For example, if you introduce age as a covariate, but the actual age is different from the “molecular age”, the model will simply learn a new factor that corresponds to this “latent” molecular age, and it will drop the covariate from the model.
We recommend that you learn the factors in a completely unsupervised manner and then relate them to the biological covariates via visualisation or via a simple correlation analysis (see our vignettes). If your covariate of interest is indeed an important driver of variability, do not worry, MOFA will find it!
(Q) Should I remove undesired sources of variability (i.e. batch effects) before fitting the model?
Yes, if you have clear technical factors, we strongly encourage to regress it out a priori using a simple linear model. The reason for this is that the model will “focus” on the huge variability driven by the technical factors, and smaller sources of variability could be missed.
You can regress out known covaraites using the function regressCovariates
. See the corresponding documentation and the CLL vignette for details.
(Q) Should I do any filtering to the input data?
You must remove features with zero variance and ideally also features with low variance, as they can cause numerical issues in the model. In practice we generally select the top N most variable features per assay
(Q) My data sets have different dimensionalities, does this matter?
Yes, this is important. Bigger data modalities will tend to be overrepresent in the MOFA model. It is good practice to filter features (based for example on variance, as lowly variable features provide little information) in order to have the different dimensionalities within the same order of magnitudes. If this is unavoidable, take into account that the model has the risk of missing (small) sources of variation unique to the small data set.
(Q) The weights have different values between runs. Is this expected?
This is normal and it happens because of to two reasons. The first one is that the model does not always converge to the same exact solution (see below in the FAQ), although different model instances should be pretty similar. The second reason is that factor analysis models are rotation invariant. This means that you can rotate your factors and your weights and still find the same solution. This implies that the signs of the weight or the factors can NOT be compared across trials, only within a trial.
(Q) What data modalities can MOFA cope with?
(Q) How do I assess convergence?
MOFA is trained using variational bayes, a fast inference framework that consists on optimising a statistica called the Evidence Lower Bound (ELBO). The model uses the change in ELBO (deltaELBO) to assess convergence. A model is defined to be converged when deltaELBO is close to 0. For a quick exploratory analysis, we suggest a convergence threshold between 1 to 10.
(Q) The model does not converge smoothly, and it oscillates between positive and negative deltaELBO values
First, check that you are using the right likelihood model (see above). Second, make sure that you have no features or samples that are full of missing values. Third, check that you have no features with zero (or very little) variance. If the problem does not disappear, please contact us via mail or the Slack group, we will provide (quick!) help.
(Q) What input formats are allowed?
The data has to be input in two possible formats:
(Q) Does MOFA always converge to the same solutions?
No, as occurs in most complex Bayesian models, they are not guaranteed to always converge to the same (optimal) solution.
In practice, however, we observed that the solutions are highly consistent, particularly for strong factors. However, one should always assess the robustness and do a proper model selection. For this we recommend to train the model multiple times and check the robustness of the factors across the different solutions. For downstream analysis a single model can be chosen based on the best value of the Evidence Lower Bound (ELBO). We provide functions for these two steps, which are explained in the vignette vignette(MOFA_example_simulated)
.
(Q) How does MOFA handle missing values?
It simpy ignores them, there is no a priori imputation step required. In fact, matrix factorisation models are known to be very robust to the presence of large amounts of missing values.
(Q) How can I do Gene Set Enrichment Analysis?
First, you need to create your binary gene set matrix where rows are feature sets and columns are features (genes). We have manually processed some of Reactome and MSigDB gene sets for mouse and human. Contact us if you would like to use the data.
Then, you will have to choose a local statistic per feature (the loading, by default), a global statistic per pathway (average loading, by default), and a statistical test. The most trustworthy one is a permutation test with a long number of iterations, but this is slow and a fast parametric tests is also available. However, note that it tends to inflate the p-values due to the correlation structure between related genes (see for example Gatti2010).
MOFAmodel it is the main S4 class used to store all relevant data to analyse a MOFA model. Its slots are the following (accessible using @ or the corresponding accessor function):
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] BiocManager_1.30.9 compiler_3.6.1 magrittr_1.5
## [4] bookdown_0.14 htmltools_0.4.0 tools_3.6.1
## [7] yaml_2.2.0 Rcpp_1.0.2 stringi_1.4.3
## [10] rmarkdown_1.16 knitr_1.25 stringr_1.4.0
## [13] digest_0.6.22 xfun_0.10 rlang_0.4.1
## [16] evaluate_0.14