Introduction
Public health is an important driving force behind biological and medical research. A major challenge of the post-genomic era is bridging the gap between fundamental biological research and its clinical applications. Recent research has increasingly demonstrated that many seemingly dissimilar diseases share common molecular mechanisms. Understanding similarities among diseases aids early diagnosis and new drug development.
A formal knowledge representation of gene-disease associations is needed for this purpose. Ontologies, such as Gene Ontology (GO), have been successfully applied to represent biological knowledge, and many related techniques have been adopted to extract information. Disease Ontology (DO) (Schriml et al. 2011) was developed to create a consistent description of gene products from a disease perspective, and is essential for supporting functional genomics in a disease context. Accurate disease descriptions can help discover new relationships between genes and diseases, and new functions for previously uncharacterized genes and alleles.
Unlike other clinical vocabularies that define disease-related concepts disparately, DO is organized as a directed acyclic graph, laying the foundation for quantitative computation of disease knowledge.
Here, we present an R package, DOSE (Yu et al. 2015), for analyzing semantic similarities among DO terms and gene products annotated with DO terms.
DO term semantic similarity measurement
Four methods that determine the semantic similarity of two terms based on the Information Content of their common ancestor terms were proposed by Resnik (Resnik 1999), Jiang (Jiang and Conrath 1997), Lin (Lin 1998) and Schlicker (Schlicker et al. 2006). Wang (Wang et al. 2007) presented a method that measures similarity based on the graph structure. Each of these methods has its own strengths and weaknesses. DOSE implements all of them to compute semantic similarity among DO terms and gene products. We have also developed another package, GOSemSim (Yu et al. 2010), to explore functional similarity from the GO perspective, including molecular function (MF), biological process (BP) and cellular component (CC).
For algorithm details, please refer to the vignette of GOSemSim.
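As a rough illustration of the four Information Content (IC) based measures, the sketch below computes them in base R for a toy example. It does not call DOSE. The term names and annotation probabilities are made up, the most informative common ancestor (MICA) is assumed rather than derived from a graph, and the formulas are the commonly cited textbook forms, so actual package output may differ (for example, if IC values are normalized).

```r
# Toy annotation probabilities p(t); hypothetical values for illustration only.
p <- c(parent_AB = 0.30, disease_A = 0.20, disease_B = 0.25)

# Information content: rarer terms are more informative.
ic <- -log(p)

# Assumed toy hierarchy: parent_AB is the most informative common ancestor
# (MICA) of disease_A and disease_B.
t1 <- "disease_A"
t2 <- "disease_B"
mica <- "parent_AB"

# Resnik: IC of the MICA.
sim_resnik <- ic[[mica]]
# Lin: IC of the MICA relative to the IC of the two terms.
sim_lin <- 2 * ic[[mica]] / (ic[[t1]] + ic[[t2]])
# Rel (Schlicker): Lin's measure weighted by the specificity of the MICA.
sim_rel <- sim_lin * (1 - p[[mica]])
# Jiang: one minus the (capped) IC distance between the two terms.
sim_jiang <- 1 - min(1, ic[[t1]] + ic[[t2]] - 2 * ic[[mica]])

round(c(Resnik = sim_resnik, Lin = sim_lin, Rel = sim_rel, Jiang = sim_jiang), 3)
```

All four measures depend on the same three IC values, so they differ only in how the IC of the common ancestor is scaled against the IC of the two terms being compared.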
doSim function
In DOSE, we implemented doSim for calculating semantic similarity between two DO terms and between two sets of DO terms.
library(DOSE)

# Two example sets of DO term IDs
a <- c("DOID:14095", "DOID:5844", "DOID:2044", "DOID:8432", "DOID:9146",
       "DOID:10588", "DOID:3209", "DOID:848", "DOID:3341", "DOID:252")
b <- c("DOID:9409", "DOID:2491", "DOID:4467", "DOID:3498", "DOID:11256")
# Similarity between a single pair of terms
doSim(a[1], b[1], measure="Wang")
## [1] 0.07142995
## [1] 0
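## [1] 0
# Pairwise similarities between every term in a and every term in b
s <- doSim(a, b, measure="Wang")
s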
##             DOID:9409  DOID:2491  DOID:4467  DOID:3498 DOID:11256
## DOID:14095 0.07142995 0.05714393 0.03676194 0.03676194 0.52749870
## DOID:5844  0.14897652 0.11564838 0.02801328 0.02801328 0.06134327
## DOID:2044  0.14897652 0.11564838 0.02801328 0.02801328 0.06134327
## DOID:8432  0.17347273 0.13877811 0.03676194 0.03676194 0.07142995
## DOID:9146  0.07142995 0.05714393 0.03676194 0.03676194 0.17347273
## DOID:10588 0.13240905 0.18401515 0.02208240 0.02208240 0.05452137
## DOID:3209  0.14897652 0.11564838 0.02801328 0.02801328 0.06134327
## DOID:848   0.14897652 0.11564838 0.02801328 0.02801328 0.06134327
## DOID:3341  0.13240905 0.09998997 0.02208240 0.02208240 0.05452137
## DOID:252   0.06134327 0.04761992 0.02801328 0.02801328 0.06134327
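A similarity matrix like the one printed above is an ordinary numeric matrix, so it can be queried with base R. The sketch below is only an illustration: it hard-codes a 3 x 3 excerpt of the values shown above (so that it runs without DOSE) and looks up, for each row term, the most similar column term. The names s_sub and best are ours and are not part of the package.

```r
# A 3 x 3 excerpt of the Wang similarity matrix printed above.
s_sub <- matrix(
  c(0.07142995, 0.05714393, 0.52749870,
    0.17347273, 0.13877811, 0.07142995,
    0.13240905, 0.18401515, 0.05452137),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("DOID:14095", "DOID:8432", "DOID:10588"),
                  c("DOID:9409", "DOID:2491", "DOID:11256"))
)

# For each row term, find the column term with the highest similarity.
best <- colnames(s_sub)[apply(s_sub, 1, which.max)]

data.frame(term = rownames(s_sub),
           closest = best,
           similarity = apply(s_sub, 1, max),
           row.names = NULL)
```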
The doSim function requires three parameters: DOID1, DOID2 and measure. DOID1 and DOID2 should be vectors of DO terms, while measure should be one of Resnik, Jiang, Lin, Rel, and Wang.
We also implement a plot function, simplot, to visualize the similarity result.
simplot(s,
        color.low="white", color.high="red",
        labs=TRUE, digits=2, labs.size=5,
        font.size=14, xlab="", ylab="")
The color.low and color.high parameters set the color gradient; labs is a logical parameter indicating whether to show the similarity values; digits gives the number of decimal places to display; labs.size controls the font size of the similarity values; and font.size sets the font size of the axis text and labels.
Gene semantic similarity measurement
On the basis of semantic similarity between DO terms, DOSE can also compute semantic similarity among gene products. DOSE provides four methods, max, avg, rcmax and BMA, to combine the semantic similarity scores of multiple DO terms. Similarities among genes and gene clusters annotated with multiple DO terms are also calculated with these combining methods. For calculation details, please refer to the vignette of GOSemSim.
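To make the four combining strategies concrete, the following base R sketch applies them to a small, made-up term-by-term similarity matrix. It follows the definitions as we understand them from the GOSemSim documentation (max and avg over all term pairs, rcmax as the larger of the row-wise and column-wise averages of best matches, and BMA as the average over all best matches); the variable names are ours, and the code is an illustration rather than the package's implementation.

```r
# Hypothetical term-by-term similarity matrix: rows are the DO terms annotating
# gene 1, columns are the DO terms annotating gene 2 (values are made up).
m <- matrix(c(0.9, 0.2, 0.1,
              0.3, 0.6, 0.4),
            nrow = 2, byrow = TRUE)

row_best <- apply(m, 1, max)  # best match for each term of gene 1
col_best <- apply(m, 2, max)  # best match for each term of gene 2

combine_max   <- max(m)                                # highest pairwise score
combine_avg   <- mean(m)                               # mean of all pairwise scores
combine_rcmax <- max(mean(row_best), mean(col_best))   # larger of row/column averages
combine_bma   <- mean(c(row_best, col_best))           # best-match average

c(max = combine_max, avg = combine_avg,
  rcmax = combine_rcmax, BMA = combine_bma)
```

Here max rewards a single strong term pair, avg is pulled down by unrelated term pairs, and rcmax and BMA sit in between because they only consider each term's best match.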
geneSim function
In DOSE, we implemented geneSim to measure semantic similarities among genes.
g1 <- c("84842", "2524", "10590", "3070", "91746")
g2 <- c("84289", "6045", "56999", "9869")
geneSim(g1[1], g2[1], measure="Wang", combine="BMA")
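## [1] 0.051
# Pairwise similarities between every gene in g1 and every gene in g2
geneSim(g1, g2, measure="Wang", combine="BMA")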
##       84289  6045 56999  9869
## 84842 0.051 0.135 0.355 0.103
## 2524  0.284 0.172 0.517 0.517
## 10590 0.150 0.173 0.242 0.262
## 3070  0.573 0.517 1.000 1.000
## 91746 0.351 0.308 0.527 0.496
The geneSim function requires four parameters: geneID1, geneID2, measure and combine. geneID1 and geneID2 should be vectors of Entrez gene IDs; measure should be one of Resnik, Jiang, Lin, Rel, and Wang, while combine should be one of max, avg, rcmax and BMA, as described previously.
The simplot function works well with the output of both doSim and geneSim.
clusterSim and mclusterSim
We also implemented clusterSim for calculating semantic similarity between two gene clusters and mclusterSim for calculating semantic similarities among multiple gene clusters.
clusterSim(g1, g2, measure="Wang", combine="BMA")
## [1] 0.549
g3 <- c("57491", "6296", "51438", "5504", "27319", "1643")
clusters <- list(a=g1, b=g2, c=g3)
mclusterSim(clusters, measure="Wang", combine="BMA")
##       a     b     c
## a 1.000 0.549 0.425
## b 0.549 1.000 0.645
## c 0.425 0.645 1.000
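One possible follow-up, not part of DOSE itself, is to convert such a symmetric similarity matrix into distances and group the clusters hierarchically with base R. The sketch below hard-codes the matrix printed above so that it runs without the package; the choice of 1 - similarity as the distance and of average linkage are our assumptions, not recommendations from the package.

```r
# The cluster similarity matrix printed above, entered by hand.
sim <- matrix(c(1.000, 0.549, 0.425,
                0.549, 1.000, 0.645,
                0.425, 0.645, 1.000),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Assumed conversion: distance = 1 - similarity.
d <- as.dist(1 - sim)

# Average-linkage hierarchical clustering (stats package, attached by default).
hc <- hclust(d, method = "average")

hc$merge   # negative entries are original observations, in the order a, b, c
hc$height  # distance at which each merge happens
```

With these values, clusters b and c are merged first because they have the highest similarity (0.645), and a joins them afterwards.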
GO semantic similarity calculation
GO semantic similarity can be calculated with GOSemSim (Yu et al. 2010).
MeSH semantic analysis
MeSH (Medical Subject Headings) is the NLM controlled vocabulary used to manually index articles for MEDLINE/PubMed. The meshes package supports enrichment analysis (hypergeometric test and GSEA) and semantic similarity analysis for more than 70 species.
References
Jiang, Jay J., and David W. Conrath. 1997. “Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy.” Proceedings of 10th International Conference on Research in Computational Linguistics.
Lin, Dekang. 1998. “An Information-Theoretic Definition of Similarity.” In Proceedings of the 15th International Conference on Machine Learning, 296–304.
Resnik, Philip. 1999. “Semantic Similarity in a Taxonomy: An Information-Based Measure and Its Application to Problems of Ambiguity in Natural Language.” Journal of Artificial Intelligence Research 11:95–130.
Schlicker, Andreas, Francisco S Domingues, Jörg Rahnenführer, and Thomas Lengauer. 2006. “A New Measure for Functional Similarity of Gene Products Based on Gene Ontology.” BMC Bioinformatics 7:302.
Schriml, L. M., C. Arze, S. Nadendla, Y.-W. W. Chang, M. Mazaitis, V. Felix, G. Feng, and W. A. Kibbe. 2011. “Disease Ontology: A Backbone for Disease Semantic Integration.” Nucleic Acids Research 40 (D1):D940–D946. https://doi.org/10.1093/nar/gkr972.
Wang, James Z, Zhidian Du, Rapeeporn Payattakool, Philip S Yu, and Chin-Fu Chen. 2007. “A New Method to Measure the Semantic Similarity of GO Terms.” Bioinformatics (Oxford, England) 23 (May):1274–81.
Yu, Guangchuang, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu, and Shengqi Wang. 2010. “GOSemSim: An R Package for Measuring Semantic Similarity Among GO Terms and Gene Products.” Bioinformatics 26 (April):976–78. https://doi.org/10.1093/bioinformatics/btq064.
Yu, Guangchuang, Li-Gen Wang, Guang-Rong Yan, and Qing-Yu He. 2015. “DOSE: An R/Bioconductor Package for Disease Ontology Semantic and Enrichment Analysis.” Bioinformatics 31 (4):608–9. https://doi.org/10.1093/bioinformatics/btu684.