BgeeDB
is a collection of functions to import data from the Bgee database (http://bgee.org/) directly into R, and to facilitate downstream analyses, such as gene set enrichment test based on expression of genes in anatomical structures. Bgee provides annotated and processed expression data and expression calls from curated wild-type healthy samples, from humans and many animals.
The package retrieves the annotation of RNA-seq or Affymetrix experiments integrated into the Bgee database, and downloads into R the data reprocessed by the Bgee pipeline. It works for all the species in Bgee. The package also allows to run GO-like enrichment analyses based on anatomical terms, where genes are mapped to anatomical terms by expression patterns. This is the same as the TopAnat web-service available at (http://bgee.org/?page=top_anat#/), but with more flexibility in the choice of parameters and developmental stages, and is based on the topGO
package.
This package allows: 1. Listing annotation files gene of expression data available in the current version of Bgee database 2. Downloading the processed gene expression data available in the current version of Bgee database 3. Downloading the gene expression calls and annotations and using them to perform TopAnat analyses
library(BgeeDB)
listBgeeSpecies()
## Last update: Jul11:54
## $`Species in the Database`
## ID GENUS SPECIES COMMON
## 1 9606 Homo sapiens human
## 2 10090 Mus musculus mouse
## 3 7955 Danio rerio zebrafish
## 4 7227 Drosophila melanogaster fruitfly
## 5 6239 Caenorhabditis elegans c.elegans
## 6 9598 Pan troglodytes chimpanzee
## 7 9597 Pan paniscus bonobo
## 8 9593 Gorilla gorilla gorilla
## 9 9544 Macaca mulatta macacque
## 10 10116 Rattus norvegicus rat
## 11 9913 Bos taurus cow
## 12 9823 Sus scrofa pig
## 13 13616 Monodelphis domestica opossum
## 14 9258 Ornithorhynchus anatinus platypus
## 15 9031 Gallus gallus chicken
## 16 28377 Anolis carolinensis anolis
## 17 8364 Xenopus tropicalis xenopus
## 18 9600 Pongo pygmaeus orangutan
## 19 99883 Tetraodon nigroviridis tetraodon
##
## $RNAseq
## [1] "Anolis_carolinensis" "Bos_taurus"
## [3] "Caenorhabditis_elegans" "Gallus_gallus"
## [5] "Gorilla_gorilla" "Homo_sapiens"
## [7] "Macaca_mulatta" "Monodelphis_domestica"
## [9] "Mus_musculus" "Ornithorhynchus_anatinus"
## [11] "Pan_paniscus" "Pan_troglodytes"
## [13] "Rattus_norvegicus" "Sus_scrofa"
## [15] "Xenopus_tropicalis"
##
## $Affymetrix
## [1] "Caenorhabditis_elegans" "Danio_rerio"
## [3] "Drosophila_melanogaster" "Homo_sapiens"
## [5] "Mus_musculus"
From the list above, please choose one species (e.g., “Homo_sapiens”, “Mus_musculus”,…) and platform (“rna_seq” or “affymetrix”).
bgee <- Bgee$new(species = "Mus_musculus", datatype = "rna_seq")
The bgee$get_annotation()
will output the list of experiments and libraries currently available in Bgee for RNA-seq of Mus musculus. The bgee$get_annotation()
loads the annotation in R, but also creates the Mus musculus folder in your current path, where it saves the downloaded annotation locally, so you can use the annotation for later as well.
# check the path where your folder with annotation will be saved. The folder is named after your chosen species.
getwd()
annotation_bgee_mouse <- bgee$get_annotation()
## Downloading annotation files...
## Saved annotation files in Mus_musculus folder.
# head the experiments and libraries
lapply(annotation_bgee_mouse, head)
## $sample_annotation
## Experiment ID Library ID Library secondary ID Anatomical entity ID
## 1 GSE30617 GSM759583 ERX012363 UBERON:0000948
## 2 GSE30617 GSM759584 ERX012348 UBERON:0000948
## 3 GSE30617 GSM759585 ERX012344 UBERON:0000948
## 4 GSE30617 GSM759586 ERX012362 UBERON:0000948
## 5 GSE30617 GSM759587 ERX012378 UBERON:0000948
## 6 GSE30617 GSM759588 ERX012374 UBERON:0000948
## Anatomical entity name Stage ID Stage name
## 1 heart MmusDv:0000052 8 weeks (mouse)
## 2 heart MmusDv:0000052 8 weeks (mouse)
## 3 heart MmusDv:0000052 8 weeks (mouse)
## 4 heart MmusDv:0000052 8 weeks (mouse)
## 5 heart MmusDv:0000052 8 weeks (mouse)
## 6 heart MmusDv:0000052 8 weeks (mouse)
## Platform ID Library type Library orientation Read count
## 1 Illumina Genome Analyzer II paired unstranded 31000737
## 2 Illumina Genome Analyzer II paired unstranded 8605668
## 3 Illumina Genome Analyzer II paired unstranded 30075234
## 4 Illumina Genome Analyzer II paired unstranded 29498377
## 5 Illumina Genome Analyzer II paired unstranded 27824366
## 6 Illumina Genome Analyzer II paired unstranded 31160789
## Left part mapped read count/Mapped read count
## 1 23674754
## 2 7099912
## 3 24308959
## 4 24792064
## 5 19632932
## 6 21342634
## Right part mapped read count Min. read length Max. read length
## 1 23809225 76 76
## 2 7083684 76 76
## 3 23884114 76 76
## 4 24122931 76 76
## 5 20578789 76 76
## 6 23529915 76 76
## All genes percent present Protein coding genes percent present
## 1 50.73 31.84
## 2 41.94 25.69
## 3 50.70 31.66
## 4 49.12 30.51
## 5 50.24 31.51
## 6 50.32 31.72
## Intergenic regions percent present Run IDs Discarded run IDs
## 1 0.98 ERR032227 NA
## 2 1.17 ERR032228 NA
## 3 0.95 ERR032229 NA
## 4 0.87 ERR032238 NA
## 5 0.97 ERR032230 NA
## 6 1.04 ERR032231 NA
## Data source Data source URL
## 1 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759583
## 2 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759584
## 3 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759585
## 4 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759586
## 5 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759587
## 6 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM759588
## Bgee normalized data URL
## 1 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 2 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 3 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 4 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 5 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 6 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## Raw file URL
## 1 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759583
## 2 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759584
## 3 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759585
## 4 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759586
## 5 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759587
## 6 http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=GSM759588
##
## $experiment_annotation
## Experiment ID Library count Condition count Organ count Stage count
## 1 GSE30617 36 6 6 1
## 2 GSE41637 26 9 9 1
## 3 GSE30352 17 6 6 1
## 4 GSE36026 12 12 12 3
## 5 GSE43520 9 5 4 2
## 6 GSE41338 6 5 5 1
## Data source Data source URL
## 1 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30617
## 2 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41637
## 3 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30352
## 4 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36026
## 5 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE43520
## 6 GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41338
## Bgee normalized data URL
## 1 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30617.tsv.zip
## 2 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE41637.tsv.zip
## 3 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE30352.tsv.zip
## 4 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE36026.tsv.zip
## 5 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE43520.tsv.zip
## 6 ftp://lausanne.isb-sib.ch/pub/databases/Bgee/bgee_v13_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_RPKM_GSE41338.tsv.zip
## Experiment name
## 1 [E-MTAB-599] Mouse Transcriptome
## 2 Evolutionary dynamics of gene and isoform regulation in mammalian tissues
## 3 The evolution of gene expression levels in mammalian organs
## 4 RNA-seq from ENCODE/LICR
## 5 The evolution of lncRNA repertoires and expression patterns in tetrapods
## 6 The evolutionary landscape of alternative splicing in vertebrate species
## Experiment description
## 1 Sequencing the transcriptome of DBAxC57BL/6J mice. To study the regulation of transcription, splicing and RNA turnover we have sequenced the transcriptomes of tissues collected DBAxC57BL/6J mice.
## 2 Most mammalian genes produce multiple distinct mRNAs through alternative splicing, but the extent of splicing conservation is not clear. To assess tissue-specific transcriptome variation across mammals, we sequenced cDNA from 9 tissues from 4 mammals and one bird in biological triplicate, at unprecedented depth. We find that while tissue-specific gene expression programs are largely conserved, alternative splicing is well conserved in only a subset of tissues and is frequently lineage-specific. Thousands of novel, lineage-specific and conserved alternative exons were identified; widely conserved alternative exons had signatures of binding by MBNL, PTB, RBFOX, STAR and TIA family splicing factors, implicating them as ancestral mammalian splicing regulators. Our data also indicates that alternative splicing is often used to alter protein phosphorylatability, delimiting the scope of kinase signaling.
## 3 Changes in gene expression are thought to underlie many of the phenotypic differences between species. However, large-scale analyses of gene expression evolution were until recently prevented by technological limitations. Here we report the sequencing of polyadenylated RNA from six organs across ten species that represent all major mammalian lineages (placentals, marsupials and monotremes) and birds (the evolutionary outgroup), with the goal of understanding the dynamics of mammalian transcriptome evolution. We show that the rate of gene expression evolution varies among organs, lineages and chromosomes, owing to differences in selective pressures: transcriptome change was slow in nervous tissues and rapid in testes, slower in rodents than in apes and monotremes, and rapid for the X chromosome right after its formation. Although gene expression evolution in mammals was strongly shaped by purifying selection, we identify numerous potentially selectively driven expression switches, which occurred at different rates across lineages and tissues and which probably contributed to the specific organ biology of various mammals. Our transcriptome data provide a valuable resource for functional and evolutionary analyses of mammalian genomes.
## 4 Using RNA-Seq (Mortazavi et al., 2008), high-resolution genome-wide maps of the mouse transcriptome across multiple mouse (C57Bl/6) tissues and primary cells were generated.
## 5 To broaden our understanding of lncRNA evolution, we used an extensive RNA-seq dataset to establish lncRNA repertoires and homologous gene families in 11 tetrapod species. We analyzed the poly- adenylated transcriptomes of 8 organs (cortex/whole brain without cerebellum, cerebellum, heart, kidney, liver, placenta, ovary and testis) and 11 species (human, chimpanzee, bonobo, gorilla, orangutan, macaque, mouse, opossum, platypus, chicken and the frog Xenopus tropicalis), which shared a common ancestor ~370 millions of years (MY) ago. Our dataset included 47 strand-specific samples, which allowed us to confirm the orientation of gene predictions and to address the evolution of sense-antisense transcripts. See also GSE43721 (Soumillon et al, Cell Reports, 2013) for three strand-specific samples for mouse brain, liver and testis.
## 6 How species with similar repertoires of protein coding genes differ so dramatically at the phenotypic level is poorly understood. From comparing the transcriptomes of multiple organs from vertebrate species spanning ~350 million years of evolution, we observe significant differences in alternative splicing complexity between the main vertebrate lineages, with the highest complexity in the primate lineage. Moreover, within as little as six million years, the splicing profiles of physiologically-equivalent organs have diverged to the extent that they are more strongly related to the identity of a species than they are to organ type. Most vertebrate species-specific splicing patterns are governed by the highly variable use of a largely conserved cis-regulatory code. However, a smaller number of pronounced species-dependent splicing changes are predicted to remodel interactions involving factors acting at multiple steps in gene regulation. These events are expected to further contribute to the dramatic diversification of alternative splicing as well as to other gene regulatory changes that contribute to phenotypic differences among vertebrate species.
The bgee$get_data()
will download read counts and RPKMs for Mus musculus from all available experiments in Bgee database as a list (see below). In case of downloaded data from all experiments for Mus musculus, bgee$get_data()
will save the downloaded data in your current folder for later usage.
# download all RPKMs and counts for Mus musculus
data_bgee_mouse <- bgee$get_data()
## The experiment is not defined. Hence taking all rna_seq available for Mus_musculus .
## Downloading expression data...
## Saved expression data file in Mus_musculus folder.
## Unzipping file...
##
Read 61.7% of 1410444 rows
Read 1410444 rows and 13 (of 13) columns from 0.210 GB file in 00:00:03
## Saving all data in .rds file...
# the number of experiments downloaded from Bgee
length(data_bgee_mouse)
## [1] 7
# see your first experiment
data_bgee_experiment1 <- data_bgee_mouse[[1]]
head(data_bgee_experiment1)
## Experiment ID Library ID Library type Gene ID
## 1 GSE30352 GSM752614 single ENSMUSG00000000001
## 2 GSE30352 GSM752614 single ENSMUSG00000000003
## 3 GSE30352 GSM752614 single ENSMUSG00000000028
## 4 GSE30352 GSM752614 single ENSMUSG00000000031
## 5 GSE30352 GSM752614 single ENSMUSG00000000037
## 6 GSE30352 GSM752614 single ENSMUSG00000000049
## Anatomical entity ID Anatomical entity name Stage ID
## 1 UBERON:0000955 brain UBERON:0000113
## 2 UBERON:0000955 brain UBERON:0000113
## 3 UBERON:0000955 brain UBERON:0000113
## 4 UBERON:0000955 brain UBERON:0000113
## 5 UBERON:0000955 brain UBERON:0000113
## 6 UBERON:0000955 brain UBERON:0000113
## Stage name Read count RPKM Detection flag
## 1 post-juvenile adult stage 550 11.92921 present
## 2 post-juvenile adult stage 0 0.00000 absent
## 3 post-juvenile adult stage 12 0.53941 absent
## 4 post-juvenile adult stage 2 0.11157 absent
## 5 post-juvenile adult stage 13 0.27897 absent
## 6 post-juvenile adult stage 7 0.74416 absent
## Detection quality State in Bgee
## 1 high quality Used in expression call
## 2 high quality Result excluded, reason: pre-filtering
## 3 high quality Used in expression call
## 4 high quality Result excluded, reason: noExpression conflict
## 5 high quality Used in no-expression call
## 6 high quality Used in expression call
Alternatively, you can choose to download only one experiment from Bgee, as in the example below. The data is then saved as a .tsv file in your current folder.
# download RPKMs and counts only for GSE30617 for Mus musculus
data_bgee_mouse_gse30617 <- bgee$get_data(experiment.id = "GSE30617")
## Downloading expression data for the experiment GSE30617
## Downloading expression data...
## Saved expression data file in Mus_musculus folder.
## Unzipping file...
##
Read 85.1% of 1410444 rows
Read 1410444 rows and 13 (of 13) columns from 0.210 GB file in 00:00:03
## Saving all data in .rds file...
The data from different samples will be listed in rows, one after the other. It is sometimes easier to work with data organized as a matrix, where different columns represent different samples. To transform the data into a matrix with genes in rows and samples in columns, you can use the bgee$format_data()
function that separates data according to anatomical term. This function also allows to filter out genes that are not called present in a given sample (giving them NA values).
# only present calls and rpkm values for GSE30617
gene.expression.mouse.rpkm <- bgee$format_data(data_bgee_mouse_gse30617, calltype = "expressed", stats = "rpkm")
## keeping only expressed genes...
## Extracting features...
## Done...
## Extracting pheno...
## Done...
names(gene.expression.mouse.rpkm)
## NULL
head(gene.expression.mouse.rpkm$"Ammon's horn")
## NULL
The loadTopAnatData()
function loads a mapping from genes to anatomical structures based on calls of expression in anatomical structures. It also loads the structure of the anatomical ontology and the names of anatomical structures.
# Loading calls of expression based on RNA-seq data only
myTopAnatData <- loadTopAnatData(species=10090, datatype="rna_seq")
##
## Building URLs to retrieve organ relationships from Bgee.........
## URL successfully built (http://r.bgee.org/?page=dao&action=org.bgee.model.dao.api.ontologycommon.RelationDAO.getAnatEntityRelations&display_type=tsv&species_list=10090&attr_list=SOURCE_ID&attr_list=TARGET_ID)
## Submitting URL to Bgee webservice (can be long)
## Got results from Bgee webservice. Files are written in "/tmp/RtmpvM7P2H/Rbuild28744f1dbf67/BgeeDB/vignettes/"
##
## Building URLs to retrieve organ names from Bgee.................
## URL successfully built (http://r.bgee.org/?page=dao&action=org.bgee.model.dao.api.anatdev.AnatEntityDAO.getAnatEntities&display_type=tsv&species_list=10090&attr_list=ID&attr_list=NAME)
## Submitting URL to Bgee webservice (can be long)
## Got results from Bgee webservice. Files are written in "/tmp/RtmpvM7P2H/Rbuild28744f1dbf67/BgeeDB/vignettes/"
##
## Building URLs to retrieve mapping of gene to organs from Bgee...
## URL successfully built (http://r.bgee.org/?page=dao&action=org.bgee.model.dao.api.expressiondata.ExpressionCallDAO.getExpressionCalls&display_type=tsv&species_list=10090&attr_list=GENE_ID&attr_list=ANAT_ENTITY_ID&data_type=RNA_SEQ)
## Submitting URL to Bgee webservice (can be long)
## Got results from Bgee webservice. Files are written in "/tmp/RtmpvM7P2H/Rbuild28744f1dbf67/BgeeDB/vignettes/"
##
## Parsing the results...............................................
##
## Adding BGEE:0 as unique root of all terms of the ontology.........
##
## Done.
# Loading calls observed in embryonic stages only
myTopAnatData <- loadTopAnatData(species=10090, datatype="rna_seq", stage="UBERON:0000068")
##
## Warning: an organ relationships file was found in the working directory and will be used as is. Please delete and rerun the function if you want the data to be updated.
##
## Warning: an organ names file was found in the working directory and will be used as is. Please delete and rerun the function if you want the data to be updated.
##
## Building URLs to retrieve mapping of gene to organs from Bgee...
## URL successfully built (http://r.bgee.org/?page=dao&action=org.bgee.model.dao.api.expressiondata.ExpressionCallDAO.getExpressionCalls&display_type=tsv&species_list=10090&attr_list=GENE_ID&attr_list=ANAT_ENTITY_ID&data_type=RNA_SEQ&stage_id=UBERON:0000068)
## Submitting URL to Bgee webservice (can be long)
## Got results from Bgee webservice. Files are written in "/tmp/RtmpvM7P2H/Rbuild28744f1dbf67/BgeeDB/vignettes/"
##
## Parsing the results...............................................
##
## Adding BGEE:0 as unique root of all terms of the ontology.........
##
## Done.
# Look at the data
lapply(myTopAnatData, head)
## $gene2anatomy
## ENSMUSG00000000001 ENSMUSG00000000028 ENSMUSG00000000031
## "CL:0002322" "CL:0002322" "CL:0002322"
## ENSMUSG00000000037 ENSMUSG00000000056 ENSMUSG00000000058
## "CL:0002322" "CL:0002322" "CL:0002322"
##
## $organ.relationships
## $organ.relationships$`AEO:0000013`
## [1] "UBERON:0000479"
##
## $organ.relationships$`AEO:0000127`
## [1] "UBERON:0005423"
##
## $organ.relationships$`AEO:0000129`
## [1] "UBERON:0005423"
##
## $organ.relationships$`AEO:0000131`
## [1] "UBERON:0005423"
##
## $organ.relationships$`AEO:0000153`
## [1] "AEO:0000187"
##
## $organ.relationships$`AEO:0000162`
## [1] "UBERON:0006984"
##
##
## $organ.names
## organId organName
## 1 AEO:0001009 proliferating neuroepithelium
## 2 AEO:0001010 differentiating neuroepithelium
## 3 AEO:0001013 neuronal column
## 4 CL:0000005 fibroblast neural crest derived
## 5 CL:0000027 smooth muscle cell neural crest derived
## 6 CL:0000041 mature eosinophil
Note: the results are stored in files (see the pathToData
arguments). To save time, if you query again with the exact same parameters, these cache files will be read instead of querying the web-service. So do not delete the files in the working folder if you plan to perform additional queries.
First we need to prepare a list of genes in the foreground and in the background. The input format is the same as the gene list required to build a topGOdata
object in the topGO
package: a vector with background genes as names, and 0 or 1 values depending if a gene is in the foreground or not. In this example we will look at genes, annotated with “spermatogenesis” in the Gene Ontology (using the biomaRt
package).
source("https://bioconductor.org/biocLite.R")
## Bioconductor version 3.3 (BiocInstaller 1.22.3), ?biocLite for help
biocLite("biomaRt")
## BioC_mirror: https://bioconductor.org
## Using Bioconductor 3.3 (BiocInstaller 1.22.3), R 3.3.1 (2016-06-21).
## Installing package(s) 'biomaRt'
## Old packages: 'EnrichmentBrowser', 'reutils', 'scran'
library(biomaRt)
ensembl <- useMart("ensembl")
ensembl <- useDataset("mmusculus_gene_ensembl", mart=ensembl)
# Foreground genes are those with a GO annotation "spermatogenesis"
myGenes <- getBM(attributes= "ensembl_gene_id", filters=c("go_id"), values=list(c("GO:0007283")), mart=ensembl)
# Background are all genes with a GO annotation
universe <- getBM(attributes= "ensembl_gene_id", filters=c("with_go_go"), values=list(c(TRUE)), mart=ensembl)
# Prepares the gene list vector
geneList <- factor(as.integer(universe[,1] %in% myGenes[,1]))
names(geneList) <- universe[,1]
head(geneList)
## ENSMUSG00000064370 ENSMUSG00000064368 ENSMUSG00000064367
## 0 0 0
## ENSMUSG00000064363 ENSMUSG00000065947 ENSMUSG00000064360
## 0 0 0
## Levels: 0 1
summary(geneList == 1)
## Mode FALSE TRUE NA's
## logical 21239 335 0
# Prepares the topGO object
myTopAnatObject <- topAnat(myTopAnatData, geneList)
##
## Checking topAnatData object.........
##
## Checking gene list..................
## Warning: Some genes in your gene list have no expression data in Bgee, and will not be included in the analysis. 11803 genes in background will be kept.
##
## Building 'most specific' Terms...... ( 1 Terms found. )
##
## Build DAG topology.................. ( 13 terms and 15 relations. )
##
## Annotating nodes (Can be long)...... ( 11803 genes annotated to the nodes. )
Warning: This can be long, especially if the gene list is large, since the anatomical ontology is large and expression calls will be propagated through the whole ontology (e.g., expression in the forebrain will also be counted as expression in parent structures such as the brain, nervous system, etc). Consider running a script in batch mode if you have multiple analyses to do.
For this step, see the vignette of the topGO
package for more details, as you have to directly use the tests implemented in the topGO
package, as shown in this example:
results <- runTest(myTopAnatObject, algorithm = 'classic', statistic = 'fisher')
##
## -- Classic Algorithm --
##
## the algorithm is scoring 13 nontrivial nodes
## parameters:
## test statistic: fisher
# You can also use the topGO decorrelation methods, for example the "weight" method to get less redundant results
results <- runTest(myTopAnatObject, algorithm = 'weight', statistic = 'fisher')
##
## -- Weight Algorithm --
##
## The algorithm is scoring 13 nontrivial nodes
## parameters:
## test statistic: fisher : ratio
##
## Level 9: 1 nodes to be scored.
##
## Level 8: 1 nodes to be scored.
##
## Level 7: 1 nodes to be scored.
##
## Level 6: 2 nodes to be scored.
##
## Level 5: 1 nodes to be scored.
##
## Level 4: 1 nodes to be scored.
##
## Level 3: 3 nodes to be scored.
##
## Level 2: 2 nodes to be scored.
##
## Level 1: 1 nodes to be scored.
Warning: This can be long because of the size of the ontology. Consider running a script in batch mode if you have multiple analyses to do.
We built the makeTable
function to filter and format the test results. Results are sorted by p-value, and FDR values are calculated.
# Display results sigificant at a 1% FDR threshold
tableOver <- makeTable(myTopAnatData, myTopAnatObject, results, 0.01)
##
## Warning: there was no significant term at a FDR threshold of 0.01
# Display all results
tableOver <- makeTable(myTopAnatData, myTopAnatObject, results, 1)
##
## Building the results table for the 13 significant terms at FDR threshold of 1... Done
Warning: it is debated whether FDR correction is appropriate on enrichment test results, since tests on different terms of the ontologies are not independent. A nice discussion can be found in the vignette of the topGO
package.