The Organism.dplyr creates an on disk sqlite database to hold data of an organism combined from an ‘org’ package (e.g., org.Hs.eg.db) and a genome coordinate functionality of the ‘TxDb’ package (e.g., TxDb.Hsapiens.UCSC.hg38.knownGene). It aims to provide an integrated presentation of identifiers and genomic coordinates. And a src_organism object is created to point to the database.
The src_organism object is created as an extension of src_sql and src_sqlite from dplyr, which inherited all dplyr methods. It also implements the select()
interface from AnnotationDbi and genomic coordinates extractors from GenomicFeatures.
The src_organism()
constructor creates an on disk sqlite database file with data from a given ‘TxDb’ package and corresponding ‘org’ package. When dbpath is given, file is created at the given path, otherwise temporary file is created.
library(Organism.dplyr)
Running src_organism()
without a given path will save the sqlite file to a tempdir()
:
src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene")
Alternatively you can provide explicit path to where the sqlite file should be saved, and re-use the data base at a later date.
path <- "path/to/my.sqlite"
src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene", path)
supportedOrganisms()
provides a list of organisms with corresponding ‘org’ and ‘TxDb’ packages being supported.
supportedOrganisms()
## # A tibble: 21 x 3
## organism OrgDb
## <chr> <chr>
## 1 Bos taurus org.Bt.eg.db
## 2 Caenorhabditis elegans org.Ce.eg.db
## 3 Caenorhabditis elegans org.Ce.eg.db
## 4 Canis familiaris org.Cf.eg.db
## 5 Drosophila melanogaster org.Dm.eg.db
## 6 Drosophila melanogaster org.Dm.eg.db
## 7 Danio rerio org.Dr.eg.db
## 8 Gallus gallus org.Gg.eg.db
## 9 Homo sapiens org.Hs.eg.db
## 10 Homo sapiens org.Hs.eg.db
## # ... with 11 more rows, and 1 more variables: TxDb <chr>
Organism name, genome and id could be specified to create sqlite database. Organism name (either Organism or common name) must be provided to create the database, if genome and/or id are not provided, most recent ‘TxDb’ package is used.
src <- src_ucsc("human", path)
An existing on-disk sqlite file can be accessed without recreating the database. A version of the database created with TxDb.Hsapiens.UCSC.hg38.knownGene, with just 50 Entrez gene identifiers, is distributed with the Organism.dplyr package
src <- src_organism(dbpath = hg38light())
src
## src: sqlite 3.19.3 [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
## tbls: id, id_accession, id_go, id_go_all, id_omim_pm, id_protein,
## id_transcript, ranges_cds, ranges_exon, ranges_gene, ranges_tx
All methods from package dplyr can be used for a src_organism object.
Look at all available tables.
src_tbls(src)
## [1] "id_accession" "id_transcript" "id" "id_omim_pm"
## [5] "id_protein" "id_go" "id_go_all" "ranges_gene"
## [9] "ranges_tx" "ranges_exon" "ranges_cds"
Look at data from one specific table.
tbl(src, "id")
## # Source: table<id> [?? x 6]
## # Database: sqlite 3.19.3
## # [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
## entrez map ensembl symbol genename alias
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 19q13.4 ENSG00000121410 A1BG alpha-1-B glycoprotein A1B
## 2 1 19q13.4 ENSG00000121410 A1BG alpha-1-B glycoprotein ABG
## 3 1 19q13.4 ENSG00000121410 A1BG alpha-1-B glycoprotein GAB
## 4 1 19q13.4 ENSG00000121410 A1BG alpha-1-B glycoprotein HYST2477
## 5 1 19q13.4 ENSG00000121410 A1BG alpha-1-B glycoprotein A1BG
## 6 10 8p22 ENSG00000156006 NAT2 N-acetyltransferase 2 AAC2
## 7 10 8p22 ENSG00000156006 NAT2 N-acetyltransferase 2 NAT-2
## 8 10 8p22 ENSG00000156006 NAT2 N-acetyltransferase 2 PNAT
## 9 10 8p22 ENSG00000156006 NAT2 N-acetyltransferase 2 NAT2
## 10 100 20q13.12 ENSG00000196839 ADA adenosine deaminase ADA
## # ... with more rows
Look at fields of one table.
colnames(tbl(src, "id"))
## [1] "entrez" "map" "ensembl" "symbol" "genename" "alias"
Below are some examples of querying tables using dplyr.
SNORD%
is from SQL, with %
representing a wild-card match to any string)tbl(src, "id") %>%
filter(symbol %like% "SNORD%") %>%
dplyr::select(entrez, map, ensembl, symbol) %>%
distinct() %>% arrange(symbol) %>% collect()
## # A tibble: 9 x 4
## entrez map ensembl symbol
## <chr> <chr> <chr> <chr>
## 1 100033413 15q11.2 ENSG00000207063 SNORD116-1
## 2 100033414 15q11.2 ENSG00000207001 SNORD116-2
## 3 100033415 15q11.2 ENSG00000207014 SNORD116-3
## 4 100033416 15q11.2 ENSG00000275529 SNORD116-4
## 5 100033417 15q11.2 ENSG00000207191 SNORD116-5
## 6 100033418 15q11.2 ENSG00000207442 SNORD116-6
## 7 100033419 15q11.2 ENSG00000207133 SNORD116-7
## 8 100033420 15q11.2 ENSG00000207093 SNORD116-8
## 9 100033421 15q11.2 ENSG00000206727 SNORD116-9
inner_join(tbl(src, "id"), tbl(src, "id_go")) %>%
filter(symbol == "ADA") %>%
dplyr::select(entrez, ensembl, symbol, go, evidence, ontology)
## Joining, by = "entrez"
## # Source: lazy query [?? x 6]
## # Database: sqlite 3.19.3
## # [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
## entrez ensembl symbol go evidence ontology
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 100 ENSG00000196839 ADA GO:0001666 IDA BP
## 2 100 ENSG00000196839 ADA GO:0001821 IEA BP
## 3 100 ENSG00000196839 ADA GO:0001829 IEA BP
## 4 100 ENSG00000196839 ADA GO:0001883 IEA MF
## 5 100 ENSG00000196839 ADA GO:0001889 IEA BP
## 6 100 ENSG00000196839 ADA GO:0001890 IEA BP
## 7 100 ENSG00000196839 ADA GO:0002314 IEA BP
## 8 100 ENSG00000196839 ADA GO:0002636 IEA BP
## 9 100 ENSG00000196839 ADA GO:0002686 IEA BP
## 10 100 ENSG00000196839 ADA GO:0002906 IEA BP
## # ... with more rows
txcount <- inner_join(tbl(src, "id"), tbl(src, "ranges_tx")) %>%
dplyr::select(symbol, tx_id) %>%
group_by(symbol) %>%
summarise(count = count(symbol)) %>%
dplyr::select(symbol, count) %>%
arrange(desc(count)) %>%
collect(n=Inf)
## Joining, by = "entrez"
txcount
## # A tibble: 18 x 2
## symbol count
## <chr> <int>
## 1 AKT3 270
## 2 RNA5-8S5 128
## 3 NAALAD2 50
## 4 A1BG 40
## 5 MED6 39
## 6 CDH2 25
## 7 NR2E3 18
## 8 RNA28S5 18
## 9 NAT2 8
## 10 ANO1-AS2 6
## 11 SNORD116-2 4
## 12 SNORD116-3 4
## 13 SNORD116-5 4
## 14 ADA 3
## 15 SNORD116-1 3
## 16 SNORD116-4 2
## 17 SNORD116-8 2
## 18 ZBTB11-AS1 1
library(ggplot2)
ggplot(txcount, aes(x = symbol, y = count)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Transcript count") +
labs(x = "Symbol") +
labs(y = "Count")
inner_join(tbl(src, "id"), tbl(src, "ranges_gene")) %>%
filter(symbol %in% c("ADA", "NAT2")) %>%
dplyr::select(gene_chrom, gene_start, gene_end, gene_strand,
symbol, map) %>%
collect() %>% GenomicRanges::GRanges()
## Joining, by = "entrez"
## GRanges object with 5 ranges and 2 metadata columns:
## seqnames ranges strand | symbol map
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr8 [18391245, 18401218] + | NAT2 8p22
## [2] chr8 [18391245, 18401218] + | NAT2 8p22
## [3] chr8 [18391245, 18401218] + | NAT2 8p22
## [4] chr8 [18391245, 18401218] + | NAT2 8p22
## [5] chr20 [44619522, 44651742] - | ADA 20q13.12
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
Methods select()
, keytypes()
, keys()
, columns()
and mapIds
from AnnotationDbi are implemented for src_organism objects.
Use keytypes()
to discover which keytypes can be passed to keytype argument of methods select()
or keys()
.
keytypes(src)
## [1] "accnum" "alias" "cds_chrom" "cds_end"
## [5] "cds_id" "cds_name" "cds_start" "cds_strand"
## [9] "ensembl" "ensemblprot" "ensembltrans" "entrez"
## [13] "enzyme" "evidence" "evidenceall" "exon_chrom"
## [17] "exon_end" "exon_id" "exon_name" "exon_rank"
## [21] "exon_start" "exon_strand" "gene_chrom" "gene_end"
## [25] "gene_start" "gene_strand" "genename" "go"
## [29] "goall" "ipi" "map" "omim"
## [33] "ontology" "ontologyall" "pfam" "pmid"
## [37] "prosite" "refseq" "symbol" "tx_chrom"
## [41] "tx_end" "tx_id" "tx_name" "tx_start"
## [45] "tx_strand" "tx_type" "unigene" "uniprot"
Use columns()
to discover which kinds of data can be returned for the src_organism object.
columns(src)
## [1] "accnum" "alias" "cds_chrom" "cds_end"
## [5] "cds_id" "cds_name" "cds_start" "cds_strand"
## [9] "ensembl" "ensemblprot" "ensembltrans" "entrez"
## [13] "enzyme" "evidence" "evidenceall" "exon_chrom"
## [17] "exon_end" "exon_id" "exon_name" "exon_rank"
## [21] "exon_start" "exon_strand" "gene_chrom" "gene_end"
## [25] "gene_start" "gene_strand" "genename" "go"
## [29] "goall" "ipi" "map" "omim"
## [33] "ontology" "ontologyall" "pfam" "pmid"
## [37] "prosite" "refseq" "symbol" "tx_chrom"
## [41] "tx_end" "tx_id" "tx_name" "tx_start"
## [45] "tx_strand" "tx_type" "unigene" "uniprot"
keys()
returns keys for the src_organism object. By default it returns the primary keys for the database, and returns the keys from that keytype when the keytype argument is used.
Keys of entrez
head(keys(src))
## [1] "1" "10" "100" "1000" "10000" "100008586"
Keys of symbol
head(keys(src, "symbol"))
## [1] "A1BG" "ADA" "AKT3" "ANO1-AS2" "CDH2" "DUXB"
select()
retrieves the data as a tibble based on parameters for selected keys columns and keytype arguments. If requested columns that have multiple matches for the keys, select_tbl()
will return a tibble with one row for each possible match, and select()
will return a data frame.
keytype <- "symbol"
keys <- c("ADA", "NAT2")
columns <- c("entrez", "tx_id", "tx_name","exon_id")
select_tbl(src, keys, columns, keytype)
## Joining, by = "entrez"
## # Source: lazy query [?? x 5]
## # Database: sqlite 3.19.3
## # [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
## symbol entrez tx_id tx_name exon_id
## <chr> <chr> <int> <chr> <int>
## 1 ADA 100 169786 uc061xfj.1 501379
## 2 ADA 100 169786 uc061xfj.1 501383
## 3 ADA 100 169786 uc061xfj.1 501385
## 4 ADA 100 169786 uc061xfj.1 501387
## 5 ADA 100 169786 uc061xfj.1 501389
## 6 ADA 100 169786 uc061xfj.1 501390
## 7 ADA 100 169786 uc061xfj.1 501392
## 8 ADA 100 169787 uc002xmj.4 501379
## 9 ADA 100 169787 uc002xmj.4 501382
## 10 ADA 100 169787 uc002xmj.4 501384
## # ... with more rows
mapIds()
gets the mapped ids (column) for a set of keys that are of a particular keytype. Usually returned as a named character vector.
mapIds(src, keys, column = "tx_name", keytype)
## Joining, by = "entrez"
## ADA NAT2
## "uc002xmj.4" "uc003wyw.2"
mapIds(src, keys, column = "tx_name", keytype, multiVals="CharacterList")
## Joining, by = "entrez"
## CharacterList of length 2
## [["ADA"]] uc002xmj.4 uc061xfj.1 uc061xfl.1
## [["NAT2"]] uc003wyw.2 uc064kqw.1
Eleven genomic coordinates extractor methods are available in this package: transcripts()
, exons()
, cds()
, genes()
, promoters()
, transcriptsBy()
, exonsBy()
, cdsBy()
, intronsByTranscript()
, fiveUTRsByTranscript()
, threeUTRsByTranscript()
. Data can be returned in two versions, for instance tibble (transcripts_tbl()
) and GRanges or GRangesList (transcripts()
).
Filters can be applied to all extractor functions. A named list of vectors can be used to restrict the output, valid filters can be retrieved by supportedFilters()
.
supportedFilters()
## [1] "AccnumFilter" "AliasFilter" "CdsChromFilter"
## [4] "CdsNameFilter" "CdsStrandFilter" "EnsemblFilter"
## [7] "EnsemblprotFilter" "EnsembltransFilter" "EntrezFilter"
## [10] "EnzymeFilter" "EvidenceFilter" "EvidenceallFilter"
## [13] "ExonChromFilter" "ExonNameFilter" "ExonStrandFilter"
## [16] "FlybaseFilter" "FlybaseCgFilter" "FlybaseProtFilter"
## [19] "GeneChromFilter" "GeneStrandFilter" "GenenameFilter"
## [22] "GoFilter" "GoallFilter" "IpiFilter"
## [25] "MapFilter" "MgiFilter" "OmimFilter"
## [28] "OntologyFilter" "OntologyallFilter" "PfamFilter"
## [31] "PmidFilter" "PrositeFilter" "RefseqFilter"
## [34] "SymbolFilter" "TxChromFilter" "TxNameFilter"
## [37] "TxStrandFilter" "TxTypeFilter" "UnigeneFilter"
## [40] "UniprotFilter" "WormbaseFilter" "ZfinFilter"
## [43] "CdsIdFilter" "CdsStartFilter" "CdsEndFilter"
## [46] "ExonIdFilter" "ExonStartFilter" "ExonEndFilter"
## [49] "ExonRankFilter" "GeneStartFilter" "GeneEndFilter"
## [52] "TxIdFilter" "TxStartFilter" "TxEndFilter"
All filters take two parameters: value and condition, condition could be one of “==”, “!=”, “startsWith”, “endsWith”, “>”, “<”, “>=” and “<=”, default condition is “==”.
EnsemblFilter("ENSG00000196839")
## class: EnsemblFilter
## condition: ==
## value: ENSG00000196839
SymbolFilter("SNORD", "startsWith")
## class: SymbolFilter
## condition: startsWith
## value: SNORD
A GRangesFilter()
can also be used as filter for the methods with result displaying as GRanges or GRangesList.
filters <- list(SymbolFilter("SNORD", "startsWith"))
transcripts_tbl(src, filter=filters)
## Joining, by = "entrez"
## # Source: lazy query [?? x 7]
## # Database: sqlite 3.19.3
## # [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
## # Ordered by: tx_id
## tx_chrom tx_start tx_end tx_strand tx_id tx_name symbol
## <chr> <int> <int> <chr> <int> <chr> <chr>
## 1 chr15 25051477 25051571 + 123215 uc001yxg.4 SNORD116-1
## 2 chr15 25054210 25054304 + 123216 uc001yxi.4 SNORD116-2
## 3 chr15 25056860 25056954 + 123217 uc001yxk.4 SNORD116-3
## 4 chr15 25059538 25059633 + 123218 uc001yxl.5 SNORD116-4
## 5 chr15 25062333 25062427 + 123219 uc001yxo.4 SNORD116-5
## 6 chr15 25065026 25065121 + 123221 uc001yxp.5 SNORD116-2
## 7 chr15 25067788 25067882 + 123222 uc059gus.1 SNORD116-5
## 8 chr15 25070432 25070526 + 123223 uc001yxr.4 SNORD116-8
## 9 chr15 25073107 25073201 + 123224 uc001yxs.4 SNORD116-3
filters <- list(
SymbolFilter("SNORD", "startsWith"),
GRangesFilter(GenomicRanges::GRanges("chr15:25062333-25065121"))
)
transcripts(src, filter=filters)
## Joining, by = "entrez"
## GRanges object with 2 ranges and 3 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr15 [25062333, 25062427] + | 123219 uc001yxo.4
## [2] chr15 [25065026, 25065121] + | 123221 uc001yxp.5
## symbol
## <character>
## [1] SNORD116-5
## [2] SNORD116-2
## -------
## seqinfo: 455 sequences (1 circular) from hg38 genome
Transcript coordinates of gene symbol equal to “ADA” and transcript start position between 87863438 and 87933487.
transcripts_tbl(src, filter = list(
SymbolFilter("ADA"),
TxStartFilter(44619810,"<")
))
## Joining, by = "entrez"
## # Source: lazy query [?? x 7]
## # Database: sqlite 3.19.3
## # [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
## # Ordered by: tx_id
## tx_chrom tx_start tx_end tx_strand tx_id tx_name symbol
## <chr> <int> <int> <chr> <int> <chr> <chr>
## 1 chr20 44619522 44626491 - 169786 uc061xfj.1 ADA
## 2 chr20 44619522 44651742 - 169787 uc002xmj.4 ADA
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] bindrcpp_0.2 ggplot2_2.2.1 GenomicRanges_1.28.6
## [4] GenomeInfoDb_1.12.3 IRanges_2.10.5 S4Vectors_0.14.7
## [7] BiocGenerics_0.22.1 Organism.dplyr_1.2.2 dplyr_0.7.4
## [10] BiocStyle_2.4.1
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.6.5 lattice_0.20-35
## [3] colorspace_1.3-2 htmltools_0.3.6
## [5] BiocFileCache_1.0.1 rtracklayer_1.36.6
## [7] yaml_2.1.14 GenomicFeatures_1.28.5
## [9] blob_1.1.0 XML_3.98-1.9
## [11] rlang_0.1.2 glue_1.1.1
## [13] DBI_0.7 BiocParallel_1.10.1
## [15] rappdirs_0.3.1 bit64_0.9-7
## [17] dbplyr_1.1.0 plyr_1.8.4
## [19] matrixStats_0.52.2 GenomeInfoDbData_0.99.0
## [21] bindr_0.1 stringr_1.2.0
## [23] zlibbioc_1.22.0 Biostrings_2.44.2
## [25] munsell_0.4.3 gtable_0.2.0
## [27] memoise_1.1.0 evaluate_0.10.1
## [29] labeling_0.3 Biobase_2.36.2
## [31] knitr_1.17 biomaRt_2.32.1
## [33] AnnotationDbi_1.38.2 Rcpp_0.12.13
## [35] scales_0.5.0 backports_1.1.1
## [37] DelayedArray_0.2.7 XVector_0.16.0
## [39] bit_1.1-12 Rsamtools_1.28.0
## [41] digest_0.6.12 stringi_1.1.5
## [43] grid_3.4.2 rprojroot_1.2
## [45] tools_3.4.2 bitops_1.0-6
## [47] magrittr_1.5 lazyeval_0.2.0
## [49] RCurl_1.95-4.8 tibble_1.3.4
## [51] RSQLite_2.0 pkgconfig_2.0.1
## [53] Matrix_1.2-11 assertthat_0.2.0
## [55] rmarkdown_1.6 httr_1.3.1
## [57] R6_2.2.2 GenomicAlignments_1.12.2
## [59] compiler_3.4.2