Contents

1 Introduction

The Organism.dplyr creates an on disk sqlite database to hold data of an organism combined from an ‘org’ package (e.g., org.Hs.eg.db) and a genome coordinate functionality of the ‘TxDb’ package (e.g., TxDb.Hsapiens.UCSC.hg38.knownGene). It aims to provide an integrated presentation of identifiers and genomic coordinates. And a src_organism object is created to point to the database.

The src_organism object is created as an extension of src_sql and src_sqlite from dplyr, which inherited all dplyr methods. It also implements the select() interface from AnnotationDbi and genomic coordinates extractors from GenomicFeatures.

2 Constructing a src_organism

2.1 Make sqlite datebase from ‘TxDb’ package

The src_organism() constructor creates an on disk sqlite database file with data from a given ‘TxDb’ package and corresponding ‘org’ package. When dbpath is given, file is created at the given path, otherwise temporary file is created.

library(Organism.dplyr)

Running src_organism() without a given path will save the sqlite file to a tempdir():

src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene")

Alternatively you can provide explicit path to where the sqlite file should be saved, and re-use the data base at a later date.

path <- "path/to/my.sqlite"
src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene", path)

supportedOrganisms() provides a list of organisms with corresponding ‘org’ and ‘TxDb’ packages being supported.

supportedOrganisms()
## # A tibble: 21 x 3
##                   organism        OrgDb
##                      <chr>        <chr>
##  1              Bos taurus org.Bt.eg.db
##  2  Caenorhabditis elegans org.Ce.eg.db
##  3  Caenorhabditis elegans org.Ce.eg.db
##  4        Canis familiaris org.Cf.eg.db
##  5 Drosophila melanogaster org.Dm.eg.db
##  6 Drosophila melanogaster org.Dm.eg.db
##  7             Danio rerio org.Dr.eg.db
##  8           Gallus gallus org.Gg.eg.db
##  9            Homo sapiens org.Hs.eg.db
## 10            Homo sapiens org.Hs.eg.db
## # ... with 11 more rows, and 1 more variables: TxDb <chr>

2.2 Make sqlite datebase from organism name

Organism name, genome and id could be specified to create sqlite database. Organism name (either Organism or common name) must be provided to create the database, if genome and/or id are not provided, most recent ‘TxDb’ package is used.

src <- src_ucsc("human", path)

2.3 Access existing sqlite file

An existing on-disk sqlite file can be accessed without recreating the database. A version of the database created with TxDb.Hsapiens.UCSC.hg38.knownGene, with just 50 Entrez gene identifiers, is distributed with the Organism.dplyr package

src <- src_organism(dbpath = hg38light())
src
## src:  sqlite 3.19.3 [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
## tbls: id, id_accession, id_go, id_go_all, id_omim_pm, id_protein,
##   id_transcript, ranges_cds, ranges_exon, ranges_gene, ranges_tx

3 The “dplyr” interface

All methods from package dplyr can be used for a src_organism object.

Look at all available tables.

src_tbls(src)
##  [1] "id_accession"  "id_transcript" "id"            "id_omim_pm"   
##  [5] "id_protein"    "id_go"         "id_go_all"     "ranges_gene"  
##  [9] "ranges_tx"     "ranges_exon"   "ranges_cds"

Look at data from one specific table.

tbl(src, "id")
## # Source:   table<id> [?? x 6]
## # Database: sqlite 3.19.3
## #   [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
##    entrez      map         ensembl symbol               genename    alias
##     <chr>    <chr>           <chr>  <chr>                  <chr>    <chr>
##  1      1  19q13.4 ENSG00000121410   A1BG alpha-1-B glycoprotein      A1B
##  2      1  19q13.4 ENSG00000121410   A1BG alpha-1-B glycoprotein      ABG
##  3      1  19q13.4 ENSG00000121410   A1BG alpha-1-B glycoprotein      GAB
##  4      1  19q13.4 ENSG00000121410   A1BG alpha-1-B glycoprotein HYST2477
##  5      1  19q13.4 ENSG00000121410   A1BG alpha-1-B glycoprotein     A1BG
##  6     10     8p22 ENSG00000156006   NAT2  N-acetyltransferase 2     AAC2
##  7     10     8p22 ENSG00000156006   NAT2  N-acetyltransferase 2    NAT-2
##  8     10     8p22 ENSG00000156006   NAT2  N-acetyltransferase 2     PNAT
##  9     10     8p22 ENSG00000156006   NAT2  N-acetyltransferase 2     NAT2
## 10    100 20q13.12 ENSG00000196839    ADA    adenosine deaminase      ADA
## # ... with more rows

Look at fields of one table.

colnames(tbl(src, "id"))
## [1] "entrez"   "map"      "ensembl"  "symbol"   "genename" "alias"

Below are some examples of querying tables using dplyr.

  1. Gene symbol starting with “SNORD” (the notation SNORD% is from SQL, with % representing a wild-card match to any string)
tbl(src, "id") %>%
    filter(symbol %like% "SNORD%") %>%
    dplyr::select(entrez, map, ensembl, symbol) %>%
    distinct() %>% arrange(symbol) %>% collect()
## # A tibble: 9 x 4
##      entrez     map         ensembl     symbol
##       <chr>   <chr>           <chr>      <chr>
## 1 100033413 15q11.2 ENSG00000207063 SNORD116-1
## 2 100033414 15q11.2 ENSG00000207001 SNORD116-2
## 3 100033415 15q11.2 ENSG00000207014 SNORD116-3
## 4 100033416 15q11.2 ENSG00000275529 SNORD116-4
## 5 100033417 15q11.2 ENSG00000207191 SNORD116-5
## 6 100033418 15q11.2 ENSG00000207442 SNORD116-6
## 7 100033419 15q11.2 ENSG00000207133 SNORD116-7
## 8 100033420 15q11.2 ENSG00000207093 SNORD116-8
## 9 100033421 15q11.2 ENSG00000206727 SNORD116-9
  1. Gene ontology (GO) info for gene symbol “ADA”
inner_join(tbl(src, "id"), tbl(src, "id_go")) %>%
    filter(symbol == "ADA") %>%
    dplyr::select(entrez, ensembl, symbol, go, evidence, ontology)
## Joining, by = "entrez"
## # Source:   lazy query [?? x 6]
## # Database: sqlite 3.19.3
## #   [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
##    entrez         ensembl symbol         go evidence ontology
##     <chr>           <chr>  <chr>      <chr>    <chr>    <chr>
##  1    100 ENSG00000196839    ADA GO:0001666      IDA       BP
##  2    100 ENSG00000196839    ADA GO:0001821      IEA       BP
##  3    100 ENSG00000196839    ADA GO:0001829      IEA       BP
##  4    100 ENSG00000196839    ADA GO:0001883      IEA       MF
##  5    100 ENSG00000196839    ADA GO:0001889      IEA       BP
##  6    100 ENSG00000196839    ADA GO:0001890      IEA       BP
##  7    100 ENSG00000196839    ADA GO:0002314      IEA       BP
##  8    100 ENSG00000196839    ADA GO:0002636      IEA       BP
##  9    100 ENSG00000196839    ADA GO:0002686      IEA       BP
## 10    100 ENSG00000196839    ADA GO:0002906      IEA       BP
## # ... with more rows
  1. Gene transcript counts per gene symbol
txcount <- inner_join(tbl(src, "id"), tbl(src, "ranges_tx")) %>%
    dplyr::select(symbol, tx_id) %>%
    group_by(symbol) %>%
    summarise(count = count(symbol)) %>%
    dplyr::select(symbol, count) %>%
    arrange(desc(count)) %>%
    collect(n=Inf)
## Joining, by = "entrez"
txcount
## # A tibble: 18 x 2
##        symbol count
##         <chr> <int>
##  1       AKT3   270
##  2   RNA5-8S5   128
##  3    NAALAD2    50
##  4       A1BG    40
##  5       MED6    39
##  6       CDH2    25
##  7      NR2E3    18
##  8    RNA28S5    18
##  9       NAT2     8
## 10   ANO1-AS2     6
## 11 SNORD116-2     4
## 12 SNORD116-3     4
## 13 SNORD116-5     4
## 14        ADA     3
## 15 SNORD116-1     3
## 16 SNORD116-4     2
## 17 SNORD116-8     2
## 18 ZBTB11-AS1     1
library(ggplot2)
ggplot(txcount, aes(x = symbol, y = count)) + 
    geom_bar(stat="identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    ggtitle("Transcript count") +
    labs(x = "Symbol") +
    labs(y = "Count")

  1. Gene coordinates of symbol “ADA” and “NAT2” as GRanges
inner_join(tbl(src, "id"), tbl(src, "ranges_gene")) %>%
    filter(symbol %in% c("ADA", "NAT2")) %>%
    dplyr::select(gene_chrom, gene_start, gene_end, gene_strand,
                  symbol, map) %>%
    collect() %>% GenomicRanges::GRanges()
## Joining, by = "entrez"
## GRanges object with 5 ranges and 2 metadata columns:
##       seqnames               ranges strand |      symbol         map
##          <Rle>            <IRanges>  <Rle> | <character> <character>
##   [1]     chr8 [18391245, 18401218]      + |        NAT2        8p22
##   [2]     chr8 [18391245, 18401218]      + |        NAT2        8p22
##   [3]     chr8 [18391245, 18401218]      + |        NAT2        8p22
##   [4]     chr8 [18391245, 18401218]      + |        NAT2        8p22
##   [5]    chr20 [44619522, 44651742]      - |         ADA    20q13.12
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

4 The “select” interface

Methods select(), keytypes(), keys(), columns() and mapIds from AnnotationDbi are implemented for src_organism objects.

Use keytypes() to discover which keytypes can be passed to keytype argument of methods select() or keys().

keytypes(src)
##  [1] "accnum"       "alias"        "cds_chrom"    "cds_end"     
##  [5] "cds_id"       "cds_name"     "cds_start"    "cds_strand"  
##  [9] "ensembl"      "ensemblprot"  "ensembltrans" "entrez"      
## [13] "enzyme"       "evidence"     "evidenceall"  "exon_chrom"  
## [17] "exon_end"     "exon_id"      "exon_name"    "exon_rank"   
## [21] "exon_start"   "exon_strand"  "gene_chrom"   "gene_end"    
## [25] "gene_start"   "gene_strand"  "genename"     "go"          
## [29] "goall"        "ipi"          "map"          "omim"        
## [33] "ontology"     "ontologyall"  "pfam"         "pmid"        
## [37] "prosite"      "refseq"       "symbol"       "tx_chrom"    
## [41] "tx_end"       "tx_id"        "tx_name"      "tx_start"    
## [45] "tx_strand"    "tx_type"      "unigene"      "uniprot"

Use columns() to discover which kinds of data can be returned for the src_organism object.

columns(src)
##  [1] "accnum"       "alias"        "cds_chrom"    "cds_end"     
##  [5] "cds_id"       "cds_name"     "cds_start"    "cds_strand"  
##  [9] "ensembl"      "ensemblprot"  "ensembltrans" "entrez"      
## [13] "enzyme"       "evidence"     "evidenceall"  "exon_chrom"  
## [17] "exon_end"     "exon_id"      "exon_name"    "exon_rank"   
## [21] "exon_start"   "exon_strand"  "gene_chrom"   "gene_end"    
## [25] "gene_start"   "gene_strand"  "genename"     "go"          
## [29] "goall"        "ipi"          "map"          "omim"        
## [33] "ontology"     "ontologyall"  "pfam"         "pmid"        
## [37] "prosite"      "refseq"       "symbol"       "tx_chrom"    
## [41] "tx_end"       "tx_id"        "tx_name"      "tx_start"    
## [45] "tx_strand"    "tx_type"      "unigene"      "uniprot"

keys() returns keys for the src_organism object. By default it returns the primary keys for the database, and returns the keys from that keytype when the keytype argument is used.

Keys of entrez

head(keys(src))
## [1] "1"         "10"        "100"       "1000"      "10000"     "100008586"

Keys of symbol

head(keys(src, "symbol"))
## [1] "A1BG"     "ADA"      "AKT3"     "ANO1-AS2" "CDH2"     "DUXB"

select() retrieves the data as a tibble based on parameters for selected keys columns and keytype arguments. If requested columns that have multiple matches for the keys, select_tbl() will return a tibble with one row for each possible match, and select() will return a data frame.

keytype <- "symbol"
keys <- c("ADA", "NAT2")
columns <- c("entrez", "tx_id", "tx_name","exon_id")
select_tbl(src, keys, columns, keytype)
## Joining, by = "entrez"
## # Source:   lazy query [?? x 5]
## # Database: sqlite 3.19.3
## #   [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
##    symbol entrez  tx_id    tx_name exon_id
##     <chr>  <chr>  <int>      <chr>   <int>
##  1    ADA    100 169786 uc061xfj.1  501379
##  2    ADA    100 169786 uc061xfj.1  501383
##  3    ADA    100 169786 uc061xfj.1  501385
##  4    ADA    100 169786 uc061xfj.1  501387
##  5    ADA    100 169786 uc061xfj.1  501389
##  6    ADA    100 169786 uc061xfj.1  501390
##  7    ADA    100 169786 uc061xfj.1  501392
##  8    ADA    100 169787 uc002xmj.4  501379
##  9    ADA    100 169787 uc002xmj.4  501382
## 10    ADA    100 169787 uc002xmj.4  501384
## # ... with more rows

mapIds() gets the mapped ids (column) for a set of keys that are of a particular keytype. Usually returned as a named character vector.

mapIds(src, keys, column = "tx_name", keytype)
## Joining, by = "entrez"
##          ADA         NAT2 
## "uc002xmj.4" "uc003wyw.2"
mapIds(src, keys, column = "tx_name", keytype, multiVals="CharacterList")
## Joining, by = "entrez"
## CharacterList of length 2
## [["ADA"]] uc002xmj.4 uc061xfj.1 uc061xfl.1
## [["NAT2"]] uc003wyw.2 uc064kqw.1

5 The “GRanges” interface

Eleven genomic coordinates extractor methods are available in this package: transcripts(), exons(), cds(), genes(), promoters(), transcriptsBy(), exonsBy(), cdsBy(), intronsByTranscript(), fiveUTRsByTranscript(), threeUTRsByTranscript(). Data can be returned in two versions, for instance tibble (transcripts_tbl()) and GRanges or GRangesList (transcripts()).

Filters can be applied to all extractor functions. A named list of vectors can be used to restrict the output, valid filters can be retrieved by supportedFilters().

supportedFilters()
##  [1] "AccnumFilter"       "AliasFilter"        "CdsChromFilter"    
##  [4] "CdsNameFilter"      "CdsStrandFilter"    "EnsemblFilter"     
##  [7] "EnsemblprotFilter"  "EnsembltransFilter" "EntrezFilter"      
## [10] "EnzymeFilter"       "EvidenceFilter"     "EvidenceallFilter" 
## [13] "ExonChromFilter"    "ExonNameFilter"     "ExonStrandFilter"  
## [16] "FlybaseFilter"      "FlybaseCgFilter"    "FlybaseProtFilter" 
## [19] "GeneChromFilter"    "GeneStrandFilter"   "GenenameFilter"    
## [22] "GoFilter"           "GoallFilter"        "IpiFilter"         
## [25] "MapFilter"          "MgiFilter"          "OmimFilter"        
## [28] "OntologyFilter"     "OntologyallFilter"  "PfamFilter"        
## [31] "PmidFilter"         "PrositeFilter"      "RefseqFilter"      
## [34] "SymbolFilter"       "TxChromFilter"      "TxNameFilter"      
## [37] "TxStrandFilter"     "TxTypeFilter"       "UnigeneFilter"     
## [40] "UniprotFilter"      "WormbaseFilter"     "ZfinFilter"        
## [43] "CdsIdFilter"        "CdsStartFilter"     "CdsEndFilter"      
## [46] "ExonIdFilter"       "ExonStartFilter"    "ExonEndFilter"     
## [49] "ExonRankFilter"     "GeneStartFilter"    "GeneEndFilter"     
## [52] "TxIdFilter"         "TxStartFilter"      "TxEndFilter"

All filters take two parameters: value and condition, condition could be one of “==”, “!=”, “startsWith”, “endsWith”, “>”, “<”, “>=” and “<=”, default condition is “==”.

EnsemblFilter("ENSG00000196839")
## class: EnsemblFilter 
## condition: == 
## value: ENSG00000196839
SymbolFilter("SNORD", "startsWith")
## class: SymbolFilter 
## condition: startsWith 
## value: SNORD

A GRangesFilter() can also be used as filter for the methods with result displaying as GRanges or GRangesList.

filters <- list(SymbolFilter("SNORD", "startsWith"))
transcripts_tbl(src, filter=filters)
## Joining, by = "entrez"
## # Source:     lazy query [?? x 7]
## # Database:   sqlite 3.19.3
## #   [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
## # Ordered by: tx_id
##   tx_chrom tx_start   tx_end tx_strand  tx_id    tx_name     symbol
##      <chr>    <int>    <int>     <chr>  <int>      <chr>      <chr>
## 1    chr15 25051477 25051571         + 123215 uc001yxg.4 SNORD116-1
## 2    chr15 25054210 25054304         + 123216 uc001yxi.4 SNORD116-2
## 3    chr15 25056860 25056954         + 123217 uc001yxk.4 SNORD116-3
## 4    chr15 25059538 25059633         + 123218 uc001yxl.5 SNORD116-4
## 5    chr15 25062333 25062427         + 123219 uc001yxo.4 SNORD116-5
## 6    chr15 25065026 25065121         + 123221 uc001yxp.5 SNORD116-2
## 7    chr15 25067788 25067882         + 123222 uc059gus.1 SNORD116-5
## 8    chr15 25070432 25070526         + 123223 uc001yxr.4 SNORD116-8
## 9    chr15 25073107 25073201         + 123224 uc001yxs.4 SNORD116-3
filters <- list(
    SymbolFilter("SNORD", "startsWith"),
    GRangesFilter(GenomicRanges::GRanges("chr15:25062333-25065121"))
)
transcripts(src, filter=filters)
## Joining, by = "entrez"
## GRanges object with 2 ranges and 3 metadata columns:
##       seqnames               ranges strand |     tx_id     tx_name
##          <Rle>            <IRanges>  <Rle> | <integer> <character>
##   [1]    chr15 [25062333, 25062427]      + |    123219  uc001yxo.4
##   [2]    chr15 [25065026, 25065121]      + |    123221  uc001yxp.5
##            symbol
##       <character>
##   [1]  SNORD116-5
##   [2]  SNORD116-2
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome

Transcript coordinates of gene symbol equal to “ADA” and transcript start position between 87863438 and 87933487.

transcripts_tbl(src, filter = list(
    SymbolFilter("ADA"),
    TxStartFilter(44619810,"<")
))
## Joining, by = "entrez"
## # Source:     lazy query [?? x 7]
## # Database:   sqlite 3.19.3
## #   [/tmp/RtmpzDFgaQ/Rinst69bc2284cb0/Organism.dplyr/extdata/light.hg38.knownGene.sqlite]
## # Ordered by: tx_id
##   tx_chrom tx_start   tx_end tx_strand  tx_id    tx_name symbol
##      <chr>    <int>    <int>     <chr>  <int>      <chr>  <chr>
## 1    chr20 44619522 44626491         - 169786 uc061xfj.1    ADA
## 2    chr20 44619522 44651742         - 169787 uc002xmj.4    ADA
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2         ggplot2_2.2.1        GenomicRanges_1.28.6
##  [4] GenomeInfoDb_1.12.3  IRanges_2.10.5       S4Vectors_0.14.7    
##  [7] BiocGenerics_0.22.1  Organism.dplyr_1.2.2 dplyr_0.7.4         
## [10] BiocStyle_2.4.1     
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.6.5 lattice_0.20-35           
##  [3] colorspace_1.3-2           htmltools_0.3.6           
##  [5] BiocFileCache_1.0.1        rtracklayer_1.36.6        
##  [7] yaml_2.1.14                GenomicFeatures_1.28.5    
##  [9] blob_1.1.0                 XML_3.98-1.9              
## [11] rlang_0.1.2                glue_1.1.1                
## [13] DBI_0.7                    BiocParallel_1.10.1       
## [15] rappdirs_0.3.1             bit64_0.9-7               
## [17] dbplyr_1.1.0               plyr_1.8.4                
## [19] matrixStats_0.52.2         GenomeInfoDbData_0.99.0   
## [21] bindr_0.1                  stringr_1.2.0             
## [23] zlibbioc_1.22.0            Biostrings_2.44.2         
## [25] munsell_0.4.3              gtable_0.2.0              
## [27] memoise_1.1.0              evaluate_0.10.1           
## [29] labeling_0.3               Biobase_2.36.2            
## [31] knitr_1.17                 biomaRt_2.32.1            
## [33] AnnotationDbi_1.38.2       Rcpp_0.12.13              
## [35] scales_0.5.0               backports_1.1.1           
## [37] DelayedArray_0.2.7         XVector_0.16.0            
## [39] bit_1.1-12                 Rsamtools_1.28.0          
## [41] digest_0.6.12              stringi_1.1.5             
## [43] grid_3.4.2                 rprojroot_1.2             
## [45] tools_3.4.2                bitops_1.0-6              
## [47] magrittr_1.5               lazyeval_0.2.0            
## [49] RCurl_1.95-4.8             tibble_1.3.4              
## [51] RSQLite_2.0                pkgconfig_2.0.1           
## [53] Matrix_1.2-11              assertthat_0.2.0          
## [55] rmarkdown_1.6              httr_1.3.1                
## [57] R6_2.2.2                   GenomicAlignments_1.12.2  
## [59] compiler_3.4.2