1 Introduction

1.1 The Comparative Toxicogenomics Database

The Comparative Toxicogenomics Database (CTDbase; http://ctdbase.org) is a public resource for toxicogenomic information manually curated from the peer-reviewed scientific literature, providing key information about the interactions of environmental chemicals with gene products and their effect on human disease [1][2].

1.2 CTDquerier R package

CTDquerier is an R package that allows R users to download basic data from CTDbase about genes, chemicals and diseases. Once the user's input is validated, the package queries CTDbase and downloads the information associated with that input from the other modules.

2 Querying The Comparative Toxicogenomics Database

CTDbase offers a public web-based interface that includes basic and advanced query options to access data for sequences, references, and toxic agents, and a platform for sequence analysis.

2.2 Batch Query

The Batch Query tool (http://ctdbase.org/tools/batchQuery.go) is provided by CTDbase and allows downloading custom data associated with a set of chemicals, diseases and genes, among others.

The Comparative Toxicogenomics Database - Batch Query

Given a set of terms, the tool allows downloading (as .tsv, .xml, …) curated or inferred data from CTDbase associated with the terms of interest. Table 1 indicates the type of data available depending on the input terms, where C stands for curated, I for inferred, E for enriched and A for all.

Table 1: Type of available data in Batch Query depending on type of input terms
Data Available/Input Data Chemicals Diseases Genes
Chemical–gene interactions C C
Chemical associations A,C,I C
Gene associations C A,C,I C
Disease associations A,C,I A,C,I
Pathway associations I,E I C
Gene Ontology associations A,E A

The tables obtained by querying CTDbase through the Batch Query tool with the gene XKR4, asking for associated chemicals and associated diseases (curated, inferred and all), are included in the CTDquerier R package (queries performed on 2018/JAN/02).

These four files can be loaded as follows:

# Chemicals - XKR4
bq_xkr4_c <- system.file(
  paste0( "extdata", .Platform$file.sep, "bq_xkr4_chem.tsv" ), 
  package="CTDquerier"
)
nrow( read.delim( bq_xkr4_c, sep = "\t" ) )
## [1] 18
# Diseases curated - XKR4
bq_xkr4_dC <- system.file(
  paste0( "extdata", .Platform$file.sep, "bq_xkr4_disease_curated.tsv" ), 
  package="CTDquerier"
)
nrow( read.delim( bq_xkr4_dC, sep = "\t" ) )
## [1] 1
# Diseases inferred - XKR4
bq_xkr4_dI <- system.file(
  paste0( "extdata", .Platform$file.sep, "bq_xkr4_disease_inferred.tsv" ), 
  package="CTDquerier"
)
nrow( read.delim( bq_xkr4_dI, sep = "\t" ) )
## [1] 1339
# Diseases all - XKR4
bq_xkr4_dA <- system.file(
  paste0( "extdata", .Platform$file.sep, "bq_xkr4_disease_all.tsv" ), 
  package="CTDquerier"
)
nrow( read.delim( bq_xkr4_dA, sep = "\t" ) )
## [1] 1340

These files show that, according to CTDbase, XKR4 has 18 curated associations with chemicals, 1 curated association with diseases, 1339 inferred associations with diseases and 1340 associations with diseases overall (curated plus inferred). Note that these associations are not unique: a given disease, for instance, can appear in several rows.
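
To make the point about non-unique rows concrete, the following minimal sketch counts rows against unique diseases. It uses a toy data frame with hypothetical identifiers (`D1`, `D2`, `chemA`, ...) that only mimics the column names of the Batch Query disease files; none of its values come from CTDbase.

```r
# Toy stand-in for a Batch Query disease table.
# All identifiers below are hypothetical placeholders, not CTDbase data.
toy <- data.frame(
  DiseaseID             = c("D1", "D1", "D1", "D2"),
  InferenceChemicalName = c("chemA", "chemB", "chemC", "chemA"),
  stringsAsFactors      = FALSE
)

# Number of rows (associations) versus number of distinct diseases
n_rows   <- nrow( toy )
n_unique <- length( unique( toy$DiseaseID ) )

# Rows contributed by each disease; values above 1 mark repeated diseases
rows_per_disease <- table( toy$DiseaseID )
rows_per_disease[ rows_per_disease > 1 ]
```

Applied to a real Batch Query table, the same two counts would show how far the number of rows overstates the number of distinct diseases.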

2.3 CTDquerier

CTDquerier allows downloading the information associated with a single gene or a set of genes by using the function query_ctd_gene:

library( CTDquerier )
xkr4 <- query_ctd_gene( terms = "XKR4", verbose = TRUE )
## Warning in .get_cache(): /home/biocbuild/.cache/CTDQuery
## Using temporary cache /tmp/RtmpRHEOpp/BiocFileCache
## Downloading GENE vocabulary from CTDbase
## Loading gene vocabulary.
## Warning in .get_cache(): /home/biocbuild/.cache/CTDQuery
## Using temporary cache /tmp/RtmpRHEOpp/BiocFileCache
## 1/tmp/RtmpRHEOpp/BiocFileCache/592453443d9_CTD_genes.tsv.gz
## Warning in load_ctd_gene(): 1/tmp/RtmpRHEOpp/BiocFileCache/
## 592453443d9_CTD_genes.tsv.gz
## 1/tmp/RtmpRHEOpp/BiocFileCache/592453443d9_CTD_genes.tsv.gz
## Warning in load_ctd_gene(): 1/tmp/RtmpRHEOpp/BiocFileCache/
## 592453443d9_CTD_genes.tsv.gz
## Staring query for gene 'XKR4' ( 114786 )
##  . Downloading 'gene-gene interaction' table.
##  . Downloading 'disease' table.
##  . Downloading 'gene-chemical interaction' table.
##  . Downloading 'GO terms' table.
##  . Downloading 'KEGG pathways' table.
##  . . No 'KEGG pathways' table available for XKR4' ( 114786 )
xkr4
## Object of class 'CTDdata'
## -------------------------
##  . Type: GENE 
##  . Length: 1 
##  . Items: XKR4 
##  . Diseases: 800 ( NA / 800 )
##  . Gene-gene interactions: 1 ( 1 )
##  . Gene-chemical interactions: 19 ( 30 )
##  . KEGG pathways: 0 (-)
##  . GO terms: 2 ( 2 )

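The summary of the query reports 19 ( 30 ) gene-chemical interactions downloaded from CTDbase, which matches the 19 unique chemicals counted below. Taking a closer look, 16 of the 18 chemical rows obtained from the Batch Query tool are found among them; the remaining gap is presumably due to changes in CTDbase since the Batch Query files were generated.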

# How many unique chemicals are there in the result object?
xkr4_chem <- get_table( xkr4, index_name = "chemical interactions" )
length( unique( xkr4_chem$Chemical.Name ) )
## [1] 19
# How many of the chemicals in the Batch Query file are also in the CTDquerier result?
bq_xkr4_c <- read.delim( bq_xkr4_c, sep = "\t" )
sum( as.character( bq_xkr4_c[ , 2] ) %in% unique( xkr4_chem$Chemical.Name ) )
## [1] 16
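
When the counts do not match, it can help to list which terms differ rather than only counting the overlap. The sketch below defines a small helper, `compare_terms` (a hypothetical name, not part of CTDquerier), built on base R set operations. The toy vectors reuse chemical names that appear in this document, but their split between the two sets is invented for illustration.

```r
# Hypothetical helper (not part of CTDquerier): compare two sets of terms
compare_terms <- function( reference, query ) {
  reference <- unique( as.character( reference ) )
  query     <- unique( as.character( query ) )
  list(
    shared         = intersect( reference, query ),  # present in both
    only_reference = setdiff( reference, query ),    # missing from query
    only_query     = setdiff( query, reference )     # missing from reference
  )
}

# Toy example: set membership here is made up for illustration
toy <- compare_terms(
  reference = c( "Tretinoin", "Valproic Acid", "Propylthiouracil" ),
  query     = c( "Valproic Acid", "Arsenic", "Tretinoin" )
)
toy$only_reference
toy$only_query
```

With the objects created in this section, a call such as `compare_terms( bq_xkr4_c[ , 2], xkr4_chem$Chemical.Name )` should list the chemicals that appear in only one of the two sources.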

On the side of disease associations, the data retrieved for XKR4 with CTDquerier indicates that there are 800 gene-disease associations.

dim( get_table( xkr4, index_name = "diseases" ) )
## [1] 800   8

The 1340 associations obtained from Batch Query reduce to 762 once filtered by unique disease, and 761 of those diseases are found among the 800 gene-disease associations retrieved with CTDquerier:

bq_xkr4_dA <- read.delim( bq_xkr4_dA, sep = "\t" )
length( unique( bq_xkr4_dA$DiseaseID ) )
## [1] 762
sum( as.character( unique( bq_xkr4_dA$DiseaseID ) ) %in% 
    get_table( xkr4, index_name = "diseases" )$Disease.ID )
## [1] 761

The difference in the number of associations between the results obtained from Batch Query and from CTDquerier comes from the way the chemicals are nested in each table. In the results from Batch Query there is one row for each association:

bq_xkr4_dA[1:3, ]
##   X..Input    DiseaseName    DiseaseID GeneSymbol GeneID  DiseaseCategories
## 1     xkr4 Abdominal Pain MESH:D015746       XKR4 114786 Signs and symptoms
## 2     xkr4 Abdominal Pain MESH:D015746       XKR4 114786 Signs and symptoms
## 3     xkr4 Abdominal Pain MESH:D015746       XKR4 114786 Signs and symptoms
##   DiseaseCategories.1 DirectEvidence InferenceChemicalName InferenceScore
## 1  Signs and symptoms                     Propylthiouracil          10.25
## 2  Signs and symptoms                            Tretinoin          10.25
## 3  Signs and symptoms                        Valproic Acid          10.25
##   OmimIDs         PubMedIDs
## 1      NA 15822032|15879050
## 2      NA           9234591
## 3      NA           6206716

In the results from CTDquerier there is a single entry per disease, instead of one entry for each disease-chemical pair as in the previous table from Batch Query. For example, the results from CTDquerier contain a single entry for Abdominal Pain, with the associated chemicals joined in a single string in the column Inference.Network:

tbl <- get_table( xkr4, index_name = "diseases" )
tbl[ tbl$Disease.ID == "MESH:D015746", "Inference.Network" ]
## [1] "Arsenic|Propylthiouracil|Tretinoin|Valproic Acid"
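
The relation between the two layouts can be sketched in base R by collapsing a Batch Query-style table into one row per disease. This is only an illustration of the nesting, not the way CTDquerier builds its tables internally. The toy data frame copies the three Abdominal Pain rows printed above, so the result lacks Arsenic, which appears in the CTDquerier output but not in that excerpt.

```r
# Toy table: the three Abdominal Pain rows shown in the Batch Query excerpt
bq_toy <- data.frame(
  DiseaseID             = rep( "MESH:D015746", 3 ),
  DiseaseName           = rep( "Abdominal Pain", 3 ),
  InferenceChemicalName = c( "Propylthiouracil", "Tretinoin", "Valproic Acid" ),
  stringsAsFactors      = FALSE
)

# One row per disease; chemicals are de-duplicated, sorted and joined with "|"
nested <- aggregate(
  InferenceChemicalName ~ DiseaseID + DiseaseName,
  data = bq_toy,
  FUN  = function( x ) paste( sort( unique( x ) ), collapse = "|" )
)

# Rename the collapsed column after the CTDquerier column it imitates
names( nested )[ 3 ] <- "Inference.Network"
nested
```

The three input rows collapse into a single row whose Inference.Network value is the pipe-separated list of chemicals, which is the shape seen in the CTDquerier table.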

3 Session Info

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] CTDquerier_1.5.0 BiocStyle_2.13.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1          compiler_3.6.0      pillar_1.3.1       
##  [4] BiocManager_1.30.4  dbplyr_1.4.0        bitops_1.0-6       
##  [7] tools_3.6.0         digest_0.6.18       bit_1.1-14         
## [10] BiocFileCache_1.9.0 RSQLite_2.1.1       evaluate_0.13      
## [13] memoise_1.1.0       tibble_2.1.1        pkgconfig_2.0.2    
## [16] rlang_0.3.4         DBI_1.0.0           curl_3.3           
## [19] yaml_2.2.0          parallel_3.6.0      xfun_0.6           
## [22] httr_1.4.0          stringr_1.4.0       dplyr_0.8.0.1      
## [25] knitr_1.22          S4Vectors_0.23.0    rappdirs_0.3.1     
## [28] tidyselect_0.2.5    stats4_3.6.0        bit64_0.9-7        
## [31] glue_1.3.1          R6_2.4.0            rmarkdown_1.12     
## [34] bookdown_0.9        purrr_0.3.2         blob_1.1.1         
## [37] magrittr_1.5        htmltools_0.3.6     BiocGenerics_0.31.0
## [40] stringdist_0.9.5.1  assertthat_0.2.1    stringi_1.4.3      
## [43] RCurl_1.95-4.12     crayon_1.3.4

Bibliography

1. Mattingly CJ, Colby GT, et al. The Comparative Toxicogenomics Database (CTD). 2003.

2. Davis AP, Grondin CJ, et al. The Comparative Toxicogenomics Database: update 2017. 2017.