Creating data metrics object

0.1 Data metrics object

Information about the dataMetrics object can be found in the article Data metrics object. If a researcher is using the bigPint package to plot RNA-seq data, then many will create the dataMetrics object using popular RNA-seq packages such as edgeR (Robinson, McCarthy, and Smyth 2010), DESeq2 (Love, Huber, and Anders 2014), and limma (Ritchie et al. 2015). These R packages will output several interesting quantitative variables for each gene in the dataset, such as log fold change, p-value, and FDR. These quantitative variables can be incorporated into the dataMetrics object and determine by thresholds what subset of genes will be superimposed onto plots.

0.2 Example: two treatments

The following example shows how to create the dataMetrics object called soybean_ir_sub_metrics, which was shown in the article Data metrics object (Lauter and Graham 2016). The dataset from which it is created (soybean_ir_sub) contains only two treatment groups, N and P. In this case, the edgeR (Robinson, McCarthy, and Smyth 2010) package was primarily followed.

library(bigPint)
library(edgeR)
library(data.table)

data(soybean_ir_sub)
data = soybean_ir_sub
rownames(data) = data[,1]

y = DGEList(counts=data[,-1])
group = c(1,1,1,2,2,2)
y = DGEList(counts=y, group=group)
Group = factor(c(rep("N",3), rep("P",3)))
design <- model.matrix(~0+Group, data=y$samples)
colnames(design) <- levels(Group)
y <- estimateDisp(y, design)
fit <- glmFit(y, design)

soybean_ir_sub_metrics <- list()

for (i in 1:(ncol(fit)-1)){
  for (j in (i+1):ncol(fit)){
    contrast=rep(0,ncol(fit))
    contrast[i]=1
    contrast[j]=-1
    lrt <- glmLRT(fit, contrast=contrast)
    lrt <- topTags(lrt, n = nrow(y[[1]]))[[1]]
    
    setDT(lrt, keep.rownames = TRUE)[]
    colnames(lrt)[1] = "ID"
    lrt <- as.data.frame(lrt)
    
    soybean_ir_sub_metrics[[paste0(colnames(fit)[i], "_", colnames(fit)[j])]] <- lrt
  }
}

We can indeed examine the generated soybean_ir_sub_metrics object as follows:

str(soybean_ir_sub_metrics, strict.width = "wrap")

## List of 1
## $ N_P:'data.frame': 5604 obs. of 6 variables:
## ..$ ID : chr [1:5604] "Glyma.19G168700.Wm82.a2.v1" "Glyma.13G293500.Wm82.a2.v1"
##    "Glyma.05G188700.Wm82.a2.v1" "Glyma.13G173100.Wm82.a2.v1" ...
## ..$ logFC : num [1:5604] -5.92 2.99 -3.51 -3.91 -3.51 ...
## ..$ logCPM: num [1:5604] 7.52 8.08 8.83 8.27 10.19 ...
## ..$ LR : num [1:5604] 266 171 167 157 154 ...
## ..$ PValue: num [1:5604] 9.18e-60 3.65e-39 2.73e-38 6.04e-36 2.58e-35 ...
## ..$ FDR : num [1:5604] 5.14e-56 1.02e-35 5.09e-35 8.46e-33 2.89e-32 ...

And verify that it contains one list element:

names(soybean_ir_sub_metrics)

## [1] "N_P"

0.3 Example: three treatments

The following example shows how to create the dataMetrics object called soybean_cn_sub_metrics, which was shown in the article Data metrics object). The dataset from which it is created (soybean_cn_sub) contains three treatment groups, S1, S2, and S3 (Brown and Hudson 2015). In this case, the edgeR (Robinson, McCarthy, and Smyth 2010) package was primarily followed.

library(edgeR)
library(data.table)

data(soybean_cn_sub)
data = soybean_cn_sub
rownames(data) = data[,1]

y = DGEList(counts=data[,-1])
group = c(1,1,1,2,2,2,3,3,3)
y = DGEList(counts=y, group=group)
Group = factor(c(rep("S1",3), rep("S2",3), rep("S3",3)))
design <- model.matrix(~0+Group, data=y$samples)
colnames(design) <- levels(Group)
y <- estimateDisp(y, design)
fit <- glmFit(y, design)

soybean_cn_sub_metrics <- list()

for (i in 1:(ncol(fit)-1)){
  for (j in (i+1):ncol(fit)){
    contrast=rep(0,ncol(fit))
    contrast[i]=1
    contrast[j]=-1
    lrt <- glmLRT(fit, contrast=contrast)
    lrt <- topTags(lrt, n = nrow(y[[1]]))[[1]]
    
    setDT(lrt, keep.rownames = TRUE)[]
    colnames(lrt)[1] = "ID"
    lrt <- as.data.frame(lrt)
    
    soybean_cn_sub_metrics[[paste0(colnames(fit)[i], "_", colnames(fit)[j])]] <- lrt
  }
}

We can indeed examine the generated soybean_cn_sub_metrics object as follows:

str(soybean_cn_sub_metrics, strict.width = "wrap")

## List of 3
## $ S1_S2:'data.frame': 7332 obs. of 6 variables:
## ..$ ID : chr [1:7332] "Glyma18g00690.1" "Glyma08g22380.1" "Glyma20g30460.1"
##    "Glyma07g09700.1" ...
## ..$ logFC : num [1:7332] 3.14 2.62 2.5 2.37 2.74 ...
## ..$ logCPM: num [1:7332] 8.04 8.05 8.14 8.19 7.93 ...
## ..$ LR : num [1:7332] 29.3 24.8 24 22.7 21 ...
## ..$ PValue: num [1:7332] 6.08e-08 6.50e-07 9.82e-07 1.94e-06 4.57e-06 ...
## ..$ FDR : num [1:7332] 0.000446 0.002384 0.0024 0.003557 0.006697 ...
## $ S1_S3:'data.frame': 7332 obs. of 6 variables:
## ..$ ID : chr [1:7332] "Glyma08g22380.1" "Glyma08g19290.1" "Glyma20g30460.1"
##    "Glyma18g00690.1" ...
## ..$ logFC : num [1:7332] 3.18 3.32 2.44 2.46 3.03 ...
## ..$ logCPM: num [1:7332] 8.05 8.14 8.14 8.04 7.85 ...
## ..$ LR : num [1:7332] 30.4 24.7 23.3 22.5 22.3 ...
## ..$ PValue: num [1:7332] 3.60e-08 6.53e-07 1.42e-06 2.09e-06 2.30e-06 ...
## ..$ FDR : num [1:7332] 0.000264 0.002393 0.003378 0.003378 0.003378 ...
## $ S2_S3:'data.frame': 7332 obs. of 6 variables:
## ..$ ID : chr [1:7332] "Glyma08g14670.3" "Glyma08g14670.2" "Glyma08g19290.1"
##    "Glyma06g46701.1" ...
## ..$ logFC : num [1:7332] 2.66 2.72 2.78 2.19 2.29 ...
## ..$ logCPM: num [1:7332] 7.86 7.82 8.14 7.63 7.84 ...
## ..$ LR : num [1:7332] 16.1 16 14.3 11.8 10.5 ...
## ..$ PValue: num [1:7332] 6.01e-05 6.30e-05 1.57e-04 5.81e-04 1.21e-03 ...
## ..$ FDR : num [1:7332] 0.231 0.231 0.385 1 1 ...

And verify that it contains three list element:

names(soybean_cn_sub_metrics)

## [1] "S1_S2" "S1_S3" "S2_S3"

References

Brown, Anne V., and Karen A. Hudson. 2015. “Developmental Profiling of Gene Expression in Soybean Trifoliate Leaves and Cotyledons.” BMC Plant Biology 15 (1). BioMed Central:169.

Lauter, AN Moran, and MA Graham. 2016. “NCBI Sra Bioproject Accession: PRJNA318409.”

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biology 15 (12). BioMed Central:550.

Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for Rna-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7). Oxford University Press:e47–e47.

Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1). Oxford University Press:139–40.

Creating data metrics object

2020-03-18

Package

0.1 Data metrics object

0.2 Example: two treatments

0.3 Example: three treatments

References