systemPipeR 1.20.0
Ribo-Seq and polyRibo-Seq are a specific form of RNA-Seq gene expression experiments utilizing mRNA subpopulations directly bound to ribosomes. Compared to standard RNA-Seq, their readout of gene expression provides a better approximation of downstream protein abundance profiles due to their close association with translational processes. The most important difference among the two is that polyRibo-Seq utilizes polyribosomal RNA for sequencing, whereas Ribo-Seq is a footprinting approach restricted to sequencing RNA fragments protected by ribosomes (Ingolia et al. 2009; Aspden et al. 2014; Juntawong et al. 2015).
The workflow presented in this vignette contains most of the data analysis steps described by (Juntawong et al. 2014) including functionalities useful for processing both polyRibo-Seq and Ribo-Seq experiments. To improve re-usability and adapt to recent changes of software versions (e.g. R, Bioconductor and short read aligners), the code has been optimized accordingly. Thus, the results obtained with the updated workflow are expected to be similar but not necessarily identical with the published results described in the original paper.
Relevant analysis steps of this workflow include read preprocessing, read
alignments against a reference genome, counting of reads overlapping with a
wide range of genomic features (e.g. CDSs, UTRs, uORFs, rRNAs, etc.),
differential gene expression and differential ribosome binding analyses, as
well as a variety of genome-wide summary plots for visualizing RNA expression
trends. Functions are provided for evaluating the quality of Ribo-seq data,
for identifying novel expressed regions in the genomes, and for gaining
insights into gene regulation at the post-transcriptional and translational
levels. For example, the functions genFeatures
and
featuretypeCounts
can be used to quantify the expression output for
all feature types included in a genome annotation (e.g.
genes,
introns, exons, miRNAs, intergenic regions, etc.). To determine the approximate
read length of ribosome footprints in Ribo-Seq experiments, these feature type
counts can be obtained and plotted for specific read lengths separately.
Typically, the most abundant read length obtained for translated features
corresponds to the approximate footprint length occupied by the ribosomes of a
given organism group. Based on the results from several Ribo-Seq studies, these
ribosome footprints are typically ~30 nucleotides long
(Ingolia, Lareau, and Weissman 2011; Ingolia et al. 2009; Juntawong et al. 2014). However, their
length can vary by several nucleotides depending upon the optimization of the
RNA digestion step and various factors associated with translational
regulation. For quality control purposes of Ribo-Seq experiments it is also
useful to monitor the abundance of reads mapping to rRNA genes due to the high
rRNA content of ribosomes. This information can be generated with the
featuretypeCounts
function described above.
Coverage trends along transcripts summarized for any number of transcripts can
be obtained and plotted with the functions featureCoverage
and
plotfeatureCoverage
, respectively. Their results allow monitoring
of the phasing of ribosome movements along triplets of coding sequences.
Commonly, high quality data will display here for the first nucleotide of each
codon the highest depth of coverage computed for the 5’ ends of the aligned
reads.
Ribo-seq data can also be used to evaluate various aspects of translational
control due to ribosome occupancy in upstream open reading frames (uORFs). The
latter are frequently present in (or near) 5’ UTRs of transcripts. For this,
the function predORFs
can be used to identify ORFs in the
nucleotide sequences of transcripts or their subcomponents such as UTR regions.
After scaling the resulting ORF coordinates back to the corresponding genome
locations using scaleRanges
, one can use these novel features
(e.g. uORFs) for expression analysis routines similar to those
employed for pre-existing annotations, such as the exonic regions of genes. For
instance, in Ribo-Seq experiments one can use this approach to systematically identify all transcripts occupied by ribosomes in their uORF regions. The binding of
ribosomes to uORF regions may indicate a regulatory role in the translation of
the downstream main ORFs and/or translation of the uORFs into functionally
relevant peptides.
Typically, users want to specify here all information relevant for the analysis of their NGS study. This includes detailed descriptions of FASTQ files, experimental design, reference genome, gene annotations, etc.
The following loads one of the available NGS workflow templates (here RIBO-Seq) into the user’s current working directory. At the moment, the package includes workflow templates for RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq. Templates for additional NGS applications will be provided in the future. The sample data are described here.
library(systemPipeRdata)
genWorkenvir(workflow = "riboseq", bam = TRUE)
setwd("riboseq")
In case there is a newer version of this tutorial, download its systemPipeRIBOseq.Rmd
source and open it in your R IDE (e.g. vim-r or RStudio).
download.file("https://raw.githubusercontent.com/tgirke/systemPipeRdata/master/vignettes/systemPipeRIBOseq.Rmd",
"systemPipeRIBOseq.Rmd")
The systemPipeR
package needs to be loaded to perform the analysis
steps shown in this report (Girke 2014). The package allows users
to run the entire analysis workflow interactively or with a single command
while also generating the corresponding analysis report. For details
see systemPipeR's
main vignette.
library(systemPipeR)
In the workflow environments generated by genWorkenvir
all data
inputs are stored in a data/
directory and all analysis results will
be written to a separate results/
directory, while the
systemPipeRIBOseq.Rmd
script and the targets
file are
expected to be located in the parent directory. The R session is expected to
run from this parent directory. Additional parameter files are stored under
param/
.
To work with real data, users want to organize their own data similarly and
substitute all test data for their own data. To rerun an established workflow
on new data, the initial targets
file along with the corresponding
FASTQ files are usually the only inputs the user needs to provide.
If applicable users can load custom functions not provided by
systemPipeR
. Skip this step if this is not the case.
source("systemPipeRIBOseq_Fct.R")
targets
fileThe targets
file defines all FASTQ files and sample comparisons of the analysis workflow.
targetspath <- system.file("extdata", "targets.txt", package = "systemPipeR")
targets <- read.delim(targetspath, comment.char = "#")[, 1:4]
targets
## FileName SampleName Factor SampleLong
## 1 ./data/SRR446027_1.fastq.gz M1A M1 Mock.1h.A
## 2 ./data/SRR446028_1.fastq.gz M1B M1 Mock.1h.B
## 3 ./data/SRR446029_1.fastq.gz A1A A1 Avr.1h.A
## 4 ./data/SRR446030_1.fastq.gz A1B A1 Avr.1h.B
## 5 ./data/SRR446031_1.fastq.gz V1A V1 Vir.1h.A
## 6 ./data/SRR446032_1.fastq.gz V1B V1 Vir.1h.B
## 7 ./data/SRR446033_1.fastq.gz M6A M6 Mock.6h.A
## 8 ./data/SRR446034_1.fastq.gz M6B M6 Mock.6h.B
## 9 ./data/SRR446035_1.fastq.gz A6A A6 Avr.6h.A
## 10 ./data/SRR446036_1.fastq.gz A6B A6 Avr.6h.B
## 11 ./data/SRR446037_1.fastq.gz V6A V6 Vir.6h.A
## 12 ./data/SRR446038_1.fastq.gz V6B V6 Vir.6h.B
## 13 ./data/SRR446039_1.fastq.gz M12A M12 Mock.12h.A
## 14 ./data/SRR446040_1.fastq.gz M12B M12 Mock.12h.B
## 15 ./data/SRR446041_1.fastq.gz A12A A12 Avr.12h.A
## 16 ./data/SRR446042_1.fastq.gz A12B A12 Avr.12h.B
## 17 ./data/SRR446043_1.fastq.gz V12A V12 Vir.12h.A
## 18 ./data/SRR446044_1.fastq.gz V12B V12 Vir.12h.B
The following custom function trims adaptors hierarchically from the longest to
the shortest match of the right end of the reads. If internalmatch=TRUE
then internal matches will trigger the same behavior. The argument minpatternlength
defines the shortest adaptor match to consider in this iterative process. In addition, the function removes reads containing Ns or homopolymer regions. More detailed information on read preprocessing is provided in systemPipeR's
main vignette.
First, we construct SYSargs2
object from cwl
and yml
param and targets
files.
dir_path <- system.file("extdata/cwl/preprocessReads/trim-se",
package = "systemPipeR")
trim <- loadWorkflow(targets = targetspath, wf_file = "trim-se.cwl",
input_file = "trim-se.yml", dir_path = dir_path)
trim <- renderWF(trim, inputvars = c(FileName = "_FASTQ_PATH1_",
SampleName = "_SampleName_"))
trim
output(trim)[1:2]
Next, we execute the code for trimming all the raw data.
fctpath <- system.file("extdata", "custom_Fct.R", package = "systemPipeR")
source(fctpath)
iterTrim <- ".iterTrimbatch1(fq, pattern='ACACGTCT', internalmatch=FALSE, minpatternlength=6, Nnumber=1, polyhomo=50, minreadlength=16, maxreadlength=101)"
preprocessReads(args = trim, Fct = iterTrim, batchsize = 1e+05,
overwrite = TRUE, compress = TRUE)
writeTargetsout(x = trim, file = "targets_trim.txt", step = 1,
new_col = "FileName", new_col_output_index = 1, overwrite = TRUE)
The following seeFastq
and seeFastqPlot
functions generate and plot a series of
useful quality statistics for a set of FASTQ files including per cycle quality
box plots, base proportions, base-level quality trends, relative k-mer
diversity, length and occurrence distribution of reads, number of reads above
quality cutoffs and mean quality distribution. The results are written to a PDF file named fastqReport.png
.
fqlist <- seeFastq(fastq = infile1(trim), batchsize = 10000,
klength = 8)
png("./results/fastqReport.png", height = 18, width = 4 * length(fqlist),
units = "in", res = 72)
seeFastqPlot(fqlist)
dev.off()
Bowtie2/Tophat2
The NGS reads of this project will be aligned against the reference
genome sequence using Bowtie2/TopHat2
(Kim et al. 2013; Langmead and Salzberg 2012) in both
interactive job submissions and batch submissions to queuing systems of clusters
using the systemPipeR's
new CWL command-line interface.
Build Bowtie2
index.
dir_path <- system.file("extdata/cwl/bowtie2/bowtie2-idx", package = "systemPipeR")
idx <- loadWorkflow(targets = NULL, wf_file = "bowtie2-index.cwl",
input_file = "bowtie2-index.yml", dir_path = dir_path)
idx <- renderWF(idx)
idx
cmdlist(idx)
## Run in single machine
runCommandline(idx, make_bam = FALSE)
The parameter settings of the aligner are defined in the tophat2-mapping-pe.cwl
and tophat2-mapping-pe.yml
files. The following shows how to construct the
corresponding SYSargs2 object, here args.
dir_path <- system.file("extdata/cwl/tophat2/tophat2-se", package = "systemPipeR")
args <- loadWorkflow(targets = targetspath, wf_file = "tophat2-mapping-se.cwl",
input_file = "tophat2-mapping-se.yml", dir_path = dir_path)
args <- renderWF(args, inputvars = c(FileName = "_FASTQ_PATH1_",
SampleName = "_SampleName_"))
args
cmdlist(args)[1:2]
output(args)[1:2]
## Run in single machine
args <- runCommandline(args, make_bam = TRUE)
Submission of alignment jobs to compute cluster, here using 72 CPU cores
(18 qsub
processes each with 4 CPU cores).
## Run on the cluster
moduleload(modules(args))
resources <- list(walltime = 120, ntasks = 1, ncpus = 4, memory = 1024)
reg <- clusterRun(args, FUN = runCommandline, more.args = list(args = args,
make_bam = TRUE, dir = FALSE), conffile = ".batchtools.conf.R",
template = "batchtools.slurm.tmpl", Njobs = 18, runid = "01",
resourceList = resources)
waitForJobs(reg = reg)
HISAT2
The following steps will demonstrate how to use the short read aligner Hisat2
(Kim, Langmead, and Salzberg 2015) in both interactive job submissions and batch submissions to
queuing systems of clusters using the systemPipeR's
new CWL command-line interface.
Build Hisat2
index.
dir_path <- system.file("extdata/cwl/hisat2/hisat2-idx", package = "systemPipeR")
idx <- loadWorkflow(targets = NULL, wf_file = "hisat2-index.cwl",
input_file = "hisat2-index.yml", dir_path = dir_path)
idx <- renderWF(idx)
idx
cmdlist(idx)
## Run
runCommandline(idx, make_bam = FALSE)
The parameter settings of the aligner are defined in the hisat2-mapping-se.cwl
and hisat2-mapping-se.yml
files. The following shows how to construct the
corresponding SYSargs2 object, here args.
dir_path <- system.file("extdata/cwl/hisat2/hisat2-se", package = "systemPipeR")
args <- loadWorkflow(targets = targetspath, wf_file = "hisat2-mapping-se.cwl",
input_file = "hisat2-mapping-se.yml", dir_path = dir_path)
args <- renderWF(args, inputvars = c(FileName = "_FASTQ_PATH1_",
SampleName = "_SampleName_"))
args
cmdlist(args)[1:2]
output(args)[1:2]
## Run
args <- runCommandline(args)
library(batchtools)
resources <- list(walltime = 120, ntasks = 1, ncpus = 4, memory = 1024)
reg <- clusterRun(args, FUN = runCommandline, more.args = list(args = args,
make_bam = TRUE, dir = FALSE), conffile = ".batchtools.conf.R",
template = "batchtools.slurm.tmpl", Njobs = 18, runid = "01",
resourceList = resources)
getStatus(reg = reg)
waitForJobs(reg = reg)
Check whether all BAM files have been created.
outpaths <- subsetWF(args, slot = "output", subset = 1, index = 1)
file.exists(outpaths)
The following provides an overview of the number of reads in each sample and how many of them aligned to the reference.
read_statsDF <- alignStats(args = args)
write.table(read_statsDF, "results/alignStats.xls", row.names = FALSE,
quote = FALSE, sep = "\t")
read.table(system.file("extdata", "alignStats.xls", package = "systemPipeR"),
header = TRUE)[1:4, ]
## FileName Nreads2x Nalign Perc_Aligned Nalign_Primary
## 1 M1A 192918 177961 92.24697 177961
## 2 M1B 197484 159378 80.70426 159378
## 3 A1A 189870 176055 92.72397 176055
## 4 A1B 188854 147768 78.24457 147768
## Perc_Aligned_Primary
## 1 92.24697
## 2 80.70426
## 3 92.72397
## 4 78.24457
The symLink2bam
function creates symbolic links to view the BAM alignment files in a
genome browser such as IGV. The corresponding URLs are written to a file
with a path specified under urlfile
in the results
directory.
symLink2bam(sysargs = args, htmldir = c("~/.html/", "projects/tests/"),
urlbase = "http://biocluster.ucr.edu/~tgirke/", urlfile = "./results/IGVurl.txt")
The genFeatures
function generates a variety of feature types from
TxDb
objects using utilities provided by the GenomicFeatures
package.
The first step is the generation of the feature type ranges based on
annotations provided by a GFF file that can be transformed into a
TxDb
object. This includes ranges for mRNAs, exons, introns, UTRs,
CDSs, miRNAs, rRNAs, tRNAs, promoter and intergenic regions. In addition, any
number of custom annotations can be included in this routine.
library(GenomicFeatures)
txdb <- makeTxDbFromGFF(file = "data/tair10.gff", format = "gff3",
organism = "Arabidopsis")
feat <- genFeatures(txdb, featuretype = "all", reduce_ranges = TRUE,
upstream = 1000, downstream = 0, verbose = TRUE)
The featuretypeCounts
function counts how many reads in short read
alignment files (BAM format) overlap with entire annotation categories. This
utility is useful for analyzing the distribution of the read mappings across
feature types, e.g. coding versus non-coding genes. By default the
read counts are reported for the sense and antisense strand of each feature
type separately. To minimize memory consumption, the BAM files are processed in
a stream using utilities from the Rsamtools
and
GenomicAlignment
packages. The counts can be reported for each read
length separately or as a single value for reads of any length. Subsequently,
the counting results can be plotted with the associated
plotfeaturetypeCounts
function.
The following generates and plots feature counts for any read length.
library(ggplot2)
library(grid)
outpaths <- subsetWF(args, slot = "output", subset = 1, index = 1)
fc <- featuretypeCounts(bfl = BamFileList(outpaths, yieldSize = 50000),
grl = feat, singleEnd = TRUE, readlength = NULL, type = "data.frame")
p <- plotfeaturetypeCounts(x = fc, graphicsfile = "results/featureCounts.png",
graphicsformat = "png", scales = "fixed", anyreadlength = TRUE,
scale_length_val = NULL)
To determine the approximate read length of ribosome footprints in Ribo-Seq experiments, one can generate and plot the feature counts for specific read lengths separately. Typically, the most abundant read length obtained for translated features corresponds to the approximate footprint length occupied by the ribosomes.
fc2 <- featuretypeCounts(bfl = BamFileList(outpaths, yieldSize = 50000),
grl = feat, singleEnd = TRUE, readlength = c(74:76, 99:102),
type = "data.frame")
p2 <- plotfeaturetypeCounts(x = fc2, graphicsfile = "results/featureCounts2.png",
graphicsformat = "png", scales = "fixed", anyreadlength = FALSE,
scale_length_val = NULL)
The function predORF
can be used to identify open reading frames (ORFs) and coding sequences (CDSs) in DNA sequences provided as DNAString
or DNAStringSet
objects. The setting mode='ORF'
returns continuous reading frames that begin with a start codon and end with a stop codon, while mode='CDS'
returns continuous reading frames that do not need to begin or end with start or stop codons, respectively. Non-canonical start and stop condons are supported by allowing the user to provide any custom set of triplets under the startcodon
and stopcodon
arguments (i.e.
non-ATG start codons). The argument n
defines the maximum number of ORFs to return for each input sequence (e.g. n=1
returns only the longest ORF). It also supports the identification of overlapping and nested ORFs. Alternatively, one can return all non-overlapping ORFs including the longest ORF for each input sequence with n="all"
and longest_disjoint=TRUE
.
library(systemPipeRdata)
library(GenomicFeatures)
library(rtracklayer)
txdb <- makeTxDbFromGFF(file = "data/tair10.gff", format = "gff3",
organism = "Arabidopsis")
futr <- fiveUTRsByTranscript(txdb, use.names = TRUE)
dna <- extractTranscriptSeqs(FaFile("data/tair10.fasta"), futr)
uorf <- predORF(dna, n = "all", mode = "orf", longest_disjoint = TRUE,
strand = "sense")
To use the predicted ORF ranges for expression analysis given genome alignments
as input, it is necessary to scale them to the corresponding genome
coordinates. The function scaleRanges
does this by transforming the
mappings of spliced features (query ranges) to their corresponding genome
coordinates (subject ranges). The method accounts for introns in the subject
ranges that are absent in the query ranges. The above uORFs predicted in the
provided 5’ UTRs sequences using predORF
are a typical use case
for this application. These query ranges are given relative to the 5’ UTR
sequences and scaleRanges
will convert them to the corresponding
genome coordinates. The resulting GRangesList
object (here grl_scaled
)
can be directly used for read counting.
grl_scaled <- scaleRanges(subject = futr, query = uorf, type = "uORF",
verbose = TRUE)
export.gff3(unlist(grl_scaled), "results/uorf.gff")
To confirm the correctness of the obtained uORF ranges, one can parse their
corresponding DNA sequences from the reference genome with the getSeq
function and then translate them with the translate
function into
proteins. Typically, the returned protein sequences should start with a
M
(corresponding to start codon) and end with *
(corresponding to stop codon). The following example does this for a single uORF containing three exons.
translate(unlist(getSeq(FaFile("data/tair10.fasta"), grl_scaled[[7]])))
If required custom feature ranges can be added to the standard features generated with the genFeatures
function above. The following does this for the uORF ranges predicted with predORF
.
feat <- genFeatures(txdb, featuretype = "all", reduce_ranges = FALSE)
feat <- c(feat, GRangesList(uORF = unlist(grl_scaled)))
The following identifies continuous ORFs in intergenic regions. Note,
predORF
can only identify continuous ORFs in query sequences. The
function does not identify and remove introns prior to the ORF prediction.
feat <- genFeatures(txdb, featuretype = "intergenic", reduce_ranges = TRUE)
intergenic <- feat$intergenic
strand(intergenic) <- "+"
dna <- getSeq(FaFile("data/tair10.fasta"), intergenic)
names(dna) <- mcols(intergenic)$feature_by
sorf <- predORF(dna, n = "all", mode = "orf", longest_disjoint = TRUE,
strand = "both")
sorf <- sorf[width(sorf) > 60] # Remove sORFs below length cutoff, here 60bp
intergenic <- split(intergenic, mcols(intergenic)$feature_by)
grl_scaled_intergenic <- scaleRanges(subject = intergenic, query = sorf,
type = "sORF", verbose = TRUE)
export.gff3(unlist(grl_scaled_intergenic), "sorf.gff")
translate(getSeq(FaFile("data/tair10.fasta"), unlist(grl_scaled_intergenic)))
The featureCoverage
function computes the read coverage along
single and multi component features based on genomic alignments. The coverage
segments of component features are spliced to continuous ranges, such as exons
to transcripts or CDSs to ORFs. The results can be obtained with single
nucleotide resolution (e.g. around start and stop codons) or as mean coverage
of relative bin sizes, such as 100 bins for each feature. The latter allows
comparisons of coverage trends among transcripts of variable length. Additionally,
the results can be obtained for single or many features (e.g. any number of
transcripts) at once. Visualization of the coverage results is facilitated by
the downstream plotfeatureCoverage
function.
grl <- cdsBy(txdb, "tx", use.names = TRUE)
fcov <- featureCoverage(bfl = BamFileList(outpaths[1:2]), grl = grl[1:4],
resizereads = NULL, readlengthrange = NULL, Nbins = 20, method = mean,
fixedmatrix = FALSE, resizefeatures = TRUE, upstream = 20,
downstream = 20, outfile = "results/featureCoverage.xls",
overwrite = TRUE)
fcov <- featureCoverage(bfl = BamFileList(outpaths[1:4]), grl = grl[1:12],
resizereads = NULL, readlengthrange = NULL, Nbins = NULL,
method = mean, fixedmatrix = TRUE, resizefeatures = TRUE,
upstream = 20, downstream = 20, outfile = "results/featureCoverage.xls",
overwrite = TRUE)
plotfeatureCoverage(covMA = fcov, method = mean, scales = "fixed",
extendylim = 2, scale_count_val = 10^6)
library(ggplot2)
library(grid)
fcov <- featureCoverage(bfl = BamFileList(outpaths[1:4]), grl = grl[1:4],
resizereads = NULL, readlengthrange = NULL, Nbins = 20, method = mean,
fixedmatrix = TRUE, resizefeatures = TRUE, upstream = 20,
downstream = 20, outfile = "results/featureCoverage.xls",
overwrite = TRUE)
png("./results/featurePlot.png", height = 12, width = 24, units = "in",
res = 72)
plotfeatureCoverage(covMA = fcov, method = mean, scales = "fixed",
extendylim = 2, scale_count_val = 10^6)
dev.off()
fcov <- featureCoverage(bfl = BamFileList(outpaths[1:2]), grl = grl[1],
resizereads = NULL, readlengthrange = NULL, Nbins = NULL,
method = mean, fixedmatrix = FALSE, resizefeatures = TRUE,
upstream = 20, downstream = 20, outfile = NULL)
summarizeOverlaps
in parallel mode using multiple coresReads overlapping with annotation ranges of interest are counted for each
sample using the summarizeOverlaps
function (Lawrence et al. 2013). The
read counting is preformed for exonic gene regions in a non-strand-specific
manner while ignoring overlaps among different genes. Subsequently, the
expression count values are normalized by (RPKM). The raw read count table (countDFeByg.xls
) and the corresponding
RPKM table (rpkmDFeByg.xls
) are written to
separate files in the results
directory of this project.
Parallelization is achieved with the BiocParallel
package, here
using 8 CPU cores.
library("GenomicFeatures")
library(BiocParallel)
txdb <- loadDb("./data/tair10.sqlite")
eByg <- exonsBy(txdb, by = c("gene"))
bfl <- BamFileList(outpaths, yieldSize = 50000, index = character())
multicoreParam <- MulticoreParam(workers = 8)
register(multicoreParam)
registered()
counteByg <- bplapply(bfl, function(x) summarizeOverlaps(eByg,
x, mode = "Union", ignore.strand = TRUE, inter.feature = FALSE,
singleEnd = TRUE))
countDFeByg <- sapply(seq(along = counteByg), function(x) assays(counteByg[[x]])$counts)
rownames(countDFeByg) <- names(rowRanges(counteByg[[1]]))
colnames(countDFeByg) <- names(bfl)
rpkmDFeByg <- apply(countDFeByg, 2, function(x) returnRPKM(counts = x,
ranges = eByg))
write.table(countDFeByg, "results/countDFeByg.xls", col.names = NA,
quote = FALSE, sep = "\t")
write.table(rpkmDFeByg, "results/rpkmDFeByg.xls", col.names = NA,
quote = FALSE, sep = "\t")
Sample of data slice of count table
read.delim("results/countDFeByg.xls", row.names = 1, check.names = FALSE)[1:4,
1:5]
Sample of data slice of RPKM table
read.delim("results/rpkmDFeByg.xls", row.names = 1, check.names = FALSE)[1:4,
1:4]
Note, for most statistical differential expression or abundance analysis
methods, such as edgeR
or DESeq2
, the raw count values
should be used as input. The usage of RPKM values should be restricted to
specialty applications required by some users, e.g. manually comparing
the expression levels among different genes or features.
The following computes the sample-wise Spearman correlation coefficients from
the rlog
transformed expression values generated with the
DESeq2
package. After transformation to a distance matrix,
hierarchical clustering is performed with the hclust
function and
the result is plotted as a dendrogram and written to a file named sample_tree.png
in the results
directory.
library(DESeq2, quietly = TRUE)
library(ape, warn.conflicts = FALSE)
countDF <- as.matrix(read.table("./results/countDFeByg.xls"))
colData <- data.frame(row.names = targets.as.df(targets(args))$SampleName,
condition = targets.as.df(targets(args))$Factor)
dds <- DESeq2::DESeqDataSetFromMatrix(countData = countDF, colData = colData,
design = ~condition)
d <- cor(assay(DESeq2::rlog(dds)), method = "spearman")
hc <- hclust(dist(1 - d))
png("results/sample_tree.pdf")
ape::plot.phylo(ape::as.phylo(hc), type = "p", edge.col = "blue",
edge.width = 2, show.node.label = TRUE, no.margin = TRUE)
dev.off()
edgeR
The analysis of differentially expressed genes (DEGs) is performed with the glm
method from the edgeR
package (Robinson, McCarthy, and Smyth 2010). The sample
comparisons used by this analysis are defined in the header lines of the
targets.txt
file starting with <CMP>
.
library(edgeR)
countDF <- read.delim("results/countDFeByg.xls", row.names = 1,
check.names = FALSE)
targets <- read.delim("targets.txt", comment = "#")
cmp <- readComp(file = "targets.txt", format = "matrix", delim = "-")
edgeDF <- run_edgeR(countDF = countDF, targets = targets, cmp = cmp[[1]],
independent = FALSE, mdsplot = "")
Add functional gene descriptions, here from biomaRt
.
library("biomaRt")
m <- useMart("plants_mart", dataset = "athaliana_eg_gene", host = "plants.ensembl.org")
desc <- getBM(attributes = c("tair_locus", "description"), mart = m)
desc <- desc[!duplicated(desc[, 1]), ]
descv <- as.character(desc[, 2])
names(descv) <- as.character(desc[, 1])
edgeDF <- data.frame(edgeDF, Desc = descv[rownames(edgeDF)],
check.names = FALSE)
write.table(edgeDF, "./results/edgeRglm_allcomp.xls", quote = FALSE,
sep = "\t", col.names = NA)
Filter and plot DEG results for up and down regulated genes. The definition of
up
and down
is given in the corresponding help file. To
open it, type ?filterDEGs
in the R console.
edgeDF <- read.delim("results/edgeRglm_allcomp.xls", row.names = 1,
check.names = FALSE)
png("./results/DEGcounts.png", height = 10, width = 10, units = "in",
res = 72)
DEG_list <- filterDEGs(degDF = edgeDF, filter = c(Fold = 2, FDR = 20))
dev.off()
write.table(DEG_list$Summary, "./results/DEGcounts.xls", quote = FALSE,
sep = "\t", row.names = FALSE)
The function overLapper
can compute Venn intersects for large
numbers of sample sets (up to 20 or more) and vennPlot
can plot 2-5
way Venn diagrams. A useful feature is the possiblity to combine the counts
from several Venn comparisons with the same number of sample sets in a single
Venn diagram (here for 4 up and down DEG sets).
vennsetup <- overLapper(DEG_list$Up[6:9], type = "vennsets")
vennsetdown <- overLapper(DEG_list$Down[6:9], type = "vennsets")
png("results/vennplot.png")
vennPlot(list(vennsetup, vennsetdown), mymain = "", mysub = "",
colmode = 2, ccol = c("blue", "red"))
dev.off()
The following shows how to obtain gene-to-GO mappings from biomaRt
(here for A. thaliana) and how to organize them for the downstream GO
term enrichment analysis. Alternatively, the gene-to-GO mappings can be
obtained for many organisms from Bioconductor’s *.db
genome
annotation packages or GO annotation files provided by various genome
databases. For each annotation this relatively slow preprocessing step needs to
be performed only once. Subsequently, the preprocessed data can be loaded with
the load
function as shown in the next subsection.
library("biomaRt")
listMarts() # To choose BioMart database
listMarts(host = "plants.ensembl.org")
m <- useMart("plants_mart", host = "plants.ensembl.org")
listDatasets(m)
m <- useMart("plants_mart", dataset = "athaliana_eg_gene", host = "plants.ensembl.org")
listAttributes(m) # Choose data types you want to download
go <- getBM(attributes = c("go_id", "tair_locus", "namespace_1003"),
mart = m)
go <- go[go[, 3] != "", ]
go[, 3] <- as.character(go[, 3])
go[go[, 3] == "molecular_function", 3] <- "F"
go[go[, 3] == "biological_process", 3] <- "P"
go[go[, 3] == "cellular_component", 3] <- "C"
go[1:4, ]
dir.create("./data/GO")
write.table(go, "data/GO/GOannotationsBiomart_mod.txt", quote = FALSE,
row.names = FALSE, col.names = FALSE, sep = "\t")
catdb <- makeCATdb(myfile = "data/GO/GOannotationsBiomart_mod.txt",
lib = NULL, org = "", colno = c(1, 2, 3), idconv = NULL)
save(catdb, file = "data/GO/catdb.RData")
Apply the enrichment analysis to the DEG sets obtained the above differential
expression analysis. Note, in the following example the FDR
filter is set
here to an unreasonably high value, simply because of the small size of the toy
data set used in this vignette. Batch enrichment analysis of many gene sets is
performed with the function. When method=all
, it returns all GO terms passing
the p-value cutoff specified under the cutoff
arguments. When method=slim
,
it returns only the GO terms specified under the myslimv
argument. The given
example shows how a GO slim vector for a specific organism can be obtained from
BioMart.
library("biomaRt")
library(BBmisc) # Defines suppressAll()
load("data/GO/catdb.RData")
DEG_list <- filterDEGs(degDF = edgeDF, filter = c(Fold = 2, FDR = 50),
plot = FALSE)
up_down <- DEG_list$UporDown
names(up_down) <- paste(names(up_down), "_up_down", sep = "")
up <- DEG_list$Up
names(up) <- paste(names(up), "_up", sep = "")
down <- DEG_list$Down
names(down) <- paste(names(down), "_down", sep = "")
DEGlist <- c(up_down, up, down)
DEGlist <- DEGlist[sapply(DEGlist, length) > 0]
BatchResult <- GOCluster_Report(catdb = catdb, setlist = DEGlist,
method = "all", id_type = "gene", CLSZ = 2, cutoff = 0.9,
gocats = c("MF", "BP", "CC"), recordSpecGO = NULL)
library("biomaRt")
m <- useMart("plants_mart", dataset = "athaliana_eg_gene", host = "plants.ensembl.org")
goslimvec <- as.character(getBM(attributes = c("goslim_goa_accession"),
mart = m)[, 1])
BatchResultslim <- GOCluster_Report(catdb = catdb, setlist = DEGlist,
method = "slim", id_type = "gene", myslimv = goslimvec, CLSZ = 10,
cutoff = 0.01, gocats = c("MF", "BP", "CC"), recordSpecGO = NULL)
The data.frame
generated by GOCluster
can be plotted with the goBarplot
function. Because of the
variable size of the sample sets, it may not always be desirable to show
the results from different DEG sets in the same bar plot. Plotting
single sample sets is achieved by subsetting the input data frame as
shown in the first line of the following example.
gos <- BatchResultslim[grep("M6-V6_up_down", BatchResultslim$CLID),
]
gos <- BatchResultslim
png("./results/GOslimbarplotMF.png", height = 12, width = 12,
units = "in", res = 72)
goBarplot(gos, gocat = "MF")
dev.off()
goBarplot(gos, gocat = "BP")
goBarplot(gos, gocat = "CC")
Combined with mRNA-Seq data, Ribo-Seq or polyRibo-Seq experiments can be used
to study changes in translational efficiencies of genes and/or transcripts for
different treatments. For test purposes the following generates a small test
data set from the sample data used in this vignette, where two types of RNA
samples (assays
) are considered: polyribosomal mRNA (Ribo
)
and total mRNA (mRNA
). In addition, there are two treatments
(conditions
): M1
and A1
.
library(DESeq2)
countDFeBygpath <- system.file("extdata", "countDFeByg.xls",
package = "systemPipeR")
countDFeByg <- read.delim(countDFeBygpath, row.names = 1)
coldata <- DataFrame(assay = factor(rep(c("Ribo", "mRNA"), each = 4)),
condition = factor(rep(as.character(targets.as.df(targets(args))$Factor[1:4]),
2)), row.names = as.character(targets.as.df(targets(args))$SampleName)[1:8])
coldata
Differences in translational efficiencies can be calculated by ratios of ratios for the two conditions:
\[(Ribo\_A1 / mRNA\_A1) / (Ribo\_M1 / mRNA\_M1)\]
The latter can be modeled with the DESeq2
package using the design \(\sim assay + condition + assay:condition\), where the interaction term \(assay:condition\) represents the ratio of ratios. Using the likelihood ratio test of DESeq2
, which removes the interaction term in the reduced model, one can test whether the translational efficiency (ribosome loading) is different in condition A1
than in M1
.
dds <- DESeq2::DESeqDataSetFromMatrix(countData = as.matrix(countDFeByg[,
rownames(coldata)]), colData = coldata, design = ~assay +
condition + assay:condition)
# model.matrix(~ assay + condition + assay:condition,
# coldata) # Corresponding design matrix
dds <- DESeq2::DESeq(dds, test = "LRT", reduced = ~assay + condition)
res <- DESeq2::results(dds)
head(res[order(res$padj), ], 4)
# write.table(res, file='transleff.xls', quote=FALSE,
# col.names = NA, sep='\t')
The following example performs hierarchical clustering on the rlog
transformed expression matrix subsetted by the DEGs identified in the
above differential expression analysis. It uses a Pearson correlation-based distance measure and complete linkage for cluster joining.
library(pheatmap)
geneids <- unique(as.character(unlist(DEG_list[[1]])))
y <- assay(DESeq2::rlog(dds))[geneids, ]
png("heatmap1.png")
pheatmap(y, scale = "row", clustering_distance_rows = "correlation",
clustering_distance_cols = "correlation")
dev.off()
rmarkdown::render("systemPipeRIBOseq.Rmd", "html_document")
rmarkdown::render("systemPipeRIBOseq.Rmd", "pdf_document")
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices
## [6] utils datasets methods base
##
## other attached packages:
## [1] systemPipeR_1.20.0 ShortRead_1.44.0
## [3] GenomicAlignments_1.22.1 SummarizedExperiment_1.16.0
## [5] DelayedArray_0.12.0 matrixStats_0.55.0
## [7] Biobase_2.46.0 BiocParallel_1.20.0
## [9] Rsamtools_2.2.1 Biostrings_2.54.0
## [11] XVector_0.26.0 GenomicRanges_1.38.0
## [13] GenomeInfoDb_1.22.0 IRanges_2.20.1
## [15] S4Vectors_0.24.0 BiocGenerics_0.32.0
## [17] BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] Category_2.52.1 bitops_1.0-6
## [3] bit64_0.9-7 RColorBrewer_1.1-2
## [5] progress_1.2.2 httr_1.4.1
## [7] Rgraphviz_2.30.0 tools_3.6.1
## [9] backports_1.1.5 R6_2.4.1
## [11] DBI_1.0.0 lazyeval_0.2.2
## [13] colorspace_1.4-1 withr_2.1.2
## [15] tidyselect_0.2.5 prettyunits_1.0.2
## [17] bit_1.1-14 curl_4.2
## [19] compiler_3.6.1 graph_1.64.0
## [21] formatR_1.7 rtracklayer_1.46.0
## [23] bookdown_0.16 checkmate_1.9.4
## [25] scales_1.1.0 genefilter_1.68.0
## [27] RBGL_1.62.1 askpass_1.1
## [29] rappdirs_0.3.1 stringr_1.4.0
## [31] digest_0.6.23 rmarkdown_1.17
## [33] AnnotationForge_1.28.0 pkgconfig_2.0.3
## [35] htmltools_0.4.0 BSgenome_1.54.0
## [37] dbplyr_1.4.2 limma_3.42.0
## [39] rlang_0.4.2 RSQLite_2.1.2
## [41] GOstats_2.52.0 hwriter_1.3.2
## [43] dplyr_0.8.3 VariantAnnotation_1.32.0
## [45] RCurl_1.95-4.12 magrittr_1.5
## [47] GO.db_3.10.0 GenomeInfoDbData_1.2.2
## [49] Matrix_1.2-17 Rcpp_1.0.3
## [51] munsell_0.5.0 lifecycle_0.1.0
## [53] stringi_1.4.3 yaml_2.2.0
## [55] edgeR_3.28.0 debugme_1.1.0
## [57] zlibbioc_1.32.0 BiocFileCache_1.10.2
## [59] grid_3.6.1 blob_1.2.0
## [61] crayon_1.3.4 lattice_0.20-38
## [63] splines_3.6.1 GenomicFeatures_1.38.0
## [65] annotate_1.64.0 hms_0.5.2
## [67] batchtools_0.9.11 locfit_1.5-9.1
## [69] zeallot_0.1.0 knitr_1.26
## [71] pillar_1.4.2 rjson_0.2.20
## [73] base64url_1.4 codetools_0.2-16
## [75] biomaRt_2.42.0 XML_3.98-1.20
## [77] glue_1.3.1 evaluate_0.14
## [79] latticeExtra_0.6-28 data.table_1.12.6
## [81] BiocManager_1.30.10 vctrs_0.2.0
## [83] gtable_0.3.0 openssl_1.4.1
## [85] purrr_0.3.3 assertthat_0.2.1
## [87] ggplot2_3.2.1 xfun_0.11
## [89] xtable_1.8-4 survival_3.1-7
## [91] pheatmap_1.0.12 tibble_2.1.3
## [93] AnnotationDbi_1.48.0 memoise_1.1.0
## [95] brew_1.0-6 GSEABase_1.48.0
This research was funded by National Science Foundation Grants IOS-0750811 and MCB-1021969, and a Marie Curie European Economic Community Fellowship PIOF-GA-2012-327954.
Aspden, Julie L, Ying Chen Eyre-Walker, Rose J Phillips, Unum Amin, Muhammad Ali S Mumtaz, Michele Brocard, and Juan-Pablo Couso. 2014. “Extensive Translation of Small Open Reading Frames Revealed by Poly-Ribo-Seq.” Elife 3 (21~aug):e03528. https://doi.org/10.7554/eLife.03528.
Girke, Thomas. 2014. “systemPipeR: NGS Workflow and Report Generation Environment.” UC Riverside. https://github.com/tgirke/systemPipeR.
Ingolia, N T, S Ghaemmaghami, J R Newman, and J S Weissman. 2009. “Genome-Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling.” Science 324 (5924):218–23. https://doi.org/10.1016/j.ymeth.2009.03.016.
Ingolia, N T, L F Lareau, and J S Weissman. 2011. “Ribosome Profiling of Mouse Embryonic Stem Cells Reveals the Complexity and Dynamics of Mammalian Proteomes.” Cell 147 (4):789–802. https://doi.org/10.1016/j.cell.2011.10.002.
Juntawong, Piyada, Thomas Girke, Jérémie Bazin, and Julia Bailey-Serres. 2014. “Translational Dynamics Revealed by Genome-Wide Profiling of Ribosome Footprints in Arabidopsis.” Proc. Natl. Acad. Sci. U. S. A. 111 (1):E203–12. https://doi.org/10.1073/pnas.1317811111.
Juntawong, Piyada, Maureen Hummel, Jeremie Bazin, and Julia Bailey-Serres. 2015. “Ribosome Profiling: A Tool for Quantitative Evaluation of Dynamics in mRNA Translation.” Methods Mol. Biol. 1284:139–73. https://doi.org/10.1007/978-1-4939-2444-8\_7.
Kim, Daehwan, Ben Langmead, and Steven L Salzberg. 2015. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nat. Methods 12 (4):357–60.
Kim, Daehwan, Geo Pertea, Cole Trapnell, Harold Pimentel, Ryan Kelley, and Steven L Salzberg. 2013. “TopHat2: Accurate Alignment of Transcriptomes in the Presence of Insertions, Deletions and Gene Fusions.” Genome Biol. 14 (4):R36. https://doi.org/10.1186/gb-2013-14-4-r36.
Langmead, Ben, and Steven L Salzberg. 2012. “Fast Gapped-Read Alignment with Bowtie 2.” Nat. Methods 9 (4). Nature Publishing Group:357–59. https://doi.org/10.1038/nmeth.1923.
Lawrence, Michael, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T Morgan, and Vincent J Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Comput. Biol. 9 (8):e1003118. https://doi.org/10.1371/journal.pcbi.1003118.
Robinson, M D, D J McCarthy, and G K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1):139–40. https://doi.org/10.1093/bioinformatics/btp616.