1 Introduction

systemPipeRdata provides data analysis workflow templates compatible with the systemPipeR software package (Backman and Girke 2016). The latter is a Workflow Management System (WMS) for designing and running end-to-end analysis workflows with automated report generation for a wide range of data analysis applications. Support for running external software is provided by a command-line interface (CLI) that adopts the Common Workflow Language (CWL). How to use systemPipeR is explained in its main vignette here. The workflow templates provided by systemPipeRdata come equipped with sample data and the parameter files required to run a selected workflow. This setup simplifies learning systemPipeR, facilitates testing of workflows, and serves as a foundation for designing new workflows. The standardized directory structure (Figure 1) used by the workflow templates and their sample data is outlined in the Directory Structure section of systemPipeR's main vignette.

Figure 1: Directory structure of systemPipeR's workflows. For details, see here.

2 Getting started

2.1 Installation

The systemPipeRdata package is available at Bioconductor and can be installed from within R as follows.

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("systemPipeRdata")

2.2 Loading package and documentation

library("systemPipeRdata")  # Loads the package

library(help = "systemPipeRdata")  # Lists package info
vignette("systemPipeRdata")  # Opens vignette

3 Overview of workflow templates

An overview table of the workflow templates included in systemPipeRdata can be returned as shown below. By clicking the URLs in the last column of the workflow list, users can view the Rmd source file of a workflow, as well as the final HTML report generated after running the workflow on the provided test data. A list of the default data analysis steps included in each workflow is given here. Additional workflow templates are available on this project’s GitHub organization (for details, see below). To create an empty workflow template without any test data, users can choose the Generic template, which includes only the required directory structure and parameter files.

availableWF()

Name	Description	URL
Generic	Generic Workflow Template	Rmd, HTML
SPblast	BLAST Template	Rmd, HTML
SPcheminfo	Cheminformatics Drug Similarity Template	Rmd, HTML
SPchipseq	ChIP-Seq Workflow Template	Rmd, HTML
SPriboseq	RIBO-Seq Workflow Template	Rmd, HTML
SPrnaseq	RNA-Seq Workflow Template	Rmd, HTML
SPscrna	Basic Single-Cell Template	Rmd, HTML
SPvarseq	VAR-Seq Template	Rmd, HTML

Table 1: Workflow templates

4 Use workflow templates

4.1 Load a workflow

The example below uses the genWorkenvir function from the systemPipeRdata package to create an RNA-Seq workflow environment (selected with workflow="rnaseq") that is fully populated with a small test data set, including FASTQ files, reference genome and annotation data. The name of the resulting workflow directory can be specified with the mydirname argument. The default NULL uses the name of the chosen workflow. An error is issued if a directory of the same name and path already exists. After this, the user’s R session needs to be directed into the resulting rnaseq directory (here with setwd). The other workflow templates from the above table can be loaded the same way.

library(systemPipeRdata)
genWorkenvir(workflow = "rnaseq", mydirname = "rnaseq")
setwd("rnaseq")

On Linux and OS X systems the same can be achieved from the command-line of a terminal with the following commands.

$ Rscript -e "systemPipeRdata::genWorkenvir(workflow='rnaseq', mydirname='rnaseq')"
$ cd rnaseq
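Because genWorkenvir issues an error when the target directory already exists, a script that may be rerun can check for the directory first. The following is a minimal sketch of such a guard; the helper name create_workenvir_once is hypothetical and not part of systemPipeRdata.

```r
# Create a workflow environment only if the target directory is absent.
# 'create_workenvir_once' is a hypothetical helper, not a package function.
create_workenvir_once <- function(workflow, mydirname = workflow) {
    if (dir.exists(mydirname)) {
        message("Directory '", mydirname, "' already exists; skipping genWorkenvir().")
        return(invisible(FALSE))
    }
    systemPipeRdata::genWorkenvir(workflow = workflow, mydirname = mydirname)
    invisible(TRUE)
}

# Usage (not run here, since it writes to the current working directory):
# create_workenvir_once("rnaseq")
# setwd("rnaseq")
```

The helper returns FALSE without touching an existing directory, so previously generated results are left in place.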

4.2 Run and visualize workflow

For details on running and working with systemPipeR workflows, users should consult systemPipeR’s main vignette. The following gives only a very brief preview of how to run workflows and create scientific and technical reports.

After a workflow environment (directory) has been created and the corresponding R session directed into the resulting directory (here rnaseq), the workflow can be loaded from the included R Markdown file (Rmd, here systemPipeRNAseq.Rmd). This template provides common data analysis steps that are typical for RNA-Seq workflows. Users can add, remove, or modify workflow steps by applying these changes to the sal workflow management container directly, or by updating the Rmd file first and then updating sal accordingly.

library(systemPipeR)
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRNAseq.Rmd", verbose = FALSE)

The default analysis steps of the imported RNA-Seq workflow are listed below. Users can modify the existing steps, add new ones or remove steps as needed.

Default analysis steps in RNA-Seq Workflow

Read preprocessing
- Quality filtering (trimming)
- FASTQ quality report
Alignments: HISAT2 (or any other RNA-Seq aligner)
Alignment stats
Read counting
Sample-wise correlation analysis
Analysis of differentially expressed genes (DEGs)
GO term enrichment analysis
Gene-wise clustering

Once the workflow has been loaded into sal, it can be executed from start to finish (or partially) with the runWF command.

sal <- runWF(sal)
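For a partial run, a subset of steps can be passed to runWF through its steps argument. The wrapper below is a hypothetical convenience function (not part of systemPipeR) that runs only the first n steps; it assumes sal was created with SPRproject and importWF as shown above.

```r
# Hypothetical wrapper: run only the first n steps of a loaded workflow.
# Relies on the 'steps' argument of systemPipeR::runWF().
run_first_steps <- function(sal, n) {
    systemPipeR::runWF(sal, steps = seq_len(n))
}

# Usage (requires a loaded workflow container 'sal'):
# sal <- run_first_steps(sal, 3)
```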

Workflows can be visualized as topology graphs using the plotWF function.

plotWF(sal)

Figure 1: Topology graph of RNA-Seq workflow

Scientific and technical reports can be generated with the renderReport and renderLogs functions, respectively. Scientific reports can also be generated with the render function of the rmarkdown package. The technical reports are based on log information that systemPipeR collects during workflow runs.

# Scientific report
sal <- renderReport(sal)
rmarkdown::render("systemPipeRNAseq.Rmd", clean = TRUE, output_format = "BiocStyle::html_document")

# Technical (log) report
sal <- renderLogs(sal)

5 Additional workflow templates

The project’s GitHub Organization hosts a repository of workflow templates, containing both well-established and experimental workflows. Within the R environment, the same availableWF function mentioned earlier can be utilized to retrieve a list of the workflows in this collection.

availableWF(github = TRUE)

Additional Workflow Templates in systemPipeR GitHub Organization:
       Workflow                                     Download URL
1     SPatacseq    https://github.com/systemPipeR/SPatacseq.git
2     SPclipseq    https://github.com/systemPipeR/SPclipseq.git
3      SPdenovo    https://github.com/systemPipeR/SPdenovo.git
4         SPhic    https://github.com/systemPipeR/SPhic.git
5   SPmetatrans    https://github.com/systemPipeR/SPmetatrans.git
6   SPmethylseq    https://github.com/systemPipeR/SPmethylseq.git
7    SPmirnaseq    https://github.com/systemPipeR/SPmirnaseq.git
8 SPpolyriboseq    https://github.com/systemPipeR/SPpolyriboseq.git
9    SPscrnaseq    https://github.com/systemPipeR/SPscrnaseq.git

To download these workflow templates, users can either run the git clone command below from a terminal, or visit the GitHub page of a chosen workflow via the provided URLs, download it as a Zip file and uncompress it. Note that the following lines of code need to be run from a terminal (for example the terminal in RStudio, not the R console) on a system where the git software is installed.

$ git clone <...> # Provide under <...> URL of chosen workflow from table above.
$ cd <Workflow Name>

After a workflow template has been downloaded, one can run it the same way as outlined above.
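The clone step can also be scripted from R. The sketch below builds the clone URL from a workflow name, following the URL pattern in the listing above, and calls git through system2. Both helper functions are hypothetical and not part of systemPipeRdata; by default only a dry run is performed, so nothing is downloaded.

```r
# Build the clone URL for a template, following the URL pattern listed above.
workflow_clone_url <- function(name) {
    sprintf("https://github.com/systemPipeR/%s.git", name)
}

# Clone a template into 'dest'. With dry_run = TRUE the git arguments are
# only returned. Both helpers are hypothetical, not package functions.
clone_workflow <- function(name, dest = name, dry_run = TRUE) {
    args <- c("clone", workflow_clone_url(name), dest)
    if (!dry_run) {
        if (Sys.which("git") == "") stop("git was not found on the PATH")
        status <- system2("git", args)
        if (status != 0) stop("git clone failed with exit status ", status)
    }
    invisible(args)
}

cmd <- clone_workflow("SPatacseq")  # dry run: returns the git arguments
paste("git", paste(cmd, collapse = " "))
```

Calling clone_workflow("SPatacseq", dry_run = FALSE) would perform the actual clone into a directory named after the workflow.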

6 Useful functionalities

6.1 Create workflow templates interactively

It is possible to create a new workflow environment from RStudio. This can be done by selecting File -> New File -> R Markdown -> From Template -> systemPipeR New WorkFlow. This option creates a template workflow that has the expected directory structure (see here).

Figure 2: Selecting workflow template within RStudio.

6.2 Return paths to sample data

The paths to the sample data provided by the systemPipeRdata package can be returned with the pathList function.

pathList()[1:2]

## $targets
## [1] "/tmp/RtmpYnQAxT/Rinst22ad8f7286665a/systemPipeRdata/extdata/param/targets.txt"
## 
## $targetsPE
## [1] "/tmp/RtmpYnQAxT/Rinst22ad8f7286665a/systemPipeRdata/extdata/param/targetsPE.txt"
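The returned paths can be passed directly to import functions. As a minimal sketch, the hypothetical helper read_targets below reads a tab-delimited targets file while skipping lines that start with '#'. The small demo file is made up so that the example runs without the package; the last part applies the same helper to the shipped targets file when systemPipeRdata is installed.

```r
# Read a targets file into a data.frame, skipping '#' comment lines.
# 'read_targets' is a hypothetical helper, not a package function.
read_targets <- function(path) {
    read.delim(path, comment.char = "#", stringsAsFactors = FALSE)
}

# Tiny made-up targets file, used only so that the sketch runs anywhere
demo_file <- tempfile(fileext = ".txt")
writeLines(c("# made-up comment line",
             "FileName\tSampleName\tFactor",
             "a.fastq.gz\tS1\tctrl",
             "b.fastq.gz\tS2\ttreat"), demo_file)
demo_targets <- read_targets(demo_file)

# With the package installed, the same helper can read the shipped file
if (requireNamespace("systemPipeRdata", quietly = TRUE)) {
    targets <- read_targets(systemPipeRdata::pathList()$targets)
    print(head(targets))
}
```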

7 Analysis steps in selected workflows

The following gives an overview of the default data analysis steps used by selected workflow templates included in the systemPipeRdata package (see Table 1). The workflows hosted on this project’s GitHub Organization are not considered here.

Any of the workflows included below can be loaded by assigning their name to the workflow argument of the genWorkenvir function. The workflow names can be looked up under the ‘Name’ column of Table 1.

library(systemPipeRdata)
genWorkenvir(workflow = "...")

7.1 Generic template

This empty workflow (named new) is intended to be used as a template for creating new workflows from scratch where users can add steps by copying and pasting existing R or CL steps as needed, and populate them with their own code. In its current form, this mini workflow will export a test dataset to multiple files, compress/decompress the exported files, import them back into R, and then perform a simple statistical analysis and plot the results.

R step: export tabular data to files
CL step: compress files
CL step: uncompress files
R step: import files and plot summary statistics
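To make the four steps concrete, the following stand-alone base R sketch mirrors them outside of systemPipeR. It is not the code shipped with the template: the iris data set, the file names, and the use of gzfile connections in place of the command-line (CL) steps are assumptions chosen so that the example runs without external tools.

```r
# Stand-alone analogue of the four steps above (NOT the template's own code)
outdir <- file.path(tempdir(), "generic_demo")
dir.create(outdir, showWarnings = FALSE)

# Step 1 (R): export tabular data to one file per group
groups <- split(iris, iris$Species)
csv_files <- file.path(outdir, paste0(names(groups), ".csv"))
for (i in seq_along(groups)) {
    write.csv(groups[[i]], csv_files[i], row.names = FALSE)
}

# Step 2 (stand-in for the CL step): gzip-compress each file
gz_files <- paste0(csv_files, ".gz")
for (i in seq_along(csv_files)) {
    con <- gzfile(gz_files[i], "wb")
    writeBin(readBin(csv_files[i], "raw", n = file.size(csv_files[i])), con)
    close(con)
    unlink(csv_files[i])  # remove the uncompressed copy
}

# Step 3 (stand-in for the CL step): uncompress the files again
for (i in seq_along(gz_files)) {
    con <- gzfile(gz_files[i], "rt")
    writeLines(readLines(con), csv_files[i])
    close(con)
}

# Step 4 (R): import the files and compute summary statistics
imported <- do.call(rbind, lapply(csv_files, read.csv))
stats <- aggregate(Sepal.Length ~ Species, data = imported, FUN = mean)
if (interactive()) boxplot(Sepal.Length ~ Species, data = imported)
```

In the actual template each step is registered in the workflow container, so that systemPipeR can track, rerun and report on it.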

7.2 RNA-Seq workflow

Read preprocessing
- Quality filtering (trimming)
- FASTQ quality report
Alignments: HISAT2 (or any other RNA-Seq aligner)
Alignment stats
Read counting
Sample-wise correlation analysis
Analysis of differentially expressed genes (DEGs)
GO term enrichment analysis
Gene-wise clustering

7.3 ChIP-Seq Workflow

Read preprocessing
- Quality filtering (trimming)
- FASTQ quality report
Alignments: Bowtie2 or rsubread
Alignment stats
Peak calling: MACS2
Peak annotation with genomic context
Differential binding analysis
GO term enrichment analysis
Motif analysis

7.4 VAR-Seq Workflow

Read preprocessing
- Quality filtering (trimming)
- FASTQ quality report
Alignments: bwa or other
Variant calling: GATK, BCFtools
Variant filtering: VariantTools and VariantAnnotation
Variant annotation: VariantAnnotation
Combine results from many samples
Summary statistics of samples

7.5 Ribo-Seq Workflow

Read preprocessing
- Adaptor trimming and quality filtering
- FASTQ quality report
Alignments: HISAT2 (or any other RNA-Seq aligner)
Alignment stats
Compute read distribution across genomic features
Adding custom features to workflow (e.g. uORFs)
Genomic read coverage along transcripts
Read counting
Sample-wise correlation analysis
Analysis of differentially expressed genes (DEGs)
GO term enrichment analysis
Gene-wise clustering
Differential ribosome binding (translational efficiency)

7.6 scRNA-Seq Workflow

Import of single cell read count data
Basic stats on input data
QC of cell count data
Cell filtering
Normalization
Identify highly variable genes
Scaling
Embedding with tSNE, UMAP, and PCA
Cell clustering and marker gene classification
Cell type classification
Co-visualization of cell types and clusters

7.7 BLAST Workflow

Load query sequences
Select and prepare BLASTable databases
Run BLAST against different databases

7.8 Cheminformatics Workflow

Import small molecules stored in SDF file
Visualize small molecule structures
Create atom pair and finger print databases for structure similarity searching
Compute all-against-all structural similarities
Hierarchical clustering and PCA of structural similarities
Plot heat map

8 Version information

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils    
## [6] datasets  methods   base     
## 
## other attached packages:
##  [1] magrittr_2.0.3              systemPipeRdata_2.9.3      
##  [3] systemPipeR_2.11.6          ShortRead_1.63.0           
##  [5] GenomicAlignments_1.41.0    SummarizedExperiment_1.35.1
##  [7] Biobase_2.65.0              MatrixGenerics_1.17.0      
##  [9] matrixStats_1.3.0           BiocParallel_1.39.0        
## [11] Rsamtools_2.21.0            Biostrings_2.73.1          
## [13] XVector_0.45.0              GenomicRanges_1.57.1       
## [15] GenomeInfoDb_1.41.1         IRanges_2.39.2             
## [17] S4Vectors_0.43.2            BiocGenerics_0.51.0        
## [19] BiocStyle_2.33.1           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1        viridisLite_0.4.2      
##  [3] dplyr_1.1.4             bitops_1.0-8           
##  [5] fastmap_1.2.0           digest_0.6.36          
##  [7] lifecycle_1.0.4         pwalign_1.1.0          
##  [9] compiler_4.4.1          rlang_1.1.4            
## [11] sass_0.4.9              tools_4.4.1            
## [13] utf8_1.2.4              yaml_2.3.10            
## [15] knitr_1.48              S4Arrays_1.5.6         
## [17] htmlwidgets_1.6.4       interp_1.1-6           
## [19] DelayedArray_0.31.11    xml2_1.3.6             
## [21] RColorBrewer_1.1-3      abind_1.4-5            
## [23] hwriter_1.3.2.1         grid_4.4.1             
## [25] fansi_1.0.6             latticeExtra_0.6-30    
## [27] colorspace_2.1-1        ggplot2_3.5.1          
## [29] scales_1.3.0            cli_3.6.3              
## [31] rmarkdown_2.27          crayon_1.5.3           
## [33] generics_0.1.3          remotes_2.5.0          
## [35] rstudioapi_0.16.0       httr_1.4.7             
## [37] cachem_1.1.0            stringr_1.5.1          
## [39] zlibbioc_1.51.1         parallel_4.4.1         
## [41] formatR_1.14            BiocManager_1.30.23    
## [43] vctrs_0.6.5             Matrix_1.7-0           
## [45] jsonlite_1.8.8          bookdown_0.40          
## [47] systemfonts_1.1.0       jpeg_0.1-10            
## [49] jquerylib_0.1.4         glue_1.7.0             
## [51] codetools_0.2-20        stringi_1.8.4          
## [53] gtable_0.3.5            deldir_2.0-4           
## [55] UCSC.utils_1.1.0        munsell_0.5.1          
## [57] tibble_3.2.1            pillar_1.9.0           
## [59] htmltools_0.5.8.1       GenomeInfoDbData_1.2.12
## [61] R6_2.5.1                evaluate_0.24.0        
## [63] kableExtra_1.4.0        lattice_0.22-6         
## [65] highr_0.11              png_0.1-8              
## [67] bslib_0.8.0             Rcpp_1.0.13            
## [69] svglite_2.1.3           SparseArray_1.5.31     
## [71] xfun_0.46               pkgconfig_2.0.3

9 Funding

This project was supported by funds from the National Institutes of Health (NIH) and the National Science Foundation (NSF).

References

Backman, Tyler W. H., and Thomas Girke. 2016. “systemPipeR: NGS workflow and report generation environment.” BMC Bioinformatics 17 (1): 388. https://doi.org/10.1186/s12859-016-1241-0.

systemPipeRdata: Workflow templates and sample data

Last update: 06 August, 2024
