The basecallQC package provides tools to work with input and output files from Illumina basecalling and demultiplexing software, bcl2fastq (versions >= 2.1.7)
basecallQC 1.0.1
The bcl2Fastq software converts basecalls from Illumina High Throughput Sequencing (HTS) machines to fastQ files containing sequence information ready for further analysis.
The conversion of basecalls using the bcl2Fastq software requires multiple input files and parameters which allow the software to correctly basecall as well as demultiplex data into relevant samples by any index information contained within the data.
For detailed information on the latest version of bcl2Fastq software, its’ installation, its’ required input files and generated output files visit the Illumina software’s website here.
The bcl2Fastq software converts basecalls generated by Illumina’s Real Time Analysis (RTA) software to fastQ files. The output from RTA is presented in the directory structure below alongside the required input files runParameters.xml and a user generated SampleSheet.csv.
runParameters.xml - The runParameters.xml is generated by the RTA software during basecalling. This XML file contains information including RTA version, HTS machine used, specification of reads and indexes and number of cycles per read/index.
SampleSheet.csv - A user generated sample sheet , by default to be named SampleSheet.csv, may be present at the top level of directory structure to allow for specification of indexes to Sample IDs/Names. For bcl2Fastq versions >= 2.1.7, this SampleSheet.csv must contain set column names defining samples and their indexes ( Lane, Sample_ID, Sample_Name, Sample_Project, Index, Index2). For more details of sample sheet generation see the Illumina manual.
Alongside generated fastQ files from basecalls, bcl2Fastq produces several output files containing metrics describing the sequencing run. The ConversionStats.xml and DemultiplexingStats.xml files are provided under the Stats folder in the user specified output directory.
ConversionStats.xml - This XML contains information on basecalling per tile including the Cluster count, Yield, YieldQ30 and QualityScoreSum.
DemultiplexingStats.xml - This XML contains information on the demultiplexing of samples including the total, perfectmatch and mismatched barcode counts.
The basecallQC package provides a set of tools to streamline basecalling and demultiplexing using the Illumina bcl2fastq software (versions >= 2.1.7).
The basecallQC package includes functions to:-
The Illumina bcl2fastq program needs a sample sheet to dictate sample names/IDs, indexes and parameters for basecalling and demultiplexing.
The construction of sample sheets for basecalling and demultiplexing requires specific column names,valid Sample names/IDs and correct indexes. The complexity of working with Illumina sample sheets is compounded by the use of different sample sheet specifications before and after version 1.8.9 of the bcl2fastq program.
The basecallQC package uses the validateBCLSheet() function to both “clean” sample sheets to have valid column and sample names as well as to update sample sheets for versions <= 1.8.9 to be compatable with version >= 2.1.7
To update or clean a sample sheet, first the parameters for the Run must be captured using the BCL2FastQparams() function. The resulting BCL2FastQparams object contains information on the Run parsed from the runParameters.xml file such as the actual index length (often required to correct the sample sheet specified lengths).
Once the BCL2FastQparams object has been created, a sample sheet file name can be passed to validateBCLSheet() function alongside the BCL2FastQparams object to create a valid sample sheet data.frame, which may then be written to file.
In this first example we use the validateBCLSheet function to clean a sample sheet to be valid with bcl2Fastq versions >= 2.1.7.
library(basecallQC)
fileLocations <- system.file("extdata",package="basecallQC")
config <- dir(fileLocations,pattern="config.ini",full.names=TRUE)
fileLocations <- system.file(file.path("extdata","testSampleSheets"),package="basecallQC")
runXML <- dir(fileLocations,pattern="runParameters.xml",full.names=TRUE)
sampleSheet <- dir(fileLocations,pattern="*\\.csv",full.names=TRUE)
read.delim(sampleSheet[1],sep=",",header = TRUE,comment.char = "[")
## Sample_Project Lane Sample_ID Name index index2
## 1 EMP 1 A 1 A 1 CGATGT
## 2 EMP 1 B 1 B 1 TGACCA
## 3 EMP 1 C 1 C 1 ACAGTG
## 4 CC 6 08.HDOX37 08.HDOX37 AAGGAGTA CTCTCTAC
## 5 CC 6 08.HDOX39 08.HDOX39 AAGGAGTA GCTACGCT
## 6 CC 6 08.HDOXPC 08.HDOXPC AAGGAGTA AAGAGGCA
## 7 CT 8 PS:N4 PS:N4 CCGTCCCG
## 8 CT 8 CS*N1 CS*N1 GTCCGCAC
## 9 CT 8 CSN3 CSN3 GTGAAACG
This sample sheet contains invalid Sample names e.g. A 1 (whitespace) and 08.HDOX37 (starts with numeric) as well as invalid column headers such as “Name”.
bcl2fastqparams <- BCL2FastQparams(runXML,config,runDir=getwd(),verbose=FALSE)
cleanedSampleSheet <- validateBCLSheet(sampleSheet[1],param=bcl2fastqparams)
head(cleanedSampleSheet)
## # A tibble: 6 x 6
## Sample_Project Lane Sample_ID Sample_Name Index Index2
## <chr> <int> <chr> <chr> <chr> <chr>
## 1 EMP 1 A_1 A_1 CGATGT
## 2 EMP 1 B_1 B_1 TGACCA
## 3 EMP 1 C_1 C_1 ACAGTG
## 4 CC 6 Sample_08_HDOX37 Sample_08_HDOX37 AAGGAGTA CTCTCTAC
## 5 CC 6 Sample_08_HDOX39 Sample_08_HDOX39 AAGGAGTA GCTACGCT
## 6 CC 6 Sample_08_HDOXPC Sample_08_HDOXPC AAGGAGTA AAGAGGCA
In the cleaned samplesheet the column headers have been corrected and invalid Sample IDs converted to valid IDs e.g. 08.HDOX37 has been converted to Sample_HDOX37
Updating a sample sheet from those used with versions <= 1.8.9 to one compatable with versions >= 2.1.7 is done following the same procedure as cleaning a sample sheet.
library(basecallQC)
fileLocations <- system.file("extdata",package="basecallQC")
config <- dir(fileLocations,pattern="config.ini",full.names=TRUE)
fileLocations <- system.file(file.path("extdata","testSampleSheets"),package="basecallQC")
runXML <- dir(fileLocations,pattern="runParameters.xml",full.names=TRUE)
sampleSheet <- dir(fileLocations,pattern="*\\.csv",full.names=TRUE)
read.delim(sampleSheet[2],sep=",",header = TRUE,comment.char = "[")
## FCID Lane SampleID SampleRef Index Description Control
## 1 CADMGANXX 1 A1 A1 CGATGT NA NA
## 2 CADMGANXX 1 B1 B1 TGACCA NA NA
## 3 CADMGANXX 1 C1 C1 ACAGTG NA NA
## 4 CADMGANXX 6 DOX37_08H DOX37_08H CTCTCTAC-AAGGAGTA NA NA
## 5 CADMGANXX 6 DOX39_08H DOX39_08H GCTACGCT-AAGGAGTA NA NA
## 6 CADMGANXX 6 DOXPC_08H DOXPC_08H AAGGAGTA-AAGAGGCA NA NA
## 7 CADMGANXX 8 PSN4 PSN4 CCGTCCCG NA NA
## 8 CADMGANXX 8 CSN1 CSN1 GTCCGCAC NA NA
## 9 CADMGANXX 8 CSN3 CSN3 GTGAAACG NA NA
## Recipe Operator Project
## 1 NA NA EMP
## 2 NA NA EMP
## 3 NA NA EMP
## 4 NA NA CC
## 5 NA NA CC
## 6 NA NA CC
## 7 NA NA CT
## 8 NA NA CT
## 9 NA NA CT
The sample sheet from versions <= 1.8.9 contains many headers which will need to be updated to those used in versions >= 2.1.7 as well as unrequired headers which are maintained as metadata.
The specification of indexes in sample sheets from versions <= 1.8.9 were in a single column with indexes separated by a hyphon. This specification has changed to the explicit inclusion of an Index2 column when using dual indexes, so this to must be updated too.
bcl2fastqparams <- BCL2FastQparams(runXML,config,runDir=getwd(),verbose=FALSE)
cleanedSampleSheet <- validateBCLSheet(sampleSheet[2],param=bcl2fastqparams)
head(cleanedSampleSheet)
## # A tibble: 6 x 11
## Sample_Project Lane Sample_ID Sample_Name Index Index2 FCID
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 EMP 1 A1 A1 CGATGT CADMGANXX
## 2 EMP 1 B1 B1 TGACCA CADMGANXX
## 3 EMP 1 C1 C1 ACAGTG CADMGANXX
## 4 CC 6 DOX37_08H DOX37_08H CTCTCTAC AAGGAGTA CADMGANXX
## 5 CC 6 DOX39_08H DOX39_08H GCTACGCT AAGGAGTA CADMGANXX
## 6 CC 6 DOXPC_08H DOXPC_08H AAGGAGTA AAGAGGCA CADMGANXX
## # ... with 4 more variables: Description <lgl>, Control <lgl>, Recipe <lgl>,
## # Operator <lgl>
In the resulting updated sample sheet we can see column names have been updated e.g. “SampleID” to “Sample_ID”, and the index column has been automatically split into Index and Index2 columns.
The basecallQC package can also provide the base masks and basecalling/demultiplexing command for bcl2fastq versions >= 2.1.7 from the cleaned sample sheet and the BCL2FastQparams object.
The createBaseMasks() function creates a data.frame of basemasks per lane from a cleaned sample sheet,as generated by the validateBCLsheet() function, and a BCL2FastQparams object.
fileLocations <- system.file("extdata",package="basecallQC")
runXML <- dir(fileLocations,pattern="runParameters.xml",full.names=TRUE)
config <- dir(fileLocations,pattern="config.ini",full.names=TRUE)
sampleSheet <- dir(fileLocations,pattern="*\\.csv",full.names=TRUE)
bcl2fastqparams <- BCL2FastQparams(runXML,config,runDir=getwd(),verbose=FALSE)
cleanedSampleSheet <- validateBCLSheet(sampleSheet,param=bcl2fastqparams)
baseMasks <- createBasemasks(cleanedSampleSheet,param=bcl2fastqparams)
baseMasks$index1Mask
## [1] "IIIIIIIIN" "IIIIIIIIN" "IIIIIIIIN" "IIIIIINNN" "IIIIIINNN" "IIIIIIIIN"
## [7] "IIIIIINNN" "IIIIIINNN"
Following the creation of a base masks data.frame, the createBCLcommand function takes the cleaned sample sheet, a BCL2FastQparams object and the base masks data.frame to create the command to be used for Illumina basecalling/demultiplexing using bcl2fastq versions >= 2.1.7.
The command is returned as a character string to allow the user to control submission of the command to best fit the user’s system.
toSubmit <- createBCLcommand(bcl2fastqparams,cleanedSampleSheet,baseMasks)
toSubmit
## [1] "/usr/local/bcl2fastq/bin/bcl2fastq --output-dir /tmp/Rtmp2lHzsd/Rbuild485c4ab5a8a7/basecallQC/vignettes/C7JE1ANXX --sample-sheet /tmp/Rtmp2lHzsd/Rbuild485c4ab5a8a7/basecallQC/vignettes/C7JE1ANXX.csv --use-bases-mask 1:YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN,IIIIIIIIN,YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN --use-bases-mask 2:YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN,IIIIIIIIN,YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN --use-bases-mask 3:YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN,IIIIIIIIN,YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN --use-bases-mask 4:YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN,IIIIIINNN,YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN --use-bases-mask 5:YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN,IIIIIINNN,YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN --use-bases-mask 6:YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN,IIIIIIIIN,YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN --use-bases-mask 7:YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN,IIIIIINNN,YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN --use-bases-mask 8:YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN,IIIIIINNN,YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN"
Alongside generating fastQ from basecalls, the bcl2fastq software (versions >= 2.1.7) produces a number of basecalling and demultiplexing metric outputs in differing file types. The basecallQC package retrieves information from basecalling and demultiplexing XML files, ConversionStats.xml and DemultiplexingStats.xml respectively.
The baseCallMetrics() function accepts the BCL2FastQparams object for the Run and retrieve metrics from the ConversionStats.xml file.
bclMetrics <- baseCallMetrics(bcl2fastqparams)
head(bclMetrics[[1]])
## Project Sample Lane Tile Filter ReadNumber Yield Yield30
## 1 default Undetermined 1 2201 Raw Read_1 2062800 1241066
## 2 default Undetermined 1 2201 Raw Read_2 2062800 906623
## 3 default Undetermined 1 2201 Pf Read_1 1220300 1063657
## 4 default Undetermined 1 2201 Pf Read_2 1220300 801579
## 5 default Undetermined 1 1108 Raw Read_1 1705200 977058
## 6 default Undetermined 1 1108 Raw Read_2 1705200 739370
## QualityScoreSum ClusterCount
## 1 54066553 20628
## 2 41324426 20628
## 3 40915399 12203
## 4 31764167 12203
## 5 43532066 17052
## 6 34183713 17052
The baseCallMetrics() function returns a list where the first element contains the full unsummarised basecalling metrics from the ConversionStats.xml file. The second element contains these metrics summarised by Sample, Lane and Tile.
In the same way, the demultiplexMetrics() function accepts the BCL2FastQparams object for the Run and retrieve metrics from the DemultiplexingStats.xml file.
demuxMetrics <- demultiplexMetrics(bcl2fastqparams)
head(demuxMetrics[[1]])
## Project Sample Barcode BarcodeStat Lane Count
## 1 default Undetermined unknown BarcodeCount Lane1 973835
## 2 default Undetermined unknown PerfectBarcodeCount Lane1 973835
## 3 default all all BarcodeCount Lane1 973835
## 4 default all all PerfectBarcodeCount Lane1 973835
## 5 KGAK ALDIAGNCD138NEG ATTACTCG BarcodeCount Lane1 63414244
## 6 KGAK ALDIAGNCD138NEG ATTACTCG PerfectBarcodeCount Lane1 63161421
Again the demultiplexMetrics() function returns a list where the first element contains the full unsummarised demultiplexing metrics from the DemultiplexingStats.xml file. The second element contains these metrics filtered to sample information.
The lists objects generated by demultiplexMetrics() and baseCallMetrics() functions may then be used by plotting and table reporting functions shown in Section 3.
The basecallQC package provides a pipeline function, basecallQC(), to allow the user to clean/convert the samplesheet, create the bcl2fastq command and parse any basecalling/demultiplexing results. The resulting basecallQC object can then be used to produce summary tables, plots and a report.
The basecallQC() function takes a BCL2FastQparams object for the Run, a sample sheet and any Run metadata that the user wishes to attach to their experiment and returns a basecallQC object.
The resulting basecallQC object can be used with basecallQC’s plotting and reporting functions.
fileLocations <- system.file("extdata",package="basecallQC")
runXML <- dir(fileLocations,pattern="runParameters.xml",full.names=TRUE)
config <- dir(fileLocations,pattern="config.ini",full.names=TRUE)
sampleSheet <- dir(fileLocations,pattern="*\\.csv",full.names=TRUE)
outDir <- file.path(fileLocations,"Runs/161105_D00467_0205_AC9L0AANXX/C9L0AANXX/")
bcl2fastqparams <- BCL2FastQparams(runXML,config,runDir=getwd(),outDir,verbose=FALSE)
bclQC <- basecallQC(bcl2fastqparams,RunMetaData=NULL,sampleSheet)
class(bclQC)
## [1] "basecallQC"
## attr(,"package")
## [1] "basecallQC"
The basecallQC object and the lists returned from baseCallMetrics()/demultiplexMetrics() functions contain metrics on basecalling and/or demultiplexing for the Run. The objects can be used for both plotting and table reporting using the basecallQC functions seen in this section. In this following section we will demonstrate plotting from a basecallQC object.
Summary HTML tables from demultiplexing and basecalling results can be generated using the summaryDemuxTable and summaryConvStatsTable respectively. The output argument can be set to “static” or “html” to allow for tables for use in non-interactive and interactive modes.
summaryConvStatsTable(bclQC,output = "html")
summaryDemuxTable(bclQC,output = "html")
The user may also visualise the results using basecallQC’s plotting functions for basecalling and demultiplexing summary metrics.
The passFilterBar function produces a bar plot of yields per sample. The metric to plot and how to summarise data for plotting are controlled by the metricToPlot and groupBy arguments.
passFilterBar(bclQC,groupBy="Sample",metricToPlot = "Yield")
The passFilterTilePlot produces a plot of the specified metric across the Lanes and Tiles of the investigated Run.
passFilterTilePlot(bclQC,metricToPlot = "Yield")
## $PassFilter
##
## $FailFilter
The demuxBarplot function produces a bar plot of yield from demultiplexing. For demultiplexing statistics only the groupBy arguments can be used.
demuxBarplot(bclQC,groupBy="Sample")
The basecallQC package allows the user to generate a report per Illumina Run from the basecallQC object containing the most important demultiplexing and basecalling metrics using the reportBCL() function.
This report may be customised and the resulting RMD passed to the reportBCL() function’s reportRMDfile argument. This RMD will then be used in place of the standard basecallQC report.
reportBCL(bclQC)
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] basecallQC_1.0.1 yaml_2.1.14 prettydoc_0.2.0 knitr_1.16
## [5] rmarkdown_1.5 BiocStyle_2.4.0
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.6.3 reshape2_1.4.2
## [3] lattice_0.20-35 colorspace_1.3-2
## [5] htmltools_0.3.6 stats4_3.4.0
## [7] XML_3.98-1.7 rlang_0.1.1
## [9] DBI_0.6-1 BiocParallel_1.10.1
## [11] RColorBrewer_1.1-2 BiocGenerics_0.22.0
## [13] sp_1.2-4 matrixStats_0.52.2
## [15] GenomeInfoDbData_0.99.0 plyr_1.8.4
## [17] stringr_1.2.0 zlibbioc_1.22.0
## [19] Biostrings_2.44.1 munsell_0.4.3
## [21] gtable_0.2.0 hwriter_1.3.2
## [23] raster_2.5-8 htmlwidgets_0.8
## [25] evaluate_0.10 labeling_0.3
## [27] latticeExtra_0.6-28 Biobase_2.36.2
## [29] IRanges_2.10.2 GenomeInfoDb_1.12.2
## [31] parallel_3.4.0 Rcpp_0.12.11
## [33] scales_0.4.1 backports_1.1.0
## [35] DT_0.2 DelayedArray_0.2.7
## [37] S4Vectors_0.14.3 jsonlite_1.5
## [39] XVector_0.16.0 ShortRead_1.34.0
## [41] Rsamtools_1.28.0 ggplot2_2.2.1
## [43] digest_0.6.12 stringi_1.1.5
## [45] bookdown_0.4 dplyr_0.5.0
## [47] GenomicRanges_1.28.3 grid_3.4.0
## [49] rprojroot_1.2 tools_3.4.0
## [51] bitops_1.0-6 magrittr_1.5
## [53] lazyeval_0.2.0 RCurl_1.95-4.8
## [55] tibble_1.3.3 tidyr_0.6.3
## [57] Matrix_1.2-10 data.table_1.10.4
## [59] assertthat_0.2.0 R6_2.2.1
## [61] GenomicAlignments_1.12.1 compiler_3.4.0