Contents

1 Sample inference from amplicon sequencing data with dada2

Benjamin Callahan, Joey McMurdie, Susan Holmes

Statistics Department, Stanford University

Stanford, CA 94305, USA

benjaming.j.callahan@gmail.com

DADA2 Home Page

2 Introduction

The investigation of environmental microbial communities and microbiomes has been driven in significant part by the recent widespread adoption of amplicon sequencing. In amplicon sequencing a particular genetic locus is amplified from DNA extracted from the community of interest, and then sequenced on a next-generation sequencing platform. This technique removes the need to culture microbes in order to detect their presence, and cost-effectively provides a deep census of a microbial community.

However, the process of amplicon sequencing introduces errors into the DNA sequences being analyzed, and these errors severely complicate the interpretation of the results. DADA2 implements a novel algorithm that models the errors introduced during amplicon sequencing, and uses that error model to infer the true sample composition. DADA2 takes the place of the ubiquitous “OTU-picking” step in amplicon sequencing workflows. As demonstrated in our preprint and in further benchmarking, the DADA2 method provides both better sensitivity and specificity than OTU methods: DADA2 detect real biological variation missed by OTU methods while outputting fewer spurious sequences.

3 Overview of the dada2 pipeline

The starting point for the dada2 pipeline is a set of demultiplexed fastq files corresponding to the samples in your amplicon sequencing study. That is, dada2 expects there to be an individual fastq file for each sample (or two fastq files, one forward and one reverse, for each sample). Demultiplexing is often performed at the sequencing center, but if that has not been done there are a variety of tools do accomplish this, including the popular QIIME python scripts split_libraries_fastq.py followed by split_sequence_file_on_sample_ids.py, and the utility idemp, among others.

Once demultiplexed fastq files are in hand, the dada2 pipeline proceeds as follows:

  1. Filter and Trim: fastqFilter() or fastqPairedFilter()
  2. Dereplicate: derepFastq()
  3. Infer sample composition: dada()
  4. Merge paired reads: mergePairs()
  5. Make sequence table: makeSequenceTable()
  6. Remove chimeras: isBimeraDenovo() or removeBimeraDenovo()

The output of the dada2 pipeline is a sample-by-sequence matrix, with each entry corresponding to the number of times that inferred sample sequence was observed in that sample. This table is analogous to the common OTU table, except at higher resolution (exact sample sequences rather than 97% OTUs).

We’ll now go through that pipeline on a highly simplified dataset of just one paired-end sample (we’ll add a second later).

4 Filter and Trim

We’ll start by getting the filenames of our example paired-end fastq files. Usually you will define these filenames directly, or read them out of a directory, but for this tutorial we’re using files included with the package, which we can identify via a particular function call:

library(dada2); packageVersion("dada2")
## [1] '0.99.9'
fnF1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2")
fnR1 <- system.file("extdata", "sam1R.fastq.gz", package="dada2")

filtF1 <- tempfile(fileext=".fastq.gz")
filtR1 <- tempfile(fileext=".fastq.gz")

Note that the dada2 package “speaks” the gzip format natively, all fastq files can remain in the space-saving gzip format throughout.

Now that we have the filenames, we’re going to inspect the quality profile of our data:

plotQualityProfile(fnF1) # Forward

plotQualityProfile(fnR1) # Reverse

After inspecting the quality profiles, it is clear that the reverse read quality drops off more severely than in the forward read. Thus we are going to trim the forward reads at position 240, and the reverse reads at position 200.

Filtering is an important step when dealing with sequence data, as low-quality sequences can contain unexpected and misleading errors. Trimming is also usually advised, as Illumina sequencing quality tends to drop off at the end of reads, and the initial nucleotides can also be problematic due to calibration issues:

fastqPairedFilter(c(fnF1, fnR1), fout=c(filtF1, filtR1), 
                  trimLeft=10, truncLen=c(240, 200), 
                  maxN=0, maxEE=2,
                  compress=TRUE, verbose=TRUE)
## Read in 1500 paired-sequences, output 890 filtered paired-sequences.

The fastqPairedFilter(...) function filters the forward and reverse reads jointly, outputting only those pairs of reads that both pass the filter. In this function call we did four things: We removed the first trimLeft=10 nucleotides of each read. We truncated the forward and reverse reads at truncLen=c(240, 200) nucleotides respectively. We filtered out all reads with more than maxN=0 ambiguous nucleotides. And we filtered out all reads with more than two expected errors. The filtered output files were stored as gzipped fastq files (compress=TRUE).

This represents a fairly standard set of filtering/trimming parameters. However, it is always worth evaluating whether the filtering and trimming parameters you are using are appropriate for your data. One size does not fit all!

An important consideration: This filtering assumes that the input pairs of forward/reverse reads were consistent with each other. That is, it assumes there was one forward read for every reverse read (and vice-versa) and that the ordering was the same in both fastq files. If this isn’t the case, the matchIDs argument of fastqPairedFilter should be explored.

5 Dereplicate

The next thing we want to do is “dereplicate” the filtered fastq files. During dereplication, we condense the data by collapsing together all reads that encode the same sequence, which significantly reduces later computation times:

derepF1 <- derepFastq(filtF1, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: /var/folders/95/v2vj9x4s5fqclldqlh1v57v40000gn/T//RtmpyCbcUs/file276d4749016.fastq.gz
## Encountered 332 unique sequences from 890 total sequences read.
derepR1 <- derepFastq(filtR1, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: /var/folders/95/v2vj9x4s5fqclldqlh1v57v40000gn/T//RtmpyCbcUs/file276d9f96a3.fastq.gz
## Encountered 485 unique sequences from 890 total sequences read.

Dereplication is a common step in almost all modern sample-inference (or OTU-picking) pipelines, but a unique feature of derepFastq is that it maintains a summary of the quality information for each dereplicated sequence in $quals.

6 Infer sample composition

The core method of the dada2 package is at the sample inference stage. The dada(...) function implements the algorithm described in our preprint, and is simultaneously more sensitive and more specific than any OTU algorithm we have ever tested.

The dada algorithm depends on a parametric model of the errors introduced by PCR amplification and sequencing. Those error parameters typically vary between sequencing runs and PCR protocols, so our method provides a way to estimate those parameters from the data itself. We recommend using this self-consistent estimation on at least a subset of your data for accurate sample inference.

Here we call the dada(...) function in selfConsist=TRUE mode, which causes it to alternate between inferring the sample composition and estimating the error rate parameters until convergence is reached:

dadaF1 <- dada(derepF1, err=inflateErr(tperr1,3), selfConsist=TRUE)
## Sample 1 - 890 reads in 332 unique sequences.
##    selfConsist step 2 
##    selfConsist step 3 
## 
## 
## Convergence after  3  rounds.
dadaR1 <- dada(derepR1, err=inflateErr(tperr1,3), selfConsist=TRUE)
## Sample 1 - 890 reads in 485 unique sequences.
##    selfConsist step 2 
##    selfConsist step 3 
## 
## 
## Convergence after  3  rounds.
print(dadaF1)
## dada-class: object describing DADA2 denoising results
## 8 sample sequences were inferred from 332 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, BAND_SIZE = 16, USE_QUALS = TRUE

The dada(...) algorithm inferred 8 sample sequences in the forward reads. As mentioned above, a set of error rate parameters is required, which err=inflateErr(tperr1, 3) provides here. However, in selfConsist=TRUE mode, this is just an initial guess, and the algorithm continues until it converges on a consistent estimate of the error rates.

7 Merge forward/reverse reads

We’ve inferred the sample sequences in the forward and reverse reads independently. Now it’s time to merge those inferred sequences together, throwing out those pairs of reads that don’t match:

merger1 <- mergePairs(dadaF1, derepF1, dadaR1, derepR1, verbose=TRUE)
## 841 paired-reads (in 7 unique pairings) successfully merged out of 890 (in 9 pairings) input.

The mergePairs(...) function returns a data.frame corresponding to each successfully merged unique sequence. The $forward and $reverse columns record which forward and reverse sequence contributed to that merged sequence. Thus, the second command removes all merged sequences which are made up of a chimeric forward or reverse sequence. At the end we have a data.frame of merged, error-free, non-chimeric sample sequences!

8 Remove chimeras

The dada(...) algorithm models and removes substitution errors, but chimeras are another importance source of spurious sequences in amplicon sequencing. Chimeras are formed during PCR amplification, when one sequence is incompletely amplified, and in the next amplication step is completed by copying a different sequence. The result is a sequencing read which is half one sample sequence, and half the other.

We identify those sequence using the isBimeraDenovo(...) function in the dada2 pipeline:

bim1 <- isBimeraDenovo(merger1, verbose=TRUE)
## Identified 0 bimeras out of 7 input sequences.
merger1.nochim <- merger1[!bim1,]

The return value of isBimeraDenovo(...) is a logical vector with a TRUE at each position in which the sequence is found to be consistently explained as a chimera produced by more abundant parent sequences. We used that vector to remove those chimeric sequences from the merger1 data.frame.

9 A second sample

In order to show an example of making a sequence table, and to reiterate the workflow outlined above, we now process a second sample:

# Assign filenames
fnF2 <- system.file("extdata", "sam2F.fastq.gz", package="dada2")
fnR2 <- system.file("extdata", "sam2R.fastq.gz", package="dada2")
filtF2 <- tempfile(fileext=".fastq.gz")
filtR2 <- tempfile(fileext=".fastq.gz")
# Filter and Trim
fastqPairedFilter(c(fnF2, fnR2), fout=c(filtF2, filtR2), maxN=0, trimLeft=10, truncLen=c(240, 200), maxEE=2, compress=TRUE, verbose=TRUE)
## Read in 1500 paired-sequences, output 858 filtered paired-sequences.
# Dereplicate
derepF2 <- derepFastq(filtF2, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: /var/folders/95/v2vj9x4s5fqclldqlh1v57v40000gn/T//RtmpyCbcUs/file276d3cde877f.fastq.gz
## Encountered 305 unique sequences from 858 total sequences read.
derepR2 <- derepFastq(filtR2, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: /var/folders/95/v2vj9x4s5fqclldqlh1v57v40000gn/T//RtmpyCbcUs/file276d3591c211.fastq.gz
## Encountered 448 unique sequences from 858 total sequences read.
# Infer sample composition
dadaF2 <- dada(derepF2, err=inflateErr(tperr1,3), selfConsist=TRUE)
## Sample 1 - 858 reads in 305 unique sequences.
##    selfConsist step 2 
## 
## 
## Convergence after  2  rounds.
dadaR2 <- dada(derepR2, err=inflateErr(tperr1,3), selfConsist=TRUE)
## Sample 1 - 858 reads in 448 unique sequences.
##    selfConsist step 2 
##    selfConsist step 3 
## 
## 
## Convergence after  3  rounds.
# Merge
merger2 <- mergePairs(dadaF2, derepF2, dadaR2, derepR2, verbose=TRUE)
## 785 paired-reads (in 6 unique pairings) successfully merged out of 858 (in 8 pairings) input.

With that second sample processed, we can go ahead and create a sequence table.

10 Create sequence table

Finally we want to combine the inferred samples into one unified table. For that purpose we use makeSequenceTable:

seqtab <- makeSequenceTable(list(merger1, merger2))
seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=TRUE)
## Identified 0 bimeras out of 7 input sequences.
dim(seqtab.nochim)
## [1] 2 7

This is the final product of the dada2 pipeline, a matrix in which each row corresponds to a processed sample, and each column corresponds to an non-chimeric inferred sample sequence (a more precise analogue to the common “OTU table”). From here we recommend proceeding forward with our friend the phyloseq package for further analysis…