xcms 1.52.0
xcms
This document describes new functionality and changes to existing functionality in the xcms
package introduced during the update to version 3.
library(xcms)
library(RColorBrewer)
register(SerialParam())
The modernization of the user interface comprises new classes for data representation and new data analysis methods. In addition, the core logic for the data processing has been extracted from the old methods and put into a set of R functions, the so called core API functions (or do_
functions). These functions take standard R data structures as input and return standard R data types as result and can hence be easily included in other R packages.
The new user interface aims at simplifying and streamlining the xcms
workflow while guaranteeing data integrity and performance also for large scale metabolomics experiments. Importantly, a simplified access to the original raw data should be provided throughout the whole metabolomics data analysis workflow.
The new interface re-uses objects from the MSnbase
Bioconductor package, such as the OnDiskMSnExp
object. This object is specifically designed for large scale MS experiments as it initially reads just the scan header information from the mzML while the mz-intensity value pairs from all or from selected spectra of a file are read on demand hence minimizing the memory demand. Also, in contrast to the old xcmsRaw
object, the OnDiskMSnExp
contains information from all files of an experiment. In addition, all data normalization and adjustment methods implemented in the MSnbase
package can be directly applied to the MS data without the need to re-implement such methods in xcms
. Results from xcms
preprocessings, such as chromatographic peak detection or correspondence are stored into the new XCMSnExp
object. This object extends the OnDiskMSnExp
object and inherits thus all of its methods including raw data access.
Class and method/function names follow also a new naming convention trying tp avoid the partially confusing nomenclature of the original xcms
methods (such as the group
method to perform the correspondence of peaks across samples). To distinguish them from mass peaks, the peaks identified by the peak detection in an LS/GC-MS experiment are referred to as chromatographic peaks. The respective method to identify such peaks is hence called findChromPeaks
and the identified peaks can be accessed using the XCMSnExp
chromPeaks
method. The results from an correspondence analysis which aims to match and group chromatographic peaks within and between samples are called features. The definition of such mz-rt features (i.e. the result from the groupChromPeaks
method) can be accessed via the featureDefinitions
method of the XCMSnExp
class. Finally, alignment (retention time correction) can be performed using the adjustRtime
method.
The settings for any of the new analysis methods are bundled in parameter classes, one class for each method. This encapsulation of the parameters to a function into a parameter class (such as CentWaveParam
) avoids busy function calls (with many single parameters) and enables saving, reloading and reusing the settings. In addition, the parameter classes are added, along with other information to the process history of an XCMSnExp
object thus providing a detailed documentation of each processing step of an analysis, with the possibility to recall all settings of the performed analyses at any stage. In addition, validation of the parameters can be performed within the parameter object and hence is no longer required in the analysis function.
The example below illustrates the new user interface. First we load the raw data files from the faahKO
package using the readMSData2
from the MSnbase
package.
## Reading the raw data using the MSnbase package
library(xcms)
## Load 6 of the CDF files from the faahKO
cdf_files <- dir(system.file("cdf", package = "faahKO"), recursive = TRUE,
full.names = TRUE)[c(1:3, 7:9)]
## Define the sample grouping.
s_groups <- rep("KO", length(cdf_files))
s_groups[grep(cdf_files, pattern = "WT")] <- "WT"
## Define a data.frame that will be used as phenodata
pheno <- data.frame(sample_name = sub(basename(cdf_files), pattern = ".CDF",
replacement = "", fixed = TRUE),
sample_group = s_groups, stringsAsFactors = FALSE)
## Read the data.
raw_data <- readMSData2(cdf_files, pdata = new("NAnnotatedDataFrame", pheno))
We next plot the total ion chromatogram (TIC) for all files within the experiment. Note that we are iteratively sub-setting the full data per file using the filterFile
method, which, for OnDiskMSnExp
objects, is an efficient way to subset the data while ensuring that all data, including metadata, stays consistent.
library(RColorBrewer)
sample_colors <- brewer.pal(3, "Set1")[1:2]
names(sample_colors) <- c("KO", "WT")
## Subset the full raw data by file and plot the data.
tmp <- filterFile(raw_data, file = 1)
plot(x = rtime(tmp), y = tic(tmp), xlab = "retention time", ylab = "TIC",
col = paste0(sample_colors[pData(tmp)$sample_group], 80), type = "l")
for (i in 2:length(fileNames(raw_data))) {
tmp <- filterFile(raw_data, file = i)
points(rtime(tmp), tic(tmp), type = "l",
col = paste0(sample_colors[pData(tmp)$sample_group], 80))
}
legend("topleft", col = sample_colors, legend = names(sample_colors), lty = 1)
Alternatively we can use the extractChromatograms
method that extracts chromatograms from the object. In the example below we extract the base peak chromatogram (BPC) by setting aggregationFun
to "max"
and not specifying an rt
or mz
range to extract only a data subset. In contrast to the tic
and bpi
methods, this function reads the data from the raw files.
## Get the base peak chromatograms. This reads data from the files.
bpis <- extractChromatograms(raw_data, aggregationFun = "max")
plot(3, 3, pch = NA, xlim = range(unlist(lapply(bpis, rtime))),
ylim = range(unlist(lapply(bpis, intensity))), main = "BPC",
xlab = "rtime", ylab = "intensity")
for (i in 1:length(bpis)) {
points(rtime(bpis[[i]]), intensity(bpis[[i]]), type = "l",
col = paste0(sample_colors[pData(raw_data)$sample_group[i]], 80))
}
Note that we could restrict the analysis to a certain retention time range by sub-setting raw_data
with the filterRt
method.
In addition we can plot the distribution of the total ion counts per file. In contrast to sub-setting the object we split the numeric vector returned by the tic
by file using the fromFile
method that provides the mapping of the experiment’s spectra to the originating files.
## Get the total ion current by file
tc <- split(tic(raw_data), f = fromFile(raw_data))
boxplot(tc, col = paste0(sample_colors[pData(raw_data)$sample_group], 80),
ylab = "intensity", main = "Total ion current")
The tic
(and for mzML files) the bpi
methods are very fast, even for large data sets, as these information are stored in the header of the raw files avoiding the need to read the raw data from each file. Also, we could subset the whole object using the filter functions filterFile
, filterRt
or filterMz
to e.g. remove problematic samples or restrict the retention time range in which we want to perform the chromatographic peak detection.
Next we perform the chromatographic peak detection using the centWave algorithm [1]. In the example below we use most of the standard parameters, but the settings should be adjusted to each experiment individually based on e.g. the expected width of the chromatographic peaks etc.
## Defining the settings for the centWave peak detection.
cwp <- CentWaveParam(snthresh = 20, noise = 1000)
xod <- findChromPeaks(raw_data, param = cwp)
## Detecting mass traces at 25 ppm ...
## OK
## Detecting chromatographic peaks in 17808 regions of interest ...
## OK: 540 found.
## Detecting mass traces at 25 ppm ...
## OK
## Detecting chromatographic peaks in 18651 regions of interest ...
## OK: 649 found.
## Detecting mass traces at 25 ppm ...
## OK
## Detecting chromatographic peaks in 16040 regions of interest ...
## OK: 445 found.
## Detecting mass traces at 25 ppm ...
## OK
## Detecting chromatographic peaks in 18718 regions of interest ...
## OK: 548 found.
## Detecting mass traces at 25 ppm ...
## OK
## Detecting chromatographic peaks in 14431 regions of interest ...
## OK: 444 found.
## Detecting mass traces at 25 ppm ...
## OK
## Detecting chromatographic peaks in 15209 regions of interest ...
## OK: 472 found.
The identified peaks can be accessed with the chromPeaks
parameter which returns a matrix
, each line representing an identified peak. Column "sample"
specifies in which sample (i.e. file) of the experiment the peak was detected. Below we plot the signal distribution of the identified peaks per sample.
ints <- split(chromPeaks(xod)[, "into"], f = chromPeaks(xod)[, "sample"])
ints <- lapply(ints, log2)
boxplot(ints, varwidth = TRUE, col = sample_colors[pData(xod)$sample_group],
ylab = expression(log[2]~intensity), main = "Peak intensities")
After peak detection it might be advisable to evaluate whether the peak detection identified e.g. compounds known to be present in the sample. Facilitating access to the raw data has thus been one of the major aims for the updated user interface.
Next we extract the chromatogram for the rt-mz region corresponding to one detected chromatographic peak increasing the region in rt dimension by +/- 60 seconds. In addition we extract also all chromatographic peaks in that region by passing the same mz
and rt
parameters to the chromPeaks
method.
rtr <- chromPeaks(xod)[68, c("rtmin", "rtmax")]
## Increase the range:
rtr[1] <- rtr[1] - 60
rtr[2] <- rtr[2] + 60
mzr <- chromPeaks(xod)[68, c("mzmin", "mzmax")]
chrs <- extractChromatograms(xod, rt = rtr, mz = mzr)
## In addition we get all peaks detected in the same region
pks <- chromPeaks(xod, rt = rtr, mz = mzr)
Next we plot the extracted chromatogram for the data and highlight in addition the identified peaks.
## Define the limits on x- and y-dimension
xl <- range(lapply(chrs, rtime), na.rm = TRUE)
yl <- range(lapply(chrs, intensity), na.rm = TRUE)
plot(3, 3, pch = NA, main = paste(format(mzr, digits = 6), collapse = "-"),
xlab = "rt", ylab = "intensity", xlim = xl, ylim = yl)
## Plot the chromatogram per sample
for (i in 1:length(chrs)) {
points(rtime(chrs[[i]]), intensity(chrs[[i]]), type = "l",
col = sample_colors[pData(xod)$sample_group[i]])
}
## Highlight the identified chromatographic peaks.
for (i in 1:nrow(pks)) {
rect(xleft = pks[i, "rtmin"], xright = pks[i, "rtmax"], ybottom = 0,
ytop = pks[i, "maxo"],
border = paste0(sample_colors[pData(xod)$sample_group][pks[i, "sample"]], 60))
}
Note that the extractChromatograms
does return an NA
value if in a certain scan (i.e. for a specific retention time) no signal was measured in the respective mz range. This is reflected by the lines not being drawn as continuous lines in the plot above.
Next we align the samples using the obiwarp method [2]. This method does not require, in contrast to other alignment/retention time correction methods, any identified peaks and could thus also be applied to an OnDiskMSnExp
object. Note that all retention time adjustment methods do also adjust the retention times reported for the individual peaks in chromPeaks
.
## Doing the obiwarp alignment using the default settings.
xod <- adjustRtime(xod, param = ObiwarpParam())
Note that any pre-processing results can be removed at any time using a drop method, such as dropChromPeaks
, dropFeatureDefinitions
or dropAdjustedRtime
.
To evaluate the impact of the alignment we can plot again the BPC of each sample. In addition we plot the differences of the adjusted to the raw retention times per sample using the plotAdjustedRtime
function.
## Get the base peak chromatograms. This reads data from the files.
bpis <- extractChromatograms(xod, aggregationFun = "max")
par(mfrow = c(2, 1), mar = c(4.5, 4.2, 1, 0.5))
plot(3, 3, pch = NA, xlim = range(unlist(lapply(bpis, rtime))),
ylim = range(unlist(lapply(bpis, intensity))), main = "BPC",
xlab = "rtime", ylab = "intensity")
for (i in 1:length(bpis)) {
points(rtime(bpis[[i]]), intensity(bpis[[i]]), type = "l",
col = paste0(sample_colors[pData(xod)$sample_group[i]], 80))
}
## Plot also the difference of adjusted to raw retention time.
plotAdjustedRtime(xod, col = paste0(sample_colors[pData(xod)$sample_group], 80))
Too large differences between adjusted and raw retention times could indicate poorly performing samples or alignment.
The distribution of retention time differences could also be used for quality assessment.
## Calculate the difference between the adjusted and the raw retention times.
diffRt <- rtime(xod) - rtime(xod, adjusted = FALSE)
## By default, rtime and most other accessor methods return a numeric vector. To
## get the values grouped by sample we have to split this vector by file/sample
diffRt <- split(diffRt, fromFile(xod))
boxplot(diffRt, col = sample_colors[pData(xod)$sample_group],
main = "Obiwarp alignment results", ylab = "adjusted - raw rt")
The 3rd sample was used as center sample against which all other samples were aligned to, hence its adjusted retention times are identical to the raw retention times.
We are again plotting the extracted ion chromatogram for the selected peaks from above to evaluate the impact of the alignment.
After alignment, the peaks are nicely overlapping.
Next we group identified chromatographic peaks across samples. We use the peak density method [3] specifying that a chromatographic peak have to be present in at least 1/3 of the samples within each group to be combined to a mz-rt feature.
## Define the PeakDensityParam
pdp <- PeakDensityParam(sampleGroups = pData(xod)$sample_group,
maxFeatures = 300, minFraction = 0.66)
xod <- groupChromPeaks(xod, param = pdp)
The definitions of the features can be accessed with the featureDefinitions
, which lists the mz-rt space specific to a feature. Column "peakidx"
lists the indices (in the chromPeaks
matrix) of the individual chromatographic peaks belonging to the feature.
head(featureDefinitions(xod))
## DataFrame with 6 rows and 10 columns
## mzmed mzmin mzmax rtmed rtmin rtmax npeaks
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## FT001 205.0 205.0 205.0 2788.485 2785.576 2791.067 6
## FT002 207.1 207.1 207.1 2716.355 2712.648 2718.939 3
## FT003 208.9 208.9 208.9 2606.350 2589.015 2667.274 3
## FT004 220.9 220.9 221.0 3234.088 3233.777 3247.385 3
## FT005 228.9 228.9 228.9 2659.407 2654.744 2664.070 2
## FT006 241.2 241.2 241.2 2911.009 2838.057 2955.814 5
## KO WT peakidx
## <numeric> <numeric> <list>
## FT001 3 3 75,635,1252,...
## FT002 2 1 598,1224,2232
## FT003 1 2 558,1642,2212
## FT004 1 2 1344,1887,2829
## FT005 2 0 24,574
## FT006 1 2 642,659,675,...
To extract values for the features, the featureValues
method can be used. This method returns a matrix with rows being the features and column the samples. The value
parameter allows to specify the value that should be returned. Below we extract the "into"
signal, i.e. the per-peak integrated intensity for each feature.
## Extract the "into" peak integrated signal.
head(featureValues(xod, value = "into"))
## ko15.CDF ko16.CDF ko18.CDF wt15.CDF wt16.CDF wt18.CDF
## FT001 1924712.02 1757150.96 1714581.78 2129885.09 1634341.986 1810112.276
## FT002 NA 451863.66 337473.31 NA 360908.934 NA
## FT003 NA 18319.89 NA 24254.37 7860.995 NA
## FT004 NA NA 13012.98 38930.94 NA 9087.955
## FT005 19707.81 15966.13 NA NA NA NA
## FT006 NA 12387.22 NA NA 12759.445 9455.730
After correspondence there will always be features that do not include peaks from every sample (being it that the peak finding algorithm failed to identify a peak or that no signal was measured in the respective mz-rt area). For such features an NA
is returned by the featureValues
method. Here, xcms
allows to infer values for such missing peaks using the fillChromPeaks
method. This method integrates in files where a peak was not found the signal from the mz-rt area where it is expected and adds it to the chromPeaks
matrix. Such filled-in peaks have a value of 1
in the "is_filled"
column of the chromPeaks
matrix.
## Fill in peaks with default settings. Settings can be adjusted by passing
## a FillChromPeaksParam object to the method.
xod <- fillChromPeaks(xod)
head(featureValues(xod, value = "into"))
## ko15.CDF ko16.CDF ko18.CDF wt15.CDF wt16.CDF wt18.CDF
## FT001 1924712.02 1757150.96 1714581.776 2129885.09 1634341.986 1810112.276
## FT002 362020.63 451863.66 337473.308 353054.86 360908.934 267456.999
## FT003 NA 18319.89 NA 24254.37 7860.995 NA
## FT004 10434.64 NA 13012.975 38930.94 23296.981 9087.955
## FT005 19707.81 15966.13 6429.364 NA NA 8981.056
## FT006 NA 12387.22 6494.872 NA 12759.445 9455.730
Not for all missing peaks a value could be integrated (because at the respective location no measurements are available). The peak area from which signal is to be extracted can also be increased modifying the settings by passing a FillChromPeaksParam
object.
Next we inspect the processHistory
of the analysis. As described earlier, this records all (major) processing steps along with the corresponding parameter classes.
## List the full process history
processHistory(xod)
## [[1]]
## Object of class "XProcessHistory"
## type: Peak detection
## date: Mon Apr 24 18:29:07 2017
## info:
## fileIndex: 1,2,3,4,5,6
## Parameter class: CentWaveParam
##
## [[2]]
## Object of class "XProcessHistory"
## type: Retention time correction
## date: Mon Apr 24 18:29:48 2017
## info:
## fileIndex: 1,2,3,4,5,6
## Parameter class: ObiwarpParam
##
## [[3]]
## Object of class "XProcessHistory"
## type: Peak grouping
## date: Mon Apr 24 18:30:04 2017
## info:
## fileIndex: 1,2,3,4,5,6
## Parameter class: PeakDensityParam
##
## [[4]]
## Object of class "XProcessHistory"
## type: Missing peak filling
## date: Mon Apr 24 18:30:07 2017
## info:
## fileIndex: 1,2,3,4,5,6
## Parameter class: FillChromPeaksParam
It is also possible to extract specific processing steps by specifying its type. Available types can be listed with the processHistoryTypes
function. Below we extract the parameter class for the alignment/retention time adjustment step.
ph <- processHistory(xod, type = "Retention time correction")
## Access the parameter
processParam(ph[[1]])
## Object of class: ObiwarpParam
## Parameters:
## binSize: 1
## centerSample:
## response: 1
## distFun: cor_opt
## gapInit: 0.3
## gapExtend: 2.4
## factorDiag: 2
## factorGap: 1
## localAlignment: FALSE
## initPenalty: 0
As described earlier, we can remove specific analysis results at any stage. Below we remove the results from the alignment. Since the correspondence was performed after that processing step its results will be removed too leaving us only with the results from the peak detection step.
## Remove the alignment results
xod <- dropAdjustedRtime(xod)
processHistory(xod)
## [[1]]
## Object of class "XProcessHistory"
## type: Peak detection
## date: Mon Apr 24 18:29:07 2017
## info:
## fileIndex: 1,2,3,4,5,6
## Parameter class: CentWaveParam
We can now use a different method to perform the alignment. The peak groups alignment method bases the alignment of the samples on chromatographic peaks present in most samples (so called well behaved peaks). This means we have to perform first an initial correspondence analysis to group peaks within and across samples.
## Define the parameter for the correspondence
pdparam <- PeakDensityParam(sampleGroups = pData(xod)$sample_group,
minFraction = 0.7, maxFeatures = 100)
xod <- groupChromPeaks(xod, param = pdparam)
Before performing the alignment we can also inspect which peak groups might be selected for alignment based on the provided PeakGroupsParam
object.
## Create the parameter class for the alignment
pgparam <- PeakGroupsParam(minFraction = 0.9, span = 0.4)
## Extract the matrix with (raw) retention times for the peak groups that would
## be used for alignment.
adjustRtimePeakGroups(xod, param = pgparam)
## ko15.CDF ko16.CDF ko18.CDF wt15.CDF wt16.CDF wt18.CDF
## FT006 2712.647 2718.906 2714.213 2712.647 2718.906 2686.043
## FT096 2786.200 2792.459 2787.766 2786.200 2787.764 2795.590
## FT001 2784.635 2790.894 2787.766 2784.635 2790.894 2790.895
## FT048 2797.154 2812.803 2798.720 2797.154 2873.837 2801.849
## FT013 2923.916 2925.480 2923.917 2923.915 2927.045 2928.611
## FT086 3368.362 3313.589 3200.913 3218.127 3235.341 3207.172
## FT079 3341.758 3354.277 3349.584 3340.193 3355.842 3355.843
## FT080 3341.758 3354.277 3349.584 3341.758 3355.842 3355.843
## FT036 3491.994 3502.948 3495.125 3490.429 3502.948 3491.994
## FT026 3493.559 3502.948 3495.125 3490.429 3502.948 3504.514
## FT072 3512.338 3531.117 3523.294 3513.903 3529.552 3531.118
## FT071 3515.468 3529.552 3524.859 3513.903 3529.552 3532.683
## FT025 3618.755 3631.274 3626.581 3614.060 3632.839 3635.970
## FT028 3657.879 3670.398 3664.140 3657.879 3671.963 3671.964
## FT004 3656.314 3671.963 3665.705 3657.879 3671.963 3668.834
## FT078 3662.574 3690.742 3706.393 3661.009 3693.872 3722.042
## FT084 3703.262 3718.912 3712.653 3700.132 3720.476 3712.652
## FT074 3747.081 3767.425 3768.991 3740.821 3767.425 3781.510
## FT073 3748.646 3767.425 3767.426 3740.821 3768.990 3779.945
## FT057 3859.758 3875.407 3870.713 3847.238 3876.972 3875.407
If we are not happy with these peak groups (e.g. because we don’t have a peak group for a rather large time span along the retention time axis) we can try different settings. In addition, we could also manually select certain peak groups, e.g. for internal controls, and add this matrix with the peakGroupsMatrix
method to the PeakGroupsParam
class. Below we just use pgparam
we defined and perform the alignment. This will use the peak groups matrix from above.
## Perform the alignment using the peak groups method.
xod <- adjustRtime(xod, param = pgparam)
We can now also plot the difference between adjusted and raw retention times. If alignment was performed using the peak groups method, also these peak groups are highlighted in the plot.
plotAdjustedRtime(xod, col = sample_colors[pData(xod)$sample_group])
Methods for data analysis from the original xcms
code have been renamed to avoid potential confusions:
Chromatographic peak detection: findChromPeaks
instead of findPeaks
: for new functions and methods the term peak is avoided as much as possible, as it is usually used to describe a mass peak in mz dimension. To clearly distinguish between these peaks and peaks in retention time space, the latter are referred to as chromatographic peak, or chromPeak
.
Correspondence: groupChromPeaks
instead of group
to clearly indicate what is being grouped. Group might be a sample group or a peak group, the latter being referred to also by (mz-rt) feature.
Alignment: adjustRtime
instead of retcor
for retention time correction. The word cor in retcor might be easily misinterpreted as correlation instead of correction.
OnDiskMSnExp
This object is defined and documented in the MSnbase
package. In brief, it is a container for the full raw data from an MS-based experiment. To keep the memory footprint low the mz and intensity values are only loaded from the raw data files when required. The OnDiskMSnExp
object replaces the xcmsRaw
object.
XCMSnExp
The XCMSnExp
class extends the OnDiskMSnExp
object from the MSnbase
package and represents a container for the xcms-based preprocessing results while (since it inherits all functionality from its parent class) keeping a direct relation to the (raw) data on which the processing was performed. An additional slot .processHistory
in the object allows to keep track of all performed processing steps. Each analysis method, such as findChromPeaks
adds an XProcessHistory
object which includes also the parameter class passed to the analysis method. Hence not only the time and type of the analysis, but its exact settings are reported within the XCMSnExp
object. The XCMSnExp
is thus equivalent to the xcmsSet
from the original xcms
implementation, but keeps in addition a link to the raw data on which the preprocessing was performed.
Chromatogram
The Chromatogram
class allows a data representation that is orthogonal to the Spectrum
class defined in MSnbase
. The Chromatogram
class stores retention time and intensity duplets and is designed to accommodate most use cases, from total ion chromatogram, base peak chromatogram to extracted ion chromatogram and SRM/MRM ion traces.
Chromatogram
objects can be extracted from XCMSnExp
objects using the extractChromatograms
method.
Note that this class is still considered developmental and might thus undergo some changes in the future.
The binning/profile matrix generation functions have been completely rewritten. The new binYonX
function replaces the binning of intensity values into bins defined by their m/z values implemented in the profBin
, profBinLin
and profBinLinBase
methods. The binYonX
function provides also additional functionality:
Breaks for the bins can be defined based on either the number of desired bins (nBins
) or the size of a bin (binSize
). In addition it is possible to provide a vector with pre-defined breaks. This allows to bin data from multiple files or scans on the same bin-definition.
The function returns a list with element y
containing the binned values and element x
the bin mid-points.
Values in input vector y
can be aggregated within each bin with different methods: max
, min
, sum
and mean
.
The index of the largest (or smallest for method
being “min”) within each bin can be returned by setting argument returnIndex
to TRUE
.
Binning can be performed on single or multiple sub-sets of the input vectors using the fromIdx
and toIdx
arguments. This replaces the M methods (such as profBinM
). These sub-sets can be overlapping.
The missing value imputation logic inherently build into the profBinLin
and profBinLinBase
methods has been implemented in the imputeLinInterpol
function.
The example below illustrates the binning and imputation with the binYtoX
and imputeLinInterpol
functions. After binning of the test vectors below some of the bins have missing values, for which we impute a value using imputeLinInterpol
. By default, binYonX
selects the largest value within each bin, but other aggregation methods are also available (i.e. min, max, mean, sum).
## Defining the variables:
set.seed(123)
X <- sort(abs(rnorm(30, mean = 20, sd = 25))) ## 10
Y <- abs(rnorm(30, mean = 50, sd = 30))
## Bin the values in Y into 20 bins defined on X
res <- binYonX(X, Y, nBins = 22)
res
## $x
## [1] 3.207154 6.066022 8.924891 11.783759 14.642628 17.501497 20.360365
## [8] 23.219234 26.078102 28.936971 31.795840 34.654708 37.513577 40.372445
## [15] 43.231314 46.090183 48.949051 51.807920 54.666788 57.525657 60.384526
## [22] 63.243394
##
## $y
## [1] 76.85377 76.34400 48.14265 29.15879 43.76248 NA 115.06868
## [8] 86.23886 NA 73.39895 49.14360 NA 91.05807 43.22687
## [15] NA NA NA 95.49412 NA NA 67.53841
## [22] 56.47825
As a result we get a list
with the bin mid-points ($x
) and the binned y
values ($y
).
Next we use two different imputation approaches, a simple linear interpolation and the linear imputation approach that was defined in the profBinLinBase
method. The latter performs linear interpolation only considering a certain neighborhood of missing values otherwise replacing the NA
with a base value.
## Plot the actual data values.
plot(X, Y, pch = 16, ylim = c(0, max(Y)))
## Visualizing the bins
abline(v = breaks_on_nBins(min(X), max(X), nBins = 22), col = "grey")
## Define colors:
point_colors <- paste0(brewer.pal(4, "Set1"), 80)
## Plot the binned values.
points(x = res$x, y = res$y, col = point_colors[1], pch = 15)
## Perform the linear imputation.
res_lin <- imputeLinInterpol(res$y)
points(x = res$x, y = res_lin, col = point_colors[2], type = "b")
## Perform the linear imputation "linbase"
res_linbase <- imputeLinInterpol(res$y, method = "linbase")
points(x = res$x, y = res_linbase, col = point_colors[3], type = "b", lty = 2)
The difference between the linear interpolation method lin
and linbase
is that the latter only performs the linear interpolation in a pre-defined neighborhood of the bin with the missing value (1
by default). The other missing values are set to a base value corresponding to half of the smallest bin value. Both methods thus yield same results, except for bins 15-17 (see Figure above).
The core logic from the chromatographic peak detection methods findPeaks.centWave
, findPeaks.massifquant
, findPeaks.matchedFilter
and findPeaks.MSW
and from all alignment (group.*
) and correspondence (retcor.*
) methods has been extracted and put into functions with the common prefix do_findChromPeaks
, do_adjustRtime
and do_groupChromPeaks
, respectively, with the aim, as detailed in issue #30, to separate the core logic from the analysis methods invoked by the users to enable also the use these methods using base R parameters (i.e. without specific classes containing the data such as the xcmsRaw
class). This simplifies also the re-use of these functions in other packages and simplifies the future implementation of the peak detection algorithms for e.g. the MSnExp
or OnDiskMSnExp
objects from the MSnbase
Bioconductor package. The implemented functions are:
do_findChromPeaks_centWave
: peak density and wavelet based peak detection for high resolution LC/MS data in centroid mode [1].do_findChromPeaks_matchedFilter
: identification of peak in the chromatographic domain based on matched filtration [3].do_findChromPeaks_massifquant
: identification of peaks using Kalman filters.do_findChromPeaks_MSW
: single spectrum, non-chromatographic peak detection.do_adjustRtime_peakGroups
: perform sample alignment (retention time correction) using alignment of well behaved chromatographic peaks that are present in most samples (and are expected to have the same retention time).do_groupChromPeaks_density
: perform chromatographic peak grouping (within and across samples) based on the density distribution of peaks along the retention time axis.do_groupChromPeaks_nearest
: groups peaks across samples similar to the method implemented in mzMine.do_groupChromPeaks_mzClust
: performs high resolution correspondence on single spectra samples.One possible drawback from the introduction of this new layer is, that more objects get copied by R which could eventually result in a larger memory demand or performance decrease (while no such was decrease was observed up to now).
[
subsetting method for xcmsRaw
objects that enables to subset an xcmsRaw
object to specific scans/spectra.profMat
method to extract the profile matrix from the xcmsRaw
object. This method should be used instead of directly accessing the @env$profile
slot, as it will create the profile matrix on the fly if it was not pre-calculated (or if profile matrix generation settings have been changed).profBinLin
).From xcms
version 1.51.1 on the new binning functions are used, thus, the bug described here are fixed.
Two bugs are present in the profBinLin
method (reported as issues #46 and #49 on github) which are fixed in the new binYonX
and imputeLinInterpol
functions:
profBinLin
can be wrong (i.e. not being the max value within that bin, but the first).The profBinLin
method is used in findPeaks.matchedFilter
if the profile method is set to “binlin”.
The example below illustrates both differences.
## Define a vector with empty values at the end.
X <- 1:11
set.seed(123)
Y <- sort(rnorm(11, mean = 20, sd = 10))
Y[9:11] <- NA
nas <- is.na(Y)
## Do interpolation with profBinLin:
resX <- xcms:::profBinLin(X[!nas], Y[!nas], 5, xstart = min(X),
xend = max(X))
## Warning: Use of 'profBinLin' is deprecated! Use 'binYonX' with
## 'imputeLinInterpol' instead.
resX
## [1] 7.349388 15.543380 21.292877 0.000000 0.000000
res <- binYonX(X, Y, nBins = 5L, shiftByHalfBinSize = TRUE)
resM <- imputeLinInterpol(res$y, method = "lin",
noInterpolAtEnds = TRUE)
resM
## [1] 13.13147 15.54338 21.29288 24.60916 0.00000
Plotting the results helps to better compare the differences. The black points in the figure below represent the actual values of Y
and the grey vertical lines the breaks defining the bins. The blue lines and points represent the result from the profBinLin
method. The bin values for the first and 4th bin are clearly wrong. The green colored points and lines represent the results from the binYonX
and imputeLinInterpol
functions (showing the correct binning and interpolation).
plot(x = X, y = Y, pch = 16, ylim = c(0, max(Y, na.rm = TRUE)),
xlim = c(0, 12))
## Plot the breaks
abline(v = breaks_on_nBins(min(X), max(X), 5L, TRUE), col = "grey")
## Result from profBinLin:
points(x = res$x, y = resX, col = "blue", type = "b")
## Results from imputeLinInterpol
points(x = res$x, y = resM, col = "green", type = "b",
pch = 4, lty = 2)
Note that by default imputeLinInterpol
would also interpolate missing values at the beginning and the end of the provided numeric vector. This can be disabled (to be compliant with profBinLin
) by setting parameter noInterpolAtEnds
to TRUE
(like in the example above).
do_findChromPeaks_matchedFilter
, respectively findPeaks.matchedFilter
.The original findPeaks.matchedFilter
(up to version 1.49.7) had several shortcomings and bugs that have been fixed in the new do_findChromPeaks_matchedFilter
method:
The internal iterative processing of smaller chunks of the full data (also referred to as iterative buffering) could result, for some bin (step) sizes to unstable binning results (discussed in issue #47 on github): calculation of the breaks, or to be precise, the actually used bin size was performed in each iteration and could lead to slightly different sizes between iterations (due to rounding errors caused by floating point number representations in C).
The iterative buffering raises also a conceptual issue when linear interpolation is performed to impute missing values: the linear imputation will only consider values within the actually processed buffer and can thus lead to wrong or inaccurate imputations.
The profBinLin
implementation contains two bugs, one that can result in failing to identify the maximal value in the first and last bin (see issue #46) and one that fails to assign a value to a bin (issue #49). Both are fixed in the do_findChromPeaks_matchedFilter
implementation.
A detailed description of tests comparing all implementations is available in issue #52 on github. Note also that in course of these changes also the getEIC
method has been updated to use the new binning and missing value imputation function.
While it is strongly discouraged, it is still possible to use to old code (from 1.49.7) by calling useOriginalCode(TRUE)
.
findPeaks.massifquant
scanrange
was ignored in the original old code (issue #61).matrix
if withWave
was 0
and a xcmsPeaks
object otherwise. The updated version returns always an xcmsPeaks
object (issue #60).Retention time correction using the obiwarp method uses the profile matrix (i.e. intensities binned in discrete bins along the mz axis). Profile matrix generation uses now the binYonX
method which fixed some problems in the original binning and linear interpolation methods. Thus results might be slightly different.
Also, the retcor.obiwarp
method reports (un-rounded) adjusted retention times, but adjusts the retention time of eventually already identified peaks using rounded adjusted retention times. The new adjustRtime
method(s) does adjust identified peaks using the reported adjusted retention times (not rounded). This guarantees that e.g. removing retention time adjustment/alignment results from an object restores the object to its initial state (i.e. the adjusted retention times of the identified peaks are reverted to the retention times before alignment). See issue #122 for more details.
retcor.peaksgroups
: change in the way how well behaved peak groups are orderedThe retcor.peakgroups
defines first the chromatographic peak groups that are used for the alignment of all spectra. Once these are identified, the retention time of the peak with the highest intensity in a sample for a given peak group is returned and the peak groups are ordered increasingly by retention time (which is required for the later fitting of either a polynomial or a linear model to the data). The selection of the retention time of the peak with the highest intensity within a feature (peak group) and samples, denoted as representative peak for a given feature in a sample, ensures that only the retention time of a single peak per sample and feature is selected (note that multiple chromatographic peaks within the same sample can be assigned to a feature). In the original code the ordering of the peak groups was however performed using the median retention time of the complete peak group (which includes also potential additional peaks per sample). This has been changed and the features are ordered now by the median retention time across samples of the representative chromatographic peaks.
scanrange
parameter in all findPeaks
methodsThe scanrange
in the findPeaks
methods is supposed to enable the peak detection only within a user-defined range of scans. This was however not performed in each method. Due to a bug in findPeaks.matchedFilter
’s original code the argument was ignored, except if the upper scan number of the user defined range was larger than the total number of available scans (see issue #63). In findPeaks.massifquant
the argument was completely ignored (see issue #61) and, while the argument was considered in findPeaks.centWave
and feature detection was performed within the specified scan range, but the original @scantime
slot was used throughout the code instead of just the scan times for the specified scan indices (see issue #64).
These problems have been fixed in version 1.51.1 by first sub-setting the xcmsRaw
object (using the [
method) before actually performing the feature detection.
fillPeaks
(fillChromPeaks
) differencesIn the original fillPeaks.MSW
, the mz range from which the signal is to be integrated was defined using
mzarea <- seq(which.min(abs(mzs - peakArea[i, "mzmin"])),
which.min(abs(mzs - peakArea[i, "mzmax"])))
Depending on the data this could lead to the inclusion of signal in the integration that are just outside of the mz range. In the new fillChromPeaks
method signal is integrated only for mz values >= mzmin and <= mzmax thus ensuring that only signal is used that is truly within the peak area defined by columns "mzmin"
, "mzmax"
, "rtmin"
and "rtmax"
.
Also, the fillPeaks.chrom
method did return "into"
and "maxo"
values of 0
if no signal was found in the peak area. The new method does not integrate any signal in such cases and does not fill in that peak.
See also issue #130 for more information.
These changes and updates will not have any large impact on the day-to-day use of xcms
and are listed here for completeness.
xcms
version 1.51.1 on the default methods from the mzR
package are used for data import. Besides ensuring easier maintenance, this enables also data import from gzipped mzML files.Here we list all of the functions and related files that are deprecated.
xcmsParallelSetup
, xcmsPapply
, xcmsClusterApply
: use BiocParallel
package instead to setup and perform parallel processing, either via the BPPARAM
parameter to function and methods, or by calling register
to globally set parallel processing.
profBin
, profBinM
, profBinLin
, profBinLinM
, profBinLinBase
, profBinLinBaseM
: replaced by the binYonX
and imputeLinInterpol
functions. Also, to create or extract the profile matrix from an xcmsRaw
object, the profMat
method.
xcmsParallelSetup
(Deprecated.R)xcmsPapply
(Deprecated.R)xcmsClusterApply
(Deprecated.R)profBin
(c.R)profBinM
(c.R)profBinLin
(c.R)profBinLinM
(c.R)profBinLinBase
(c.R)profBinLinBaseM
(c.R)1. Tautenhahn R, Böttcher C, Neumann S: Highly sensitive feature detection for high resolution LC/MS. BMC Bioinformatics 2008, 9:504.
2. Prince JT, Marcotte EM: Chromatographic alignment of ESI-LC-MS proteomics data sets by ordered bijective interpolated warping. Analytical chemistry 2006, 78:6140–6152.
3. Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G: XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical chemistry 2006, 78:779–787.