HDCytoData 1.4.0
The HDCytoData data package contains a set of publicly available high-dimensional flow cytometry and mass cytometry (CyTOF) datasets, formatted into SummarizedExperiment and flowSet Bioconductor object formats. The data objects are hosted on the Bioconductor ExperimentHub web resource.
The objects contain the cell-level expression values, as well as row and column metadata, including sample IDs, group IDs, true cell population labels or cluster labels (where available), channel names, protein marker names, and protein marker classes (cell type or cell state).
These datasets have been used for benchmarking purposes in our previous work and publications, e.g. to benchmark clustering algorithms or methods for differential analysis. They are provided here in the SummarizedExperiment and flowSet formats to make them easier to access.
The package contains the following datasets, which can be grouped into datasets useful for benchmarking either (i) clustering algorithms or (ii) methods for differential analysis.
Additional details on each dataset are included in the help files for the datasets. For each dataset, this includes a description of the dataset (biological context, number of samples, number of cells, number of manually gated cell populations, number and classes of protein markers, etc.), as well as an explanation of the object structures, and references and raw data sources.
The help files can be accessed by the dataset names, e.g. ?Bodenmiller_BCR_XL.
This section shows how to load the datasets, using one of the datasets (Bodenmiller_BCR_XL) as an example.
The datasets can be loaded either with named functions referring directly to the object names, or by using the ExperimentHub interface. Both methods are demonstrated below.
See the help files (e.g. ?Bodenmiller_BCR_XL) for details about the structure of the SummarizedExperiment or flowSet objects.
Load the datasets using named functions:
suppressPackageStartupMessages(library(HDCytoData))
## snapshotDate(): 2019-04-29
# Load 'SummarizedExperiment' object using named function
Bodenmiller_BCR_XL_SE()
## snapshotDate(): 2019-04-29
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache
## 'EH2254 : 2254'
## class: SummarizedExperiment
## dim: 172791 35
## metadata(2): experiment_info n_cells
## assays(1): exprs
## rownames: NULL
## rowData names(4): group_id patient_id sample_id population_id
## colnames(35): Time Cell_length ... DNA-1 DNA-2
## colData names(3): channel_name marker_name marker_class
# Load 'flowSet' object using named function
Bodenmiller_BCR_XL_flowSet()
## snapshotDate(): 2019-04-29
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache
## 'EH2255 : 2255'
## A flowSet with 16 experiments.
##
## column names:
## Time Cell_length CD3(110:114)Dd CD45(In115)Dd BC1(La139)Dd BC2(Pr141)Dd pNFkB(Nd142)Dd pp38(Nd144)Dd CD4(Nd145)Dd BC3(Nd146)Dd CD20(Sm147)Dd CD33(Nd148)Dd pStat5(Nd150)Dd CD123(Eu151)Dd pAkt(Sm152)Dd pStat1(Eu153)Dd pSHP2(Sm154)Dd pZap70(Gd156)Dd pStat3(Gd158)Dd BC4(Tb159)Dd CD14(Gd160)Dd pSlp76(Dy164)Dd BC5(Ho165)Dd pBtk(Er166)Dd pPlcg2(Er167)Dd pErk(Er168)Dd BC6(Tm169)Dd pLat(Er170)Dd IgM(Yb171)Dd pS6(Yb172)Dd HLA-DR(Yb174)Dd BC7(Lu175)Dd CD7(Yb176)Dd DNA-1(Ir191)Dd DNA-2(Ir193)Dd group_id patient_id sample_id population_id
Alternatively, load the datasets using the ExperimentHub interface:
# Create an ExperimentHub instance
ehub <- ExperimentHub()
## snapshotDate(): 2019-04-29
# Query ExperimentHub instance to find datasets
query(ehub, "HDCytoData")
## ExperimentHub with 16 records
## # snapshotDate(): 2019-04-29
## # $dataprovider: NA
## # $species: Homo sapiens, Mus musculus
## # $rdataclass: SummarizedExperiment, flowSet
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH2240"]]'
##
## title
## EH2240 | Levine_32dim_SE
## EH2241 | Levine_32dim_flowSet
## EH2242 | Levine_13dim_SE
## EH2243 | Levine_13dim_flowSet
## EH2244 | Samusik_01_SE
## ... ...
## EH2251 | Mosmann_rare_flowSet
## EH2252 | Krieg_Anti_PD_1_SE
## EH2253 | Krieg_Anti_PD_1_flowSet
## EH2254 | Bodenmiller_BCR_XL_SE
## EH2255 | Bodenmiller_BCR_XL_flowSet
# Load 'SummarizedExperiment' object using index of dataset
ehub[["EH2254"]]
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache
## 'EH2254 : 2254'
## class: SummarizedExperiment
## dim: 172791 35
## metadata(2): experiment_info n_cells
## assays(1): exprs
## rownames: NULL
## rowData names(4): group_id patient_id sample_id population_id
## colnames(35): Time Cell_length ... DNA-1 DNA-2
## colData names(3): channel_name marker_name marker_class
# Load 'flowSet' object using index of dataset
ehub[["EH2255"]]
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache
## 'EH2255 : 2255'
## A flowSet with 16 experiments.
##
## column names:
## Time Cell_length CD3(110:114)Dd CD45(In115)Dd BC1(La139)Dd BC2(Pr141)Dd pNFkB(Nd142)Dd pp38(Nd144)Dd CD4(Nd145)Dd BC3(Nd146)Dd CD20(Sm147)Dd CD33(Nd148)Dd pStat5(Nd150)Dd CD123(Eu151)Dd pAkt(Sm152)Dd pStat1(Eu153)Dd pSHP2(Sm154)Dd pZap70(Gd156)Dd pStat3(Gd158)Dd BC4(Tb159)Dd CD14(Gd160)Dd pSlp76(Dy164)Dd BC5(Ho165)Dd pBtk(Er166)Dd pPlcg2(Er167)Dd pErk(Er168)Dd BC6(Tm169)Dd pLat(Er170)Dd IgM(Yb171)Dd pS6(Yb172)Dd HLA-DR(Yb174)Dd BC7(Lu175)Dd CD7(Yb176)Dd DNA-1(Ir191)Dd DNA-2(Ir193)Dd group_id patient_id sample_id population_id
Once the datasets have been loaded from ExperimentHub, they can be used like any other object within an R session. For example, using the SummarizedExperiment form of the dataset loaded above:
# Load dataset in 'SummarizedExperiment' format
d_SE <- Bodenmiller_BCR_XL_SE()
## snapshotDate(): 2019-04-29
## see ?HDCytoData and browseVignettes('HDCytoData') for documentation
## downloading 0 resources
## loading from cache
## 'EH2254 : 2254'
# Inspect the object
d_SE
## class: SummarizedExperiment
## dim: 172791 35
## metadata(2): experiment_info n_cells
## assays(1): exprs
## rownames: NULL
## rowData names(4): group_id patient_id sample_id population_id
## colnames(35): Time Cell_length ... DNA-1 DNA-2
## colData names(3): channel_name marker_name marker_class
assay(d_SE)[1:6, 1:6]
## Time Cell_length CD3 CD45 BC1 BC2
## [1,] 33073 30 120.823265 454.6009 576.8983 10.0057297
## [2,] 36963 35 135.106171 624.6824 564.6299 5.5991135
## [3,] 37892 30 -1.664619 601.0125 3077.2668 1.7105789
## [4,] 41345 58 115.290245 820.7125 6088.5967 22.5641403
## [5,] 42475 35 14.373802 326.6405 4606.6929 -0.6584854
## [6,] 44620 31 37.737877 557.0137 4854.1519 -0.4517288
rowData(d_SE)
## DataFrame with 172791 rows and 4 columns
## group_id patient_id sample_id population_id
## <factor> <factor> <factor> <factor>
## 1 BCR-XL patient1 patient1_BCR-XL CD4 T-cells
## 2 BCR-XL patient1 patient1_BCR-XL CD4 T-cells
## 3 BCR-XL patient1 patient1_BCR-XL NK cells
## 4 BCR-XL patient1 patient1_BCR-XL CD4 T-cells
## 5 BCR-XL patient1 patient1_BCR-XL CD8 T-cells
## ... ... ... ... ...
## 172787 Reference patient8 patient8_Reference CD8 T-cells
## 172788 Reference patient8 patient8_Reference CD4 T-cells
## 172789 Reference patient8 patient8_Reference CD4 T-cells
## 172790 Reference patient8 patient8_Reference CD4 T-cells
## 172791 Reference patient8 patient8_Reference CD8 T-cells
colData(d_SE)
## DataFrame with 35 rows and 3 columns
## channel_name marker_name marker_class
## <character> <character> <factor>
## Time Time Time none
## Cell_length Cell_length Cell_length none
## CD3 CD3(110:114)Dd CD3 type
## CD45 CD45(In115)Dd CD45 type
## BC1 BC1(La139)Dd BC1 none
## ... ... ... ...
## HLA-DR HLA-DR(Yb174)Dd HLA-DR type
## BC7 BC7(Lu175)Dd BC7 none
## CD7 CD7(Yb176)Dd CD7 type
## DNA-1 DNA-1(Ir191)Dd DNA-1 none
## DNA-2 DNA-2(Ir193)Dd DNA-2 none
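The row metadata shown above can be tabulated to check how many cells each sample and population contributes. The following is a minimal sketch in base R, assuming the row metadata has been converted to a plain data frame (for the real object, something like as.data.frame(rowData(d_SE)) would be one way to do this). The five-row row_data table below is a made-up stand-in that only mimics the column names and factor levels printed above; it is not the real data.

```r
# Toy stand-in for the row metadata of d_SE (illustrative values only).
# Column names follow the rowData output shown above.
row_data <- data.frame(
  group_id = factor(c("BCR-XL", "BCR-XL", "BCR-XL", "Reference", "Reference")),
  patient_id = factor(c("patient1", "patient1", "patient1", "patient8", "patient8")),
  sample_id = factor(c("patient1_BCR-XL", "patient1_BCR-XL", "patient1_BCR-XL",
                       "patient8_Reference", "patient8_Reference")),
  population_id = factor(c("CD4 T-cells", "CD4 T-cells", "NK cells",
                           "CD8 T-cells", "CD4 T-cells"))
)

# Number of cells per sample
n_cells_per_sample <- table(row_data$sample_id)

# Number of cells per sample and population (samples in rows, populations in columns)
n_cells_per_population <- table(row_data$sample_id, row_data$population_id)

print(n_cells_per_sample)
print(n_cells_per_population)
```

On the real object, the per-sample counts could be compared against the n_cells entry listed in the object's metadata.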
Note that flow and mass cytometry data should be transformed prior to performing any downstream analyses, such as clustering. A standard transform is the inverse hyperbolic sine (asinh) with the cofactor parameter set to 5 for mass cytometry (CyTOF) data, or 150 for flow cytometry data (see Bendall et al. 2011, Supplementary Figure S2).
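As a rough illustration of this transform, the sketch below applies asinh(x / cofactor) to a small toy matrix in base R. The helper name transform_asinh is hypothetical and not part of HDCytoData. The matrix values are rounded from the first rows of the assay output shown earlier, and restricting the transform to columns whose marker class is not "none" is an assumption made for this example rather than something the package prescribes.

```r
# Toy expression matrix standing in for assay(d_SE): rows are cells, columns are channels.
# Values are rounded from the assay output shown earlier in this section.
exprs <- matrix(
  c(33073, 120.8, 454.6,  576.9,
    36963, 135.1, 624.7,  564.6,
    37892,  -1.7, 601.0, 3077.3),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("Time", "CD3", "CD45", "BC1"))
)

# Marker classes for the four columns, as listed in the colData output
marker_class <- c("none", "type", "type", "none")

# Hypothetical helper: asinh-transform only the protein marker columns,
# leaving columns with marker class "none" (e.g. Time, barcodes) unchanged
transform_asinh <- function(x, marker_class, cofactor = 5) {
  stopifnot(ncol(x) == length(marker_class))
  is_marker <- marker_class != "none"
  x[, is_marker] <- asinh(x[, is_marker] / cofactor)
  x
}

# Cofactor 5 for mass cytometry (CyTOF) data; 150 would be used for flow cytometry
exprs_t <- transform_asinh(exprs, marker_class, cofactor = 5)
print(exprs_t)
```

With the real object, the same helper could be called with assay(d_SE) and colData(d_SE)$marker_class as inputs.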
Interactive visualizations to explore the datasets can be generated from the SummarizedExperiment objects using the iSEE (“Interactive SummarizedExperiment Explorer”) package, available from Bioconductor (Soneson, Lun, Marini, and Rue-Albrecht, 2018). iSEE provides a Shiny-based graphical user interface to explore single-cell datasets stored in the SummarizedExperiment format. For more details, see the iSEE package vignettes.