With advances in Cancer Genomics, Mutation Annotation Format (MAF) is being widley accepted and used to store somatic variants detected. The Cancer Genome Atlas Project has seqenced over 30 different cancers with sample size of each cancer type being over 200. Resulting data consisting of somatic variants are stored in the form of Mutation Annotation Format. This package attempts to summarize, analyze, annotate and visualize MAF files in an efficient manner from either TCGA sources or any in-house studies as long as the data is in MAF format.
Please cite the below if you find this tool useful for you.
Mayakonda, A. & Koeffler, H.P. Maftools: Efficient analysis, visualization and summarization of MAF files
from large-scale cohort based cancer studies. bioRxiv (2016). doi: http://dx.doi.org/10.1101/052662
MAF files contain many fields ranging from chromosome names to cosmic annotations. However most of the analysis in maftools uses following fields.
Mandatoty fields: Hugo_Symbol, Chromosome, Start_Position, End_Position, Variant_Classification, Variant_Type and Tumor_Sample_Barcode.
Recommended optional fields: non MAF specific fields containing vaf and amino acid change information.
Complete specififcation of MAF files can be found on NCI TCGA page.
This vignette demonstrates the usage and application of maftools on an example MAF file from TCGA LAML cohort 1.
read.maf
reads MAF files, summarizes it in various ways and stores it as an MAF object.
suppressPackageStartupMessages(require(maftools))
#read TCGA maf file for LAML
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
laml = read.maf(maf = laml.maf, removeSilent = TRUE, useAll = FALSE)
Summarized MAF file is stored as an MAF object. MAF object contains main maf file, summarized data and an oncomatrix which is useful to plot oncoplots (aka waterfall plots). There are accessor methods to access the useful slots from MAF object. However, all slots can be accessed using @
, just like most of S4 objects.
#Typing laml shows basic summary of MAF file.
laml
## An object of class MAF
## ID summary Mean Median
## 1: NCBI_Build 37 NA NA
## 2: Center genome.wustl.edu NA NA
## 3: Samples 192 NA NA
## 4: nGenes 1241 NA NA
## 5: Frame_Shift_Del 52 0.27083333 0
## 6: Frame_Shift_Ins 91 0.47395833 0
## 7: In_Frame_Del 10 0.05208333 0
## 8: In_Frame_Ins 42 0.21875000 0
## 9: Missense_Mutation 1342 6.98958333 7
## 10: Nonsense_Mutation 103 0.53645833 0
## 11: Splice_Site 92 0.47916667 0
## 12: total 1732 9.02083333 9
#Shows sample summry.
getSampleSummary(laml)
## Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
## 1: TCGA.AB.3009 0 5 0
## 2: TCGA.AB.2807 1 0 1
## 3: TCGA.AB.2959 0 0 0
## 4: TCGA.AB.3002 0 0 0
## 5: TCGA.AB.2849 0 1 0
## ---
## 188: TCGA.AB.2933 0 0 0
## 189: TCGA.AB.2942 0 0 0
## 190: TCGA.AB.2946 0 0 0
## 191: TCGA.AB.2954 0 0 0
## 192: TCGA.AB.2982 0 0 0
## In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
## 1: 1 25 2 1 34
## 2: 0 16 3 4 25
## 3: 0 22 0 1 23
## 4: 0 15 1 5 21
## 5: 0 16 1 2 20
## ---
## 188: 0 1 0 0 1
## 189: 1 0 0 0 1
## 190: 0 1 0 0 1
## 191: 0 1 0 0 1
## 192: 0 1 0 0 1
#Shows frequently mutated genes.
getGeneSummary(laml)
## Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
## 1: FLT3 0 0 1
## 2: DNMT3A 4 0 0
## 3: NPM1 0 33 0
## 4: IDH2 0 0 0
## 5: IDH1 0 0 0
## ---
## 1237: ZNF689 0 0 0
## 1238: ZNF75D 0 0 0
## 1239: ZNF827 1 0 0
## 1240: ZNF99 0 0 0
## 1241: ZPBP 0 0 0
## In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
## 1: 33 15 0 3 52
## 2: 0 39 5 6 54
## 3: 0 1 0 0 34
## 4: 0 20 0 0 20
## 5: 0 18 0 0 18
## ---
## 1237: 0 1 0 0 1
## 1238: 0 1 0 0 1
## 1239: 0 0 0 0 1
## 1240: 0 1 0 0 1
## 1241: 0 1 0 0 1
## MutatedSamples
## 1: 52
## 2: 48
## 3: 33
## 4: 20
## 5: 18
## ---
## 1237: 1
## 1238: 1
## 1239: 1
## 1240: 1
## 1241: 1
#Shows all fields in MAF
getFields(laml)
## [1] "Hugo_Symbol" "Entrez_Gene_Id"
## [3] "Center" "NCBI_Build"
## [5] "Chromosome" "Start_Position"
## [7] "End_Position" "Strand"
## [9] "Variant_Classification" "Variant_Type"
## [11] "Reference_Allele" "Tumor_Seq_Allele1"
## [13] "Tumor_Seq_Allele2" "Tumor_Sample_Barcode"
## [15] "Protein_Change" "i_TumorVAF_WU"
## [17] "i_transcript_name"
#Writes maf summary to an output file with basename laml.
write.mafSummary(maf = laml, basename = 'laml')
We can use plotmafSummary
to plot the summary of the maf file, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification. We can add either mean or median line to the stacked barplot to display average/median number of variants across the cohort.
plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
Bettter representaion of maf file can be shown as oncoplots, also known as waterfall plots. Oncoplot function uses ComplexHeatmap to draw oncoplots2. To be specific, oncoplot
is a wrapper around ComplexHeatmap’s OncoPrint
function with little modification and automation which makes plotting easier. Side barplot and top barplots can be controlled by drawRowBar
and drawColBar
arguments respectivelly.
#We will draw oncoplots for top ten mutated genes. (Removing non-mutated samples from the plot for better visualization)
oncoplot(maf = laml, top = 10, removeNonMutated = TRUE)
NOTE: Variants annotated as Multi_Hit
are those genes which are mutated more than once in the same sample.
If we have copy number data along with MAF file, we can include them in oncoplot as to show if any genes are amplified or deleted. One most widely used tool for copy number analysis from large scale studies is GISTIC and we can simultaneously read gistic results along with MAF3. GISTIC generates numerous files but we need mainly three files all_lesions.conf_XX.txt
, amp_genes.conf_XX.txt
and del_genes.conf_XX.txt
, where XX is confidence level. These files contain significantly altered genomic regions along with amplified and delted genes respectively.
#read TCGA maf file for LAML
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
laml.plus.gistic = read.maf(maf = laml.maf, removeSilent = TRUE, useAll = FALSE, gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, isTCGA = TRUE)
#We will draw oncoplots for top ten mutated genes. (Removing non-mutated samples from the plot for better visualization)
oncoplot(maf = laml.plus.gistic, top = 10, removeNonMutated = TRUE, sortByMutation = TRUE)
This plot shows frequent deletions in TP53 gene which is located on one of the significanlty deleted locus 17p13.2.
It is often the case that we include meta data to show sample characterstics such as gender, treatment, etc. We can include such meta data by passing them to annotation
argument of oncoplot
. We can also change colors for Variant_Classification by providing a named vector of colors.
#Read FAB classification of TCGA LAML barcodes.
laml.fab.anno = system.file('extdata', 'tcga_laml_fab_annotation.txt', package = 'maftools')
laml.fab.anno = read.delim(laml.fab.anno, sep = '\t')
head(laml.fab.anno)
## Tumor_Sample_Barcode FAB_classification
## 1 TCGA-AB-2802 M4
## 2 TCGA-AB-2803 M3
## 3 TCGA-AB-2804 M3
## 4 TCGA-AB-2805 M0
## 5 TCGA-AB-2806 M1
## 6 TCGA-AB-2807 M1
#Changing colors (You can use any colors, here in this example we will use a color palette from RColorBrewer)
col = RColorBrewer::brewer.pal(n = 8, name = 'Paired')
names(col) = c('Frame_Shift_Del','Missense_Mutation', 'Nonsense_Mutation', 'Multi_Hit', 'Frame_Shift_Ins',
'In_Frame_Ins', 'Splice_Site', 'In_Frame_Del')
#We will plot same top ten mutated genes with FAB classification as annotation and using above defined colors.
oncoplot(maf = laml, top = 10, annotation = laml.fab.anno, removeNonMutated = TRUE, colors = col)
We can visualize any set of genes using oncostrip
function, which draws mutations in each sample similar to OncoPrinter tool on cBioPortal. oncostrip
can be used to draw any number of genes using top
or genes
arguments.
oncostrip(maf = laml, genes = c('DNMT3A','NPM1', 'RUNX1'), removeNonMutated = TRUE, showTumorSampleBarcodes = FALSE)
titv
function classifies SNPs into Transitions and Transversions and returns a list of summarized tables in various ways. Summarized data can also be visulaized as a boxplot showing overall distribution of six different conversions and as a stacked barplot showing fraction of conversions in each sample.
laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
#plot titv summary
plotTiTv(res = laml.titv)
Lollipop plots are simple and most effective way showing mutation spots on protein structure. Many oncogenes have a preferential sites which are mutated more often than any other locus. These spots are considered to be mutational hotspots and lollipop plots can be used to display them along with rest of the mutations. We can draw such plots using the function lollipopPlot
. This fuction requires us to have amino acid changes information in the maf file. However MAF files have no clear guidelines on naming the field for amino acid changes, with different studies having different field (or column) names for amino acid changes. By default, lollipopPlot
looks for column AAChange
, and if its not found in the MAF file, it prints all availble fields with a warning message. For below example, MAF file contains amino acid changes under a field/column name ‘Protein_Change’. We will manually specify this using argumnet AACol
. This function also returns the plot as a ggplot object, which user can later modify if needed.
#Lets plot lollipop plot for DNMT3A, which is one of the most frequent mutated gene in Leukemia.
dnmt3a.lpop = lollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', showMutationRate = TRUE, domainLabelSize = 3, defaultYaxis = FALSE)
## 3 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
## HGNC refseq.ID protein.ID aa.length
## 1: DNMT3A NM_175629 NP_783328 912
## 2: DNMT3A NM_022552 NP_072046 912
## 3: DNMT3A NM_153759 NP_715640 723
## Using longer transcript NM_175629 for now.
## Removed 3 mutations for which AA position was not available
Note that lollipopPlot
warns user on availability of diferent transcripts for the given gene. If we know the transcript id before hand, we can specify it as refSeqID
or proteinID
. By default lollipopPlot uses the longer isoform.
We can also label points on the lollipopPlot
using argument labelPos
. If labelPos
is set to ‘all’, all of the points are highlighted. Sometimes, many mutations are clustered within a range of few amino acid positons. In that case we can use repel
option which tries to repel points for clear representation.
#Lets plot mutations on KIT gene, without repel option.
kit.lpop = lollipopPlot(maf = laml, gene = 'KIT', AACol = 'Protein_Change', labelPos = c(416, 418), refSeqID = 'NM_000222', domainLabelSize = 3)
#Same plot with repel=TRUE
kit.lpop = lollipopPlot(maf = laml, gene = 'KIT', AACol = 'Protein_Change', labelPos = c(416, 418), refSeqID = 'NM_000222', repel = TRUE, domainLabelSize = 3)
MutationMapper on cBioPortal collapses Variant Classfications into truncating and others. It also includes somatic mutation rate.
laml.dnmt3a = lollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', refSeqID = 'NM_175629', labelPos = 882, collapsePosLabel = TRUE, cBioPortal = TRUE, domainLabelSize = 3, defaultYaxis = FALSE)
Many cancer genomic studies involve copy number data generated either from sequencing or SNP chip arrays. Copy number analysis provides us a lot of information from genome wide copy number aberations to tumor purity. Most of the time copy number data is stored as segments, with each segment represented by a log ratio of copy number changes compared to matched normals. There are many segmentation algorithms, with the most popular being Circular Binary Segmentation implemented in DNACopy bioconductor 4. We can plot such segmented copy number data and map all mutations on to it. It provides a quick way of knowing which variants (in a way which genes) are located on copy number altered genomic regions.
tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")
plotCBSsegments(cbsFile = tcga.ab.009.seg, maf = laml, labelAll = TRUE)
## Warning: Ignoring unknown aesthetics: label
## Warning: Removed 30 rows containing missing values (geom_segment).
Above plot shows two genes NF1 and SUZ12 are located on a region which has a shallow deletion. Later we will also see how these variants on copy number altered regions affect variant allele frequencies and tumor heterogeneity estimation.
Cancer genomes, especially solid tumors are characterized by genomic loci with localized hypermutations 5. Such hyper mutated genomic regions can be visualized by plotting inter variant distance on a linear genomic scale. These plots generally called rainfall plots and we can draw such plots using rainfallPlot
. If detectChangePoints
is set to TRUE, rainfall
plot also highlights regions where potential changes in inter-event distances are located. But please be aware that detected change-points are only loci where the distribution of inter-event distance changes. Segments may have to be manually inferred by adjacent change-points. This will be improved in future updates.
coad <- system.file("extdata", "coad.maf.gz", package = "maftools")
coad = read.maf(maf = coad)
coad.rf = rainfallPlot(maf = coad, detectChangePoints = TRUE, fontSize = 12, pointSize = 0.6)
## Change points detected at:
## Chromosome Position
## 1: 2 179108701
## 2: 2 179682996
## 3: 5 140161768
## 4: 5 140998034
## 5: 19 8907491
## 6: 19 9065844
## 7: 20 3630019
TCGA contains over 30 different cancer cohrts and median mutation load across them varies from as low as 7 per exome (Pheochromocytoma and Paraganglioma arising from Adrenal Gland) to as high as 315 per exome (Skin Cutaneoys Melanoma). It is informative to see how mutation load in given maf stands against TCGA cohorts. This can can be achieved with the fuction tcgaComapre
which draws distribution of variants compiled from over 10,000 WXS samples across 33 TCGA landmark cohorts. Plot generated is similar to the one described in Alexandrov et al 5.
laml.mutload = tcgaCompare(maf = laml, cohortName = 'Example-LAML')
## Summary..
## Cohort Cohort_Size Median_Mutations
## 1: SKCM 468 315.0
## 2: LUSC 494 187.5
## 3: LUAD 567 158.0
## 4: BLCA 412 131.5
## 5: ESCA 184 85.0
## 6: HNSC 509 82.0
## 7: STAD 439 82.0
## 8: DLBC 48 81.5
## 9: UCEC 542 78.0
## 10: COAD 433 76.0
## 11: OV 443 72.0
## 12: LIHC 374 67.0
## 13: CESC 305 66.0
## 14: READ 158 63.0
## 15: KIRP 288 53.0
## 16: KIRC 339 44.0
## 17: UCS 57 35.0
## 18: BRCA 1044 34.0
## 19: GBM 395 32.0
## 20: SARC 255 31.0
## 21: CHOL 51 30.0
## 22: MESO 83 25.0
## 23: PAAD 178 22.5
## 24: ACC 92 21.5
## 25: LGG 511 19.0
## 26: PRAD 496 19.0
## 27: KICH 66 13.0
## 28: TGCT 149 11.0
## 29: THYM 123 11.0
## 30: LAML 143 10.0
## 31: UVM 80 10.0
## 32: THCA 491 9.0
## 33: Example-LAML 192 9.0
## 34: PCPG 177 7.0
## Cohort Cohort_Size Median_Mutations
We can plot word cloud plot for mutated genes with the function geneCloud
. Size of each gene is proportional to the total number of samples in which it is mutated/altered.
geneCloud(input = laml, minMut = 3)
This function plots Variant Allele Frequecies as a boxplot which quickly helps to estimate clonal status of top mutated genes (clonal genes usually have mean allele frequncy around ~50% assuming pure sample)
vafPlot = plotVaf(maf = laml, vafCol = 'i_TumorVAF_WU', flip = TRUE)
We can summarize output files generated by GISTIC programme. As mentioned earlier, we need three files that were generated by GISTIC, i.e, all_lesions.conf_XX.txt, amp_genes.conf_XX.txt and del_genes.conf_XX.txt, where XX is the confidence level. See GISTIC documentation for details.
all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
laml.gistic = readGistic(gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, isTCGA = TRUE)
## Processing Gistic files..
## Processing Amplified Genes..
## Processing Deleted Genes..
## Summarizing samples..
Similar to MAF objects, there are methods available to access slots of GISTIC object - getSampleSummary
, getGeneSummary
and getCytoBandSummary
. Summarized results can be written to output files using function write.GisticSummary
.
Similar to oncoplots we can draw copy number data.
gisticPlot(gistic = laml.gistic)
A bubble plot to display summarized gistic results.
gr = plotGisticResults(gistic = laml.gistic)
Many disease causing genes in cancer show strong exlcusiveness in their mutation pattern. Such mutually exlcuive set of genes can be detected using mutExclusive
function, which performs an exact test to detect such significant pair of genes. mutExclusive
uses comet_exact_test
on a given set of genes to calculate significance value. Please cite CoMET article if you use this function 6.
#We will run mutExclusive on top 10 mutated genes.
laml.mut.excl = mutExclusive(maf = laml, top = 10)
head(laml.mut.excl)
## n.00 n.01 n.10 n.11 gene1 gene2 pval
## 1 125 15 52 0 FLT3 TP53 0.00352781328800287
## 2 139 20 33 0 NPM1 IDH2 0.0092131137152039
## 3 143 16 33 0 NPM1 RUNX1 0.0213098782919558
## 4 125 15 51 1 FLT3 RUNX1 0.021565502698412
## 5 144 15 33 0 NPM1 TP53 0.0261933920671912
## 6 129 15 47 1 DNMT3A RUNX1 0.0319088921944485
We can visualize the above results using oncostrip
. For example in above mutExclusive
analysis, we can see many genes show exclusiveness. For example NPM1 and RUNX1 show a strong exclusiveness with a p-value of 0.02.
oncostrip(maf = laml, genes = c('NPM1', 'RUNX1'), sort = TRUE, removeNonMutated = TRUE)
maftools has a function oncodrive
which identifies cancer genes (driver) from a given MAF. oncodrive
is a based on algorithm oncodriveCLUST which was originally implemented in Python. Concept is based on the fact that most of the variants in cancer causing genes are enriched at few specific loci (aka hotspots). This method takes advantage of such positions to identify cancer genes. If you use this function, please cite OncodriveCLUST article 7.
laml.sig = oncodrive(maf = laml, AACol = 'Protein_Change', minMut = 5, pvalMethod = 'zscore')
head(laml.sig)
We can plot the results using plotOncodrive
.
plotOncodrive(res = laml.sig, fdrCutOff = 0.1, useFraction = TRUE)
plotOncodrive
plots the results as scatter plot with size of the points proportional to the number of clusters found in the gene. X-axis shows number of mutations (or fraction of mutations) observed in these clusters. In the above example, IDH1 has a single cluster and all of the 18 mutations are accumulated within that cluster, giving it a cluster score of one. For details on oncodrive algorithm, please refer to OncodriveCLUST article 7.
maftools comes with the function pfamDomains
, which adds pfam domain information to the amino acid changes. pfamDomain
also summarizes amino acid changes accoriding to the domains that are affected. This serves the puposes of knowing what domain in given cancer cohort, is most frequently affected. This function is inspired from Pfam annotation modulce from MuSic tool 8.
laml.pfam = pfamDomains(maf = laml, AACol = 'Protein_Change', top = 10)
## Removed 50 mutations for which AA position was not available
#Protein summary (Printing first 7 columns for display convenience)
laml.pfam$proteinSummary[,1:7, with = FALSE]
## HGNC AAPos Variant_Classification N total fraction DomainLabel
## 1: DNMT3A 882 Missense_Mutation 27 54 0.5000000 AdoMet_MTases
## 2: IDH1 132 Missense_Mutation 18 18 1.0000000 PTZ00435
## 3: IDH2 140 Missense_Mutation 17 20 0.8500000 PTZ00435
## 4: FLT3 835 Missense_Mutation 14 52 0.2692308 PKc_like
## 5: FLT3 599 In_Frame_Ins 10 52 0.1923077 PKc_like
## ---
## 1512: ZNF646 875 Missense_Mutation 1 1 1.0000000 NA
## 1513: ZNF687 554 Missense_Mutation 1 2 0.5000000 NA
## 1514: ZNF687 363 Missense_Mutation 1 2 0.5000000 NA
## 1515: ZNF75D 5 Missense_Mutation 1 1 1.0000000 NA
## 1516: ZNF827 427 Frame_Shift_Del 1 1 1.0000000 NA
#Domain summary (Printing first 3 columns for display convenience)
laml.pfam$domainSummary[,1:3, with = FALSE]
## DomainLabel nMuts nGenes
## 1: PKc_like 55 5
## 2: PTZ00435 38 2
## 3: AdoMet_MTases 33 1
## 4: 7tm_1 24 24
## 5: COG5048 17 17
## ---
## 499: ribokinase 1 1
## 500: rim_protein 1 1
## 501: sigpep_I_bact 1 1
## 502: trp 1 1
## 503: zf-BED 1 1
Above plot and results shows AdoMet_MTases domain is frequently mutated, but number genes with this domain is just one (DNMT3A) compared to other domains such as 7tm_1 domain, which is mutated across 24 different genes. This shows the importance of mutations in methyl transfer domains in Leukemia.
Lawrence et al performed MutSigCV analysis on 21 cancer cohorts and identified over 200 genes to be significantly mutated which consists of previously undescribed novel genes 9. Their results show only few genes are mutated in multiple cohort while many of them are tissue/cohort specific. We can compare mutSig results against this pan-can list of significantly mutated genes to see genes specifically mutated in given cohort. This function requires MutSigCV results (usually named sig_genes.txt) as an input.
#MutsigCV results for TCGA-AML
laml.mutsig <- system.file("extdata", "LAML_sig_genes.txt.gz", package = "maftools")
laml.pancan = pancanComparision(mutsigResults = laml.mutsig, qval = 0.1, cohortName = 'LAML', inputSampleSize = 200, label = 1, normSampleSize = TRUE)
## Significantly mutated genes in LAML (q < 0.1): 23
## Significantly mutated genes in PanCan cohort (q <0.1): 114
## Significantly mutated genes exclusive to LAML (q < 0.1):
## gene pancan q nMut
## 1: CEBPA 1.000 3.500301e-12 13
## 2: EZH2 1.000 7.463546e-05 3
## 3: GIGYF2 1.000 6.378338e-03 2
## 4: KIT 0.509 1.137517e-05 8
## 5: PHF6 0.783 6.457555e-09 6
## 6: PTPN11 0.286 7.664584e-03 9
## 7: RAD21 0.929 1.137517e-05 5
## 8: SMC1A 0.801 2.961696e-03 6
## 9: TET2 0.907 2.281625e-13 17
## 10: WT1 1.000 2.281625e-13 12
Above results show genes such as CEBPa, TET2 and WT1 are specifically mutated in Leukemia.
Survival analysis is an essential part of cohort based sequencing projects. Function mafSurvive
performs survival analysis and draws kaplan meier curve by grouping samples based on mutation status of user defined gene(s). This function requires input data to contain Tumor_Sample_Barcode (make sure they match to those in MAF file), binary event (canbe Dead/Alive, Relapse/Remission, etc) and time to event. Below example for TCGA LAML cohort consists of overall survival status of patients (Dead = 1, Alive = 0) and time till last followup.
laml.surv <- system.file("extdata", "laml_survival.tsv", package = "maftools")
laml.surv = read.delim(file = laml.surv, header = TRUE, stringsAsFactors = FALSE)
head(laml.surv)
## Tumor_Sample_Barcode days_to_last_followup Overall_Survival_Status
## 1 TCGA-AB-2802 365 1
## 2 TCGA-AB-2803 792 1
## 3 TCGA-AB-2804 2557 0
## 4 TCGA-AB-2805 577 1
## 5 TCGA-AB-2806 945 1
## 6 TCGA-AB-2807 181 1
#Survival analysis based on grouping of DNMT3A mutation status
laml.dnmt3a.surv = mafSurvival(maf = laml, clinicalData = laml.surv, genes = 'DNMT3A', time = 'days_to_last_followup', Status = 'Overall_Survival_Status', showConfInt = TRUE)
## Number of mutated samples for given genes:
## DNMT3A
## 48
## Group medianTime N
## 1: Mutant 243 48
## 2: WT 365 152
Cancers differ from each other in terms of their mutation pattern. We can compare two different cohorts to detect such differentally mutated genes. For example, recent article by Madan et. al 9, have shown that patients with relapsed APL (Acute Promyelocytic Leukemia) tends to have mutations in PML and RARA genes, which were absent during primary stage of the disease. This difference between two cohorts (in this case primary and relapse APL) can be detected using function mafComapre
, which performs fisher test on all genes between two cohorts to detect differentally mutated genes.
#Primary APL MAF
primary.apl = system.file("extdata", "APL_primary.maf.gz", package = "maftools")
primary.apl = read.maf(maf = primary.apl)
#Relapse APL MAF
relapse.apl = system.file("extdata", "APL_relapse.maf.gz", package = "maftools")
relapse.apl = read.maf(maf = relapse.apl)
#We will consider only genes which are mutated in at-least in 5 samples in one of the cohort, to avoid bias due to single mutated genes.
pt.vs.rt <- mafCompare(m1 = primary.apl, m2 = relapse.apl, m1Name = 'PrimaryAPL', m2Name = 'RelapseAPL', minMut = 5)
print(pt.vs.rt)
## $results
## Hugo_Symbol PrimaryAPL RelapseAPL pval or ci.up
## 1: PML 1 11 1.529935e-05 0.03537381 0.000806034
## 2: RARA 0 7 2.574810e-04 0.00000000 0.000000000
## 3: RUNX1 1 5 1.310500e-02 0.08740567 0.001813280
## 4: FLT3 26 4 1.812779e-02 3.56086275 1.149009169
## 5: ARID1B 5 8 2.758396e-02 0.26480490 0.064804160
## 6: WT1 20 14 2.229087e-01 0.60619329 0.263440988
## 7: KRAS 6 1 4.334067e-01 2.88486293 0.337679367
## 8: NRAS 15 4 4.353567e-01 1.85209500 0.553883512
## 9: FLT3_ITD 45 19 7.395415e-01 1.16822368 0.578323868
## 10: ARID1A 7 4 7.457274e-01 0.80869223 0.195710173
## ci.low
## 1: 0.2552937
## 2: 0.3006159
## 3: 0.8076265
## 4: 14.7701728
## 5: 0.9698686
## 6: 1.4223101
## 7: 135.5393108
## 8: 8.0373994
## 9: 2.4084408
## 10: 3.9297309
##
## $SampleSummary
## Cohort SampleSize
## 1: PrimaryAPL 124
## 2: RelapseAPL 58
Above resutls show two genes PML and RARA which are highly mutated in Relapse APL compared to Primary APL. We can visulaize these results as a forestplot.
apl.pt.vs.rt.fp = forestPlot(mafCompareRes = pt.vs.rt, pVal = 0.05, show = 'stat', color = c('royalblue', 'maroon'))
Another alternative way of displaying above rsults is by plotiing two oncoplots side by side. coOncoplot
function takes two maf objects and plots them side by side for better comparision.
genes = c("PML", "RARA", "RUNX1", "ARID1B", "FLT3")
coOncoplot(m1 = primary.apl, m2 = relapse.apl, m1Name = 'PrimaryAPL', m2Name = 'RelapseAPL', genes = genes, removeNonMutated = TRUE)
Tumors are generally heterogenous i.e, consist of multiple clones. This heterogenity can be inferred by clustering variant allele frequencies. inferHeterogeneity
function uses vaf information to cluster variants (using mclust
), to infer clonality. By default, inferHeterogeneity
function looks for column t_vaf containing vaf information. However, if the field name is different from t_vaf, we can manually specify it using argument vafCol
. For example, in this case study vaf is stored under the field name i_TumorVAF_WU.
#We will run this for sample TCGA.AB.2972
tcga.ab.2972.het = inferHeterogeneity(maf = laml, tsb = 'TCGA.AB.2972', vafCol = 'i_TumorVAF_WU')
## Processing TCGA.AB.2972..
print(tcga.ab.2972.het$clusterMeans)
## Tumor_Sample_Barcode cluster meanVaf
## 1: TCGA.AB.2972 2 0.4496571
## 2: TCGA.AB.2972 1 0.2454750
## 3: TCGA.AB.2972 outlier 0.3695000
#Visualizing results
plotClusters(clusters = tcga.ab.2972.het)
Above figure shows clear seperation of two clones clustered at mean variant allele frequencies of ~45% (major clone) and another minor clone at variant allele frequency of ~25%.
We can use copy number information to ignore variants located on copynumber altered regions. Copy number alterations results in abnormally high/low variant allele frequencies, which tends to affect clustering. Removing such variants improves clustering and density estimation while retaining biologically meaningful results. Copy number information can be provided as a segmented file generated from segmentation programmes, such as Circular Binary Segmentation from DNACopy Bioconductor package 6.
seg = system.file('extdata', 'TCGA.AB.3009.hg19.seg.txt', package = 'maftools')
tcga.ab.3009.het = inferHeterogeneity(maf = laml, tsb = 'TCGA.AB.3009', segFile = seg, vafCol = 'i_TumorVAF_WU')
## Copy number data found for samples:
## [1] "TCGA.AB.3009"
## Processing TCGA.AB.3009..
## Removed 1 variants with no copy number data.
## Hugo_Symbol Chromosome Start_Position End_Position Tumor_Sample_Barcode
## 1: PHF6 23 133551224 133551224 TCGA.AB.3009
## t_vaf Segment_Start Segment_End Segment_Mean CN
## 1: 0.9349112 NA NA NA NA
## Copy number altered variants:
## Hugo_Symbol Chromosome Start_Position End_Position Tumor_Sample_Barcode
## 1: NF1 17 29562981 29562981 TCGA.AB.3009
## 2: SUZ12 17 30293198 30293198 TCGA.AB.3009
## t_vaf Segment_Start Segment_End Segment_Mean CN cluster
## 1: 0.8419000 29054355 30363868 -0.9157 1.060173 CN_altered
## 2: 0.8958333 29054355 30363868 -0.9157 1.060173 CN_altered
#Visualizing results. Highlighting those variants on copynumber altered variants.
plotClusters(clusters = tcga.ab.3009.het, genes = 'CN_altered', showCNvars = TRUE)
Above figure shows two genes NF1 and SUZ12 with high VAF’s, which is due to copy number alteraions (deletion). Those two genes are ignored from analysis.
Although clustering of variant allele frequencies gives us a fair idea on heterogeneity, it is also possible to measure the extent of heterogeneity in terms of a numerical value. MATH score is a simple quantitative measure of intra-tumor heterogeneity, which calculates the width of the vaf distribution. Higher MATH scores are found to be associated with poor outcome. MATH score can also be used a proxy variable for survival analysis 10.
#we will specify for random 4 patients.
laml.math = math.score(maf = laml, vafCol = 'i_TumorVAF_WU',
sampleName = c('TCGA.AB.3009', 'TCGA.AB.2849', 'TCGA.AB.3002', 'TCGA.AB.2972'))
print(laml.math)
## Tumor_Sample_Barcode MATH MedianAbsoluteDeviation Frame_Shift_Del
## 1: TCGA.AB.2849 20.588348 5.170000 0
## 2: TCGA.AB.2972 11.621572 3.420000 0
## 3: TCGA.AB.3002 6.991207 2.104062 0
## 4: TCGA.AB.3009 9.045040 2.789054 0
## Frame_Shift_Ins In_Frame_Del In_Frame_Ins Missense_Mutation
## 1: 1 0 0 16
## 2: 1 0 0 16
## 3: 0 0 0 15
## 4: 5 0 1 25
## Nonsense_Mutation Splice_Site total
## 1: 1 2 20
## 2: 2 1 20
## 3: 1 5 21
## 4: 2 1 34
From the above results, sample TCGA.AB.2849 has highest of MATH score (20.58) compared to rest of the three samples. It is also evident from the density plot, that vaf distribution is wider for this sample, whereas rest of three samples have sharp peaks with relatively low MATH scores, suggesting more homogeneity and lesser heterogeneity.
Every cancer, as it progresses leaves a signature characterised by specific pattern of nucleotide substitutions. Alexandrov et.al have shown such mutational signatures, derived from over 7000 cancer samples 5. Such signatures can be extracted by decomposiong matrix of nucleotide substitutions, classified into 96 substitution classes based on immediate bases sorrouding the mutated base. Extracted signatures can also be compared to those validated signatures.
First step in signature analysis is to obtain the adjacent bases sorounding the mutated base and form a mutation matrix. NOTE: Eventhough reading fasta and extracting bases is fairly fast, it is a memory consuming process as it occupies ~3gb of memory while running.
#First we extract adjacent bases to the mutated locus and clssify them into 96 substitution classes. This also estimates APOBEC enrichment per sample.
laml.tnm = trinucleotideMatrix(maf = laml, ref_genome = '/path/to/hg19.fa',
prefix = 'chr', add = TRUE, ignoreChr = 'chr23', useSyn = TRUE)
# reading /path/to/hg19.fa (this might take few minutes)..
# Extracting 5' and 3' adjacent bases..
# Extracting +/- 20bp around mutated bases for background estimation..
# Estimating APOBEC enrichment scores..
# Performing one-way Fisher's test for APOBEC enrichment..
# APOBEC related mutations are enriched in 2.674% of samples (APOBEC enrichment score > 2 ; 5 of 187 samples)
# Creating mutation matrix..
# matrix of dimension 193x96
Above function performs two steps:
APOBEC induced mutations are more frequent in solid tumors and are mainly associated with C>T transition events occuring in TCW motif. APOBEC enrichment scores in the above command are estimated using the method described by Roberts et al 12. Briefly, enrichment of C>T mutations occuring within TCW motif over all of the C>T mutations in a given sample is compared to background Cytosines and TCWs occuring within 20bp of mutated bases.
\[\frac{n_{tCw} * background_C}{n_C * background_{TCW}}\]
One-sided fishers exact test is also performed to statsitsically evaluate the enrichment score, as described in original study by Roberts et al.
plotSignatures(laml.tnm)
extractSignatures
uses non-negative matrix factorization to decompose nx96 dimesion matrix into r signatures, where n is the number of samples from input MAF 11. By default function runs nmf on 6 ranks and chooses the best possible value based on maximum cophenetic-correlation coefficients. It is also possible to manually specify r. Once decomposed, signatures are compared against known signatures derived from Alexandrov et.al, and cosine similarity is calculated to identify best match.
#Run main function with maximum 6 signatures.
require('NMF')
laml.sign = extractSignatures(mat = laml.tnm, nTry = 6, plotBestFitRes = FALSE)
# Warning : Found zero mutations for conversions A[T>G]C
# Comparing against experimentally validated 30 signatures.. (See http://cancer.sanger.ac.uk/cosmic/signatures for details.)
# Found Signature_1 most similar to validated Signature_1. CoSine-Similarity: 0.778739082321156
# Found Signature_2 most similar to validated Signature_1. CoSine-Similarity: 0.782748375695661
plotSignatures(laml.sign)
extractSignatures
gives a warning that no mutations are found for class A[T>G]C conversions. This is possible when the number of samples are low or in tumors with low mutation rate, such as in this case of Leukemia. In this scenario, a small positive value is added to avoid computational difficulties. It also prints other statistics for range of values that was tried, and chooses the rank with highest cophenetic metric (for above example r=2). Above stats should give an estimate of range of best possible r values and in case the chosen r is overestimating, it is also possible to be re-run extractSignatures
by manually specifying r.
Once decomposed, signatures are compared against known and validated signatures from Sanger 11. See here for list of validated signatures. In the above exaple, 2 signatures are derived, which are similar to validated Signature-1. Signature_1 is a result of elevated rate of spontaneous deamination of 5-methyl-cytosine, resulting in C>T transitions and predominantly occurs at NpCpG trinucleotide, which is a most common process in AML 12.
Full table of cosine similarities against validated signatures are also returned, which can be further analysed. Below plot shows comparision of similarities of detected signatures against validated signatures.
require('corrplot')
## Loading required package: corrplot
corrplot::corrplot(corr = laml.sign$coSineSimMat, col = RColorBrewer::brewer.pal(n = 9, name = 'Oranges'), is.corr = FALSE, tl.cex = 0.6, tl.col = 'black', cl.cex = 0.6)
NOTE: Should you recieve an error while running extractSignatures
complaining none of the packages are loaded
, please manually load the NMF
library and re-run.
We can also annotate variants using oncotator API 13. oncotate
function quires oncotator web api to annotate given set of variants and converts them into MAF format. Input should be a five column file with chr, start, end, ref_allele, alt_allele. However, it can conatain other information such as sample names (Tumor_Sample_Barcode), read counts, vaf information and so on, but only first five columns will be used, rest of the columns will be attached at the end of the table.
var.file = system.file('extdata', 'variants.tsv', package = 'maftools')
#This is what input looks like
var = read.delim(var.file, sep = '\t')
head(var)
## chromsome start end ref alt Tumor_Sample_Barcode
## 1 chr4 55589774 55589774 A G fake_1
## 2 chr4 55599321 55599321 A T fake_2
## 3 chr4 55599332 55599332 G T fake_3
## 4 chr4 55599320 55599320 G T fake_4
## 5 chr15 41961117 41961123 TGGCTAA - fake_4
## 6 chr4 55599320 55599320 G T fake_5
#Annotate
var.maf = oncotate(maflite = var.file, header = TRUE)
#Results from oncotate. First 20 columns.
var.maf[1:10, 1:20, with = FALSE]
NOTE: This is quite time consuming if input is big.
Annovar is one of the most widely used Variant Annotation tool in Genomics 14. Annovar output is generally in a tabular format with various annotation columns. This function converts such annovar output files into MAF. This function requires that annovar was run with gene based annotation as a first operation, before including any filter or region based annotations.
e.g, table_annovar.pl example/ex1.avinput humandb/ -buildver hg19 -out myanno -remove -protocol (refGene),cytoBand,dbnsfp30a -operation (g),r,f -nastring NA
annovarToMaf
mainly uses gene based annotations for processing, rest of the annotation columns from input file will be attached to the end of the resulting MAF.
As an example we will annotate the same file which was used above to run oncotate
function. We will annotate it using annovar with the following command. For simplicity, here we are including only gene based annotations but one can include as many annotations as they wish. But make sure the fist operation is always gene based annotation.
$perl table_annovar.pl variants.tsv ~/path/to/humandb/ -buildver hg19
-out variants --otherinfo -remove -protocol ensGene -operation g -nastring NA
Output generated is stored as a part of this package. We can convert this annovar output into MAF using annovarToMaf
.
var.annovar = system.file("extdata", "variants.hg19_multianno.txt", package = "maftools")
var.annovar.maf = annovarToMaf(annovar = var.annovar, Center = 'CSI-NUS', refBuild = 'hg19',
tsbCol = 'Tumor_Sample_Barcode', table = 'ensGene')
## Converting Ensemble Gene IDs into HGNC gene symbols.
## Done! Original ensemble gene IDs are preserved under field name ens_id
## Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome
## 1: KIT NA CSI-NUS hg19 chr4
## 2: KIT NA CSI-NUS hg19 chr4
## 3: KIT NA CSI-NUS hg19 chr4
## 4: KIT NA CSI-NUS hg19 chr4
## 5: KIT NA CSI-NUS hg19 chr4
## 6: KIT NA CSI-NUS hg19 chr4
## 7: KIT NA CSI-NUS hg19 chr4
## 8: MGA NA CSI-NUS hg19 chr15
## 9: MGA NA CSI-NUS hg19 chr15
## 10: MGA NA CSI-NUS hg19 chr15
## Start_Position End_Position Strand Variant_Classification Variant_Type
## 1: 55589774 55589774 + Missense_Mutation SNP
## 2: 55599321 55599321 + Missense_Mutation SNP
## 3: 55599332 55599332 + Missense_Mutation SNP
## 4: 55599320 55599320 + Missense_Mutation SNP
## 5: 55599320 55599320 + Missense_Mutation SNP
## 6: 55599321 55599321 + Missense_Mutation SNP
## 7: 55599320 55599320 + Missense_Mutation SNP
## 8: 41989106 41989106 + Frame_Shift_Ins INS
## 9: 41961117 41961123 + Frame_Shift_Del DEL
## 10: 41989106 41989106 + Frame_Shift_Ins INS
## Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS
## 1: A A G NA
## 2: A A T NA
## 3: G G T NA
## 4: G G T NA
## 5: G G T NA
## 6: A A T NA
## 7: G G C NA
## 8: - - TAAAGGGC NA
## 9: TGGCTAA TGGCTAA - NA
## 10: - - TAAAGGGC NA
## Tumor_Sample_Barcode Mutation_Status AAChange Transcript_Id
## 1: fake_1 Somatic p.D419G ENST00000412167
## 2: fake_2 Somatic p.D812V ENST00000412167
## 3: fake_3 Somatic p.D816Y ENST00000412167
## 4: fake_4 Somatic p.D812Y ENST00000412167
## 5: fake_5 Somatic p.D812Y ENST00000412167
## 6: fake_6 Somatic p.D812V ENST00000412167
## 7: fake_7 Somatic p.D812H ENST00000412167
## 8: fake_7 Somatic p.G633fs ENST00000566718
## 9: fake_4 Somatic p.L9fs ENST00000566718
## 10: fake_5 Somatic p.G633fs ENST00000566718
## TxChange GeneDetail.refGene hgnc_symbol Entrez
## 1: c.A1256G NA KIT NA
## 2: c.A2435T NA KIT NA
## 3: c.G2446T NA KIT NA
## 4: c.G2434T NA KIT NA
## 5: c.G2434T NA KIT NA
## 6: c.A2435T NA KIT NA
## 7: c.G2434C NA KIT NA
## 8: c.1898_1899insTAAAGGGC NA MGA NA
## 9: c.25_31del NA MGA NA
## 10: c.1898_1899insTAAAGGGC NA MGA NA
## ens_id
## 1: ENSG00000157404
## 2: ENSG00000157404
## 3: ENSG00000157404
## 4: ENSG00000157404
## 5: ENSG00000157404
## 6: ENSG00000157404
## 7: ENSG00000157404
## 8: ENSG00000174197
## 9: ENSG00000174197
## 10: ENSG00000174197
Annovar, when used with Ensemble as a gene annotation source, uses ensemble gene IDs as Gene names. In that case, use annovarToMaf
with argument table
set to ensGene
which converts ensemble gene IDs into HGNC symbols.
Just like TCGA, International Cancer Genome Consortium ICGC also makes its data publically available. But the data are stored in Simpale Somatic Mutation Format, which is similar to MAF format in its structure. However field names and classification of variants is different from that of MAF. icgcSimpleMutationToMAF
is a function which reads ICGC data and converts them to MAF.
#Read sample ICGC data for ESCA
esca.icgc <- system.file("extdata", "simple_somatic_mutation.open.ESCA-CN.sample.tsv.gz", package = "maftools")
esca.maf <- icgcSimpleMutationToMAF(icgc = esca.icgc, addHugoSymbol = TRUE)
## Converting Ensemble Gene IDs into HGNC gene symbols.
## Done! Original ensemble gene IDs are preserved under field name ens_id
## NOTE: Removed 427 duplicated variants
#Printing first 16 columns for display convenience.
print(esca.maf[1:5,1:16, with = FALSE])
## Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
## 1: AC005237.4 NA NA GRCh37 2 241987787
## 2: AC061992.1 786 NA GRCh37 17 76425382
## 3: AC097467.2 NA NA GRCh37 4 156294567
## 4: ADAMTS12 NA NA GRCh37 5 33684019
## 5: AL589642.1 54801 NA GRCh37 9 32630154
## End_Position Strand Variant_Classification Variant_Type
## 1: 241987787 + Intron SNP
## 2: 76425382 + 3'Flank SNP
## 3: 156294567 + Intron SNP
## 4: 33684019 + Missense_Mutation SNP
## 5: 32630154 + RNA SNP
## Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS
## 1: C C T NA
## 2: C C T NA
## 3: A A G NA
## 4: A A C NA
## 5: T T C NA
## dbSNP_Val_Status Tumor_Sample_Barcode
## 1: NA SA514619
## 2: NA SA514619
## 3: NA SA514619
## 4: NA SA514619
## 5: NA SA514619
Note that by default Simple Somatic Mutation format contains all affected transcripts of a variant resuting in multiple entries of the same variant in same sample. It is hard to choose a single affected transcript based on annotations alone and by default this program removes repeated variants as duplicated entries. If you wish to keep all of them, set removeDuplicatedVariants
to FALSE. Another option is to convert input file to MAF by removing duplicates and then use scripts like vcf2maf to re-annotate and prioritize affected transcripts.
MutSig/MutSigCV is most widely used program for detecting driver genes 16. However, we have observed that covariates files (gene.covariates.txt and exome_full192.coverage.txt) which are bundled with MutSig have non-standard gene names (non Hugo_Symbols). This discrepancy between Hugo_Symbols in MAF and non-Hugo_symbols in covariates file causes MutSig program to ignore such genes. For example, KMT2D - a well known driver gene in Esophageal Carcinoma is represented as MLL2 in MutSig covariates. This causes KMT2D to be ignored from analysis and is represented as an insignificant gene in MutSig results. This function attempts to correct such gene symbols with a manually curated list of gene names compatible with MutSig covariates list.
laml.mutsig.corrected = prepareMutSig(maf = laml)
# Converting gene names for 1 variants from 1 genes
# Hugo_Symbol MutSig_Synonym N
# 1: ARHGAP35 GRLF1 1
# Original symbols are preserved under column OG_Hugo_Symbol.
We can also subset MAF using function subsetMaf
##Extract data for samples 'TCGA.AB.3009' and 'TCGA.AB.2933' (Printing just 5 rows for display convenience)
subsetMaf(maf = laml, tsb = c('TCGA.AB.3009', 'TCGA.AB.2933'))[1:5]
## Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome
## 1: ABCB11 8647 genome.wustl.edu 37 2
## 2: ACSS3 79611 genome.wustl.edu 37 12
## 3: ANK3 288 genome.wustl.edu 37 10
## 4: AP1G2 8906 genome.wustl.edu 37 14
## 5: ARC 23237 genome.wustl.edu 37 8
## Start_Position End_Position Strand Variant_Classification Variant_Type
## 1: 169780250 169780250 + Missense_Mutation SNP
## 2: 81536902 81536902 + Missense_Mutation SNP
## 3: 61926581 61926581 + Splice_Site SNP
## 4: 24033309 24033309 + Missense_Mutation SNP
## 5: 143694874 143694874 + Missense_Mutation SNP
## Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2
## 1: G G A
## 2: C C T
## 3: C C A
## 4: C C T
## 5: C C G
## Tumor_Sample_Barcode Protein_Change i_TumorVAF_WU i_transcript_name
## 1: TCGA.AB.3009 p.A1283V 46.97218 NM_003742.2
## 2: TCGA.AB.3009 p.A266V 47.32510 NM_024560.2
## 3: TCGA.AB.3009 43.99478 NM_020987.2
## 4: TCGA.AB.3009 p.R346Q 47.08000 NM_003917.2
## 5: TCGA.AB.3009 p.W253C 42.96435 NM_015193.3
##Same as above but return output as an MAF object
subsetMaf(maf = laml, tsb = c('TCGA.AB.3009', 'TCGA.AB.2933'), mafObj = TRUE)
## An object of class MAF
## ID summary Mean Median
## 1: NCBI_Build 37 NA NA
## 2: Center genome.wustl.edu NA NA
## 3: Samples 2 NA NA
## 4: nGenes 34 NA NA
## 5: Frame_Shift_Ins 5 2.5 2.5
## 6: In_Frame_Ins 1 0.5 0.5
## 7: Missense_Mutation 26 13.0 13.0
## 8: Nonsense_Mutation 2 1.0 1.0
## 9: Splice_Site 1 0.5 0.5
## 10: total 35 17.5 17.5
##Select all Splice_Site mutations from DNMT3A and NPM1
subsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), query = "Variant_Classification == 'Splice_Site'")
## Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome
## 1: DNMT3A 1788 genome.wustl.edu 37 2
## 2: DNMT3A 1788 genome.wustl.edu 37 2
## 3: DNMT3A 1788 genome.wustl.edu 37 2
## 4: DNMT3A 1788 genome.wustl.edu 37 2
## 5: DNMT3A 1788 genome.wustl.edu 37 2
## 6: DNMT3A 1788 genome.wustl.edu 37 2
## Start_Position End_Position Strand Variant_Classification Variant_Type
## 1: 25459874 25459874 + Splice_Site SNP
## 2: 25467208 25467208 + Splice_Site SNP
## 3: 25467022 25467022 + Splice_Site SNP
## 4: 25459803 25459803 + Splice_Site SNP
## 5: 25464576 25464576 + Splice_Site SNP
## 6: 25469029 25469029 + Splice_Site SNP
## Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2
## 1: C C G
## 2: C C T
## 3: A A G
## 4: A A C
## 5: C C A
## 6: C C A
## Tumor_Sample_Barcode Protein_Change i_TumorVAF_WU i_transcript_name
## 1: TCGA.AB.2818 p.R803S 36.84 NM_022552.3
## 2: TCGA.AB.2822 62.96 NM_022552.3
## 3: TCGA.AB.2891 34.78 NM_022552.3
## 4: TCGA.AB.2898 46.43 NM_022552.3
## 5: TCGA.AB.2934 p.G646V 37.50 NM_022552.3
## 6: TCGA.AB.2968 p.E477* 40.00 NM_022552.3
##Same as above but include only 'i_transcript_name' column in the output
subsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), query = "Variant_Classification == 'Splice_Site'", fields = 'i_transcript_name')
## Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele
## 1: DNMT3A 2 25459874 25459874 C
## 2: DNMT3A 2 25467208 25467208 C
## 3: DNMT3A 2 25467022 25467022 A
## 4: DNMT3A 2 25459803 25459803 A
## 5: DNMT3A 2 25464576 25464576 C
## 6: DNMT3A 2 25469029 25469029 C
## Tumor_Seq_Allele1 Tumor_Seq_Allele2 Variant_Classification Variant_Type
## 1: C G Splice_Site SNP
## 2: C T Splice_Site SNP
## 3: A G Splice_Site SNP
## 4: A C Splice_Site SNP
## 5: C A Splice_Site SNP
## 6: C A Splice_Site SNP
## Tumor_Sample_Barcode i_transcript_name
## 1: TCGA.AB.2818 NM_022552.3
## 2: TCGA.AB.2822 NM_022552.3
## 3: TCGA.AB.2891 NM_022552.3
## 4: TCGA.AB.2898 NM_022552.3
## 5: TCGA.AB.2934 NM_022552.3
## 6: TCGA.AB.2968 NM_022552.3
maftools has few other functions such as plotVaf
and genesToBarcodes
which helps to plot vaf distributions and maps samples where a given genes are mutated respectively.
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.5-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.5-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] corrplot_0.77 maftools_1.2.30 bigmemory_4.5.19
## [4] bigmemory.sri_0.1.3 Biobase_2.36.2 BiocGenerics_0.22.0
## [7] BiocStyle_2.4.0
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-131 bitops_1.0-6
## [3] matrixStats_0.52.2 doParallel_1.0.10
## [5] RColorBrewer_1.1-2 rprojroot_1.2
## [7] prabclus_2.2-6 GenomeInfoDb_1.12.0
## [9] tools_3.4.0 backports_1.1.0
## [11] R6_2.2.1 DBI_0.6-1
## [13] lazyeval_0.2.0 colorspace_1.3-2
## [15] trimcluster_0.1-2 nnet_7.3-12
## [17] GetoptLong_0.1.6 gridExtra_2.2.1
## [19] compiler_3.4.0 DelayedArray_0.2.4
## [21] pkgmaker_0.22 labeling_0.3
## [23] slam_0.1-40 rtracklayer_1.36.2
## [25] diptest_0.75-7 scales_0.4.1
## [27] DEoptimR_1.0-8 mvtnorm_1.0-6
## [29] robustbase_0.92-7 NMF_0.20.6
## [31] stringr_1.2.0 digest_0.6.12
## [33] Rsamtools_1.28.0 rmarkdown_1.5
## [35] cometExactTest_0.1.3 XVector_0.16.0
## [37] htmltools_0.3.6 changepoint_2.2.2
## [39] BSgenome_1.44.0 rlang_0.1.1
## [41] GlobalOptions_0.0.12 RSQLite_1.1-2
## [43] shape_1.4.2 zoo_1.8-0
## [45] mclust_5.3 BiocParallel_1.10.1
## [47] DPpackage_1.1-6 dendextend_1.5.2
## [49] dplyr_0.5.0 VariantAnnotation_1.22.0
## [51] RCurl_1.95-4.8 magrittr_1.5
## [53] modeltools_0.2-21 GenomeInfoDbData_0.99.0
## [55] wordcloud_2.5 Matrix_1.2-10
## [57] Rcpp_0.12.11 munsell_0.4.3
## [59] S4Vectors_0.14.1 viridis_0.4.0
## [61] stringi_1.1.5 whisker_0.3-2
## [63] yaml_2.1.14 MASS_7.3-47
## [65] SummarizedExperiment_1.6.1 zlibbioc_1.22.0
## [67] flexmix_2.3-14 plyr_1.8.4
## [69] grid_3.4.0 ggrepel_0.6.5
## [71] lattice_0.20-35 cowplot_0.7.0
## [73] Biostrings_2.44.0 splines_3.4.0
## [75] GenomicFeatures_1.28.0 circlize_0.3.10
## [77] knitr_1.16 ComplexHeatmap_1.14.0
## [79] GenomicRanges_1.28.2 rjson_0.2.15
## [81] fpc_2.1-10 rngtools_1.2.4
## [83] biomaRt_2.32.0 reshape2_1.4.2
## [85] codetools_0.2-15 stats4_3.4.0
## [87] XML_3.98-1.7 evaluate_0.10
## [89] data.table_1.10.4 foreach_1.4.3
## [91] gtable_0.2.0 kernlab_0.9-25
## [93] assertthat_0.2.0 ggplot2_2.2.1
## [95] gridBase_0.4-7 xtable_1.8-2
## [97] class_7.3-14 survival_2.41-3
## [99] viridisLite_0.2.0 tibble_1.3.1
## [101] iterators_1.0.8 GenomicAlignments_1.12.1
## [103] AnnotationDbi_1.38.0 memoise_1.1.0
## [105] registry_0.3 IRanges_2.10.1
## [107] cluster_2.0.6