Abstract
Heterogeneous cells within a stable state are controlled by gene regulatory networks. Cell-state state transitions are not just gradually and linearly changing phenomena, but rather can be described as nonlinear critical transitions or tipping points (Moris 2016). This abrupt transition between two stable states has been well described using a tipping-point model (Scheffer 2012). When a complex system arrives at a tipping point, there exists fundamental instability in the system. Although this brings a potential risks of unwanted collapse, it also presents an opportunity for positive change through reversal of trajectory. The idea of tipping-point prediction has been previously pursued and implimented, however existing methods each have their limitations. The ‘early-warning’ package in R (http://www.early-warning-signals.org/) is designed for longitudinal data analysis and is applicable to ecosystems, ecology, space, and other fields. However, we found it hard to apply to high-throughput ‘omics’ data. Another published method for biological tipping-point characterization is Dynamic Network Biomarker (DNB) (Chen 2012). This method relies on indexing gene-oscillation per state, and thus is not always feasible for cross-state comparisons and indication of tipping points. Another published method, Index of critical transition (Ic-score), predicts tipping point is limited by its reliance on pre-defined features (Mojtahedi 2016). The purpose of this R package BioTIP is to characterize Biological Tipping-Points from a high-throughput data matrix where rows represent molecular features such as transcripts and columns represent samples or cells in a data matrix. BioTIP implicates one new method and the two aforementioned published methods(DNB and Ic-score) to allow for flexible analysis of not only longitudinal, but also cross-sectional datasets. Besides overcoming the limitations of previous methods, BioTIP provides functions to optimize feature pre-selection and evaluate empirical significance. Additionally, BioTIP defines transcript biotypes according the GENCODE annotation, allowing for study of noncoding transcription (Wang 2018). These improvements allow for an application to cross-sectional datasets. The BioTIP scheme acts as a hybrid model to join the advantages of both DNB and Ic-scoring, and providing enhanced features for high-throughput ‘omics’ data analysis.An existing dataset, GSE6136, is used to demonstrate how our functions are applied. Samples were collected from transgenic mouse lymphomas and divided into five groups based on clinical presentation, pathology and flow cytometry (Lenburg 2007), thus belonging to cross-sectional profiles. Noticing these five group represent a control stage similar to non-transgenic B cells and four major periods of B-cell lymphomagenesis, Dr. Chen and coauthors used the DNB method to identify the pre-disease state exists around the normal activated period (P2), i.e., the system transitions to the disease state around the transitional lymphoma period (Figure S12 in publication (Chen 2012)). Start by installing the package ‘BioTIP’ and other dependent packages such as stringr, psych, and igraph if necessary. Below are some examples.
# load package
library(BioTIP)
Once all the required packages are installed, load using “read.table()”
function as follows. Note to change the read.table(“PATH”) when running your datasets. To check the dimension of your data set use “dim(dataset)” function. Here we show a use of dim function “dim(GSE6136)”. Notice that after editing the column and row, the dimension of our dataset changes from (22690,27) to (22690, 26) because we removed the downloaded first row after assigning it to be column-name of the final numeric data matrix.
data(GSE6136_matrix)
dim(GSE6136_matrix) #[1] 22690rows and 27 columns
#> [1] 22690 27
row.names(GSE6136_matrix) = GSE6136_matrix$ID_REF
GSE6136 = GSE6136_matrix[,-1]
dim(GSE6136) #[1] 22690 rows and 26 columns
#> [1] 22690 26
The summary of GSE6136_matrix is GSE6136_cli shown below. These two kind of data files can be downloaded from GSE database
#requires library(stringr)
library(BioTIP)
data(GSE6136_cli)
#dim(GSE6136_cli) #check dimension
cli = t(GSE6136_cli)
library(stringr)
colnames(cli) = str_split_fixed(cli[1,],'_',2)[,2]
cli = cli[-1,]
cli = data.frame(cli)
cli[,"cell-type:ch1"] = str_split_fixed(cli$characteristics_ch1.1,": ",2)[,2]
cli[,"Ig clonality:ch1"] = str_split_fixed(cli$characteristics_ch1.3,": ",2)[,2]
colnames(cli)[colnames(cli) == "cell-type:ch1"] = "group"
cli$Row.names = cli[,1]
head(cli[,1:3])
#> geo_accession status submission_date
#> V2 GSM142398 Public on Oct 28 2006 Oct 25 2006
#> V3 GSM142399 Public on Oct 28 2006 Oct 25 2006
#> V4 GSM142400 Public on Oct 28 2006 Oct 25 2006
#> V5 GSM142401 Public on Oct 28 2006 Oct 25 2006
#> V6 GSM142402 Public on Oct 28 2006 Oct 25 2006
#> V7 GSM142403 Public on Oct 28 2006 Oct 25 2006
We normalized the expression of genes using log2() scale. This normalization will ensure a more accurate comparison of the variance between the expression groups (clusters).
dat <- GSE6136
df <- log2(dat+1)
head(df)
#> GSM142398 GSM142399 GSM142400 GSM142401 GSM142402 GSM142403
#> 1415670_at 10.219169 10.290480 9.987548 10.076816 9.827343 9.871289
#> 1415671_at 10.903581 11.159934 10.776186 10.929998 11.468268 11.200408
#> 1415672_at 11.115304 10.892087 11.091303 11.040290 11.109504 11.325305
#> 1415673_at 8.990388 10.265615 8.742141 8.422065 8.963474 8.874674
#> 1415674_a_at 9.911692 10.665869 9.942661 9.793766 10.243650 10.147078
#> 1415675_at 9.524933 9.896332 9.590774 9.375474 9.392747 9.422065
#> GSM142404 GSM142405 GSM142406 GSM142407 GSM142408 GSM142409
#> 1415670_at 9.675428 9.950702 9.848153 10.103419 10.040838 9.823367
#> 1415671_at 10.654726 10.875135 10.854245 10.898450 10.736909 10.000000
#> 1415672_at 11.639702 10.989253 11.021605 11.255088 11.278449 11.232960
#> 1415673_at 8.923327 9.214562 8.965496 9.212861 9.121793 9.634993
#> 1415674_a_at 10.133271 10.344962 10.392210 10.551131 9.750205 10.642864
#> 1415675_at 9.396605 9.894666 9.818103 9.794741 9.857981 10.175924
#> GSM142410 GSM142411 GSM142412 GSM142413 GSM142414 GSM142415
#> 1415670_at 9.974271 10.226412 9.471878 9.483413 9.743320 10.126317
#> 1415671_at 10.341185 10.515897 10.486835 11.247394 10.813861 10.714589
#> 1415672_at 11.227135 10.945444 10.904334 11.208173 11.316168 11.251542
#> 1415673_at 9.696794 9.112440 9.050121 9.026523 8.806711 9.242221
#> 1415674_a_at 10.239360 10.351381 10.469744 9.948075 9.760886 10.089980
#> 1415675_at 9.603997 9.392532 9.250062 9.563387 9.597680 9.594325
#> GSM142416 GSM142417 GSM142418 GSM142419 GSM142420 GSM142421
#> 1415670_at 9.654457 9.976134 10.033423 9.712699 10.093286 9.939138
#> 1415671_at 10.429407 10.590681 10.663825 10.401733 10.812819 10.574026
#> 1415672_at 10.418696 11.120108 11.484219 11.799808 11.337064 11.169048
#> 1415673_at 8.240791 10.588715 10.380136 10.484521 10.616273 9.051209
#> 1415674_a_at 9.813941 10.572984 10.664181 10.667023 10.865347 9.748696
#> 1415675_at 8.961739 9.798634 9.638436 9.446670 9.690522 9.389739
#> GSM142422 GSM142423
#> 1415670_at 10.152792 9.838258
#> 1415671_at 10.615905 10.375908
#> 1415672_at 11.469845 11.542500
#> 1415673_at 9.899055 10.382732
#> 1415674_a_at 9.434003 9.690696
#> 1415675_at 9.111657 9.116084
Pre-selection Transcript
Once normalized, we can now classify different stages. The tipping point is within the “activated” state in this case. Here we see the number of samples that are classified into states “activated”, “lymphoma_aggressive”, “lymphoma_marginal”, “lymphoma_transitional” and “resting”. For instance, states “activated” and “resting” contain three and four samples, respectively. All the contents of the data set “test” can be viewed using View(test)
. Each clinical state’s content can be viewed using head(test["stage_name"])
. For instance, head(test[“activated”]) shows contents of the activated state.
tmp <- names(table(cli$group))
samplesL <- split(cli[,1],f = cli$group)
#head(samplesL)
test <- sd_selection(df, samplesL,0.01)
head(test[["activated"]])
#> GSM142399 GSM142422 GSM142423
#> 1415766_at 7.600656 8.778077 9.036723
#> 1415827_a_at 11.002252 13.079218 13.205503
#> 1415904_at 12.229810 5.885086 4.217231
#> 1415985_at 11.786106 10.736656 10.044940
#> 1416034_at 11.417114 13.474682 13.760252
#> 1416071_at 11.194388 10.362273 9.993646
<a name=“>Go to Top
Network Partition
A graphical represetation of genes of interest can be achieved using the functions shown below. The getNetwork
function will obtain an igraph object based on a pearson correlation of test
. This igraphL
object is then run using the getCluster_methods
function classify nodes.
library(BioTIP)
#library(igraph)
library(cluster)
igraphL <- getNetwork(test, fdr = 1)
#> activated:226 nodes
#> lymphoma_aggressive:226 nodes
#> lymphoma_marginal:226 nodes
#> lymphoma_transitional:226 nodes
#> resting:226 nodes
cluster <- getCluster_methods(igraphL)
names(cluster)
#> [1] "activated" "lymphoma_aggressive" "lymphoma_marginal"
#> [4] "lymphoma_transitional" "resting"
head(cluster[[1]])
#> $`1`
#> [1] "1415766_at" "1415827_a_at" "1415904_at" "1415985_at"
#> [5] "1416034_at" "1416071_at" "1416138_at" "1416488_at"
#> [9] "1416629_at" "1416632_at" "1416892_s_at" "1416930_at"
#> [13] "1417185_at" "1417244_a_at" "1417815_a_at" "1417816_s_at"
#> [17] "1417934_at" "1418000_a_at" "1418133_at" "1418191_at"
#> [21] "1418466_at" "1418580_at" "1418687_at" "1418727_at"
#> [25] "1419027_s_at" "1419135_at" "1419170_at" "1419186_a_at"
#> [29] "1419357_at" "1419412_at" "1419538_at" "1419659_s_at"
#> [33] "1419676_at" "1419680_a_at" "1420138_at" "1420503_at"
#> [37] "1420619_a_at" "1421055_at" "1421279_at" "1421377_at"
#> [41] "1421469_a_at" "1421560_at" "1421688_a_at" "1421830_at"
#> [45] "1421832_at" "1421854_at" "1421855_at" "1422032_a_at"
#> [49] "1422302_s_at" "1422709_a_at" "1422885_at" "1423030_at"
#> [53] "1423166_at" "1423200_at" "1423419_at" "1423493_a_at"
#> [57] "1423591_at" "1423747_a_at" "1423826_at" "1423961_at"
#> [61] "1424193_at" "1424522_at" "1424573_at" "1424775_at"
#> [65] "1424990_at" "1425154_a_at" "1425356_at" "1425593_at"
#> [69] "1425650_at" "1425675_s_at" "1425745_a_at" "1425792_a_at"
#> [73] "1426213_at" "1426269_at" "1426562_a_at" "1426713_s_at"
#> [77] "1426794_at" "1426875_s_at" "1427562_a_at" "1428040_at"
#> [81] "1428106_at" "1428306_at" "1428476_a_at" "1428616_at"
#> [85] "1428648_at" "1428820_at" "1431056_a_at" "1431591_s_at"
#> [89] "1434045_at" "1435476_a_at" "1435477_s_at" "1435602_at"
#> [93] "1435652_a_at" "1436032_at" "1436900_x_at" "1436996_x_at"
#> [97] "1437172_x_at" "1437502_x_at" "1437765_at" "1437890_at"
#> [101] "1438095_x_at" "1438385_s_at" "1438548_x_at" "1438932_at"
#> [105] "1438933_x_at" "1448163_at" "1448182_a_at" "1448391_at"
#> [109] "1448403_at" "1448724_at" "1448731_at" "1448957_at"
#> [113] "1449474_a_at" "1449553_at" "1450107_a_at" "1450300_at"
#> [117] "1450484_a_at" "1450490_at" "1450814_a_at" "1450844_at"
#> [121] "1450883_a_at" "1450884_at" "1450903_at" "1450945_at"
#> [125] "1451110_at" "1451143_at" "1451191_at" "1451227_a_at"
#> [129] "1451256_at" "1451321_a_at" "1451335_at" "1451426_at"
#> [133] "1451468_s_at" "1451489_at" "1451924_a_at" "1452162_at"
#> [137] "1452170_at" "1452252_at" "1452532_x_at" "1453360_a_at"
#> [141] "1454021_a_at" "1454626_at" "1454963_at" "1455332_x_at"
#> [145] "1455550_x_at" "1455873_a_at" "1455899_x_at" "1456066_a_at"
#> [149] "1456323_at" "1460379_at" "1460682_s_at"
#>
#> $`2`
#> [1] "1416274_at" "1416573_at" "1417862_at" "1418103_at"
#> [5] "1418525_at" "1418535_at" "1418691_at" "1419415_a_at"
#> [9] "1420667_at" "1422242_at" "1422598_at" "1422898_s_at"
#> [13] "1423854_a_at" "1424310_at" "1425093_at" "1425744_a_at"
#> [17] "1425834_a_at" "1425843_at" "1426444_at" "1430167_a_at"
#> [21] "1432155_at" "1432360_a_at" "1434009_at" "1435317_x_at"
#> [25] "1435561_at" "1436364_x_at" "1448058_s_at" "1449266_at"
#> [29] "1449318_at" "1449657_at" "1450058_at" "1450073_at"
#> [33] "1450552_at" "1450817_at" "1451030_at" "1451530_at"
#> [37] "1451861_at" "1454078_a_at" "1455890_x_at" "1460653_at"
#>
#> $`3`
#> [1] "1416823_a_at" "1417184_s_at" "1417498_at" "1417714_x_at"
#> [5] "1417863_at" "1418113_at" "1419028_at" "1419734_at"
#> [9] "1421044_at" "1421351_at" "1421584_at" "1421848_at"
#> [13] "1421961_a_at" "1422092_at" "1423244_at" "1424086_at"
#> [17] "1424690_at" "1426225_at" "1426614_at" "1426973_at"
#> [21] "1427627_at" "1427828_at" "1428361_x_at" "1429980_x_at"
#> [25] "1435759_at" "1436336_at" "1437684_at" "1438564_at"
#> [29] "1450198_at" "1450227_at" "1451999_at" "1452757_s_at"
#> [33] "1453886_a_at" "1454629_at" "1460667_at"
<a name=“>Go to Top
Identifying Dynamic Network Biomodule
Here, ‘module’ refers to a cluster of network nodes (e.g. transcripts) highly linked (e.g. by correlation). “Biomodule” refers to the module resenting a highest score called “Module-Criticality Index (MCI)” per state.
The following step shows a graph of classified clustered samples for five different stages. MCI score is calculated for each module using the getMCI
function. The getMaxMCImember
function will obtain a list of modules with highest MCI at each stage. Use "head(maxCIms)"
to view the MCI scores calculated. Use plotMaxMCI
function to view the line plot of highest MCI score at each stage.
membersL_noweight <- getMCI(cluster,test,adjust.size = FALSE)
plotBar_MCI(membersL_noweight,ylim = c(0,6))
maxMCIms <- getMaxMCImember(membersL_noweight[[1]],membersL_noweight[[2]],min =10)
names(maxMCIms)
#> [1] "idx" "members"
names(maxMCIms[[1]])
#> [1] "activated" "lymphoma_aggressive" "lymphoma_marginal"
#> [4] "lymphoma_transitional" "resting"
names(maxMCIms[[2]])
#> [1] "activated" "lymphoma_aggressive" "lymphoma_marginal"
#> [4] "lymphoma_transitional" "resting"
head(maxMCIms[['idx']])
#> activated lymphoma_aggressive lymphoma_marginal
#> 3 1 1
#> lymphoma_transitional resting
#> 5 3
head(maxMCIms[['members']][['lymphoma_aggressive']])
#> [1] "1416001_a_at" "1416002_x_at" "1416527_at" "1416754_at"
#> [5] "1416771_at" "1417133_at"
To get the selected statistics of biomodules (the module that has the highest MCI score) of each state, please run the following
biomodules = getMaxStats(membersL_noweight[['members']],maxMCIms[[1]])
maxMCI = getMaxStats(membersL_noweight[['MCI']],maxMCIms[[1]])
head(maxMCI)
#> activated lymphoma_aggressive lymphoma_marginal
#> 3.522785 3.334630 2.387766
#> lymphoma_transitional resting
#> 2.448548 2.886293
maxSD = getMaxStats(membersL_noweight[['sd']],maxMCIms[[1]])
head(maxSD)
#> activated lymphoma_aggressive lymphoma_marginal
#> 2.139577 1.885470 1.295115
#> lymphoma_transitional resting
#> 1.790035 1.469989
To get the biomodule with the highest MCI score among all states, as we call it CTS (Critical Transition Signals), please run the following.
CTS = getCTS(maxMCI, maxMCIms[[2]])
#> Length: 35
Run the following to visualize the trendence of every state represented by the cluster with the highest MCI scores.
par(mar = c(10,5,0,2))
plotMaxMCI(maxMCIms,membersL_noweight[[2]],states = names(samplesL),las = 2)
We then perform simulation for MCI scores based on identified signature size (length(CTS) ) using the simulationMCI
function.Use plot_MCI_simulation
function to visualize the result. This step usually takes 20-30mins, so here to save the time, we picked a small number 3 as the length of the CTS.
simuMCI <- simulationMCI(3,samplesL,df)
plot_MCI_Simulation(maxMCI,simuMCI)
The next step is to calculate an Index of Critical transition (Ic score) of the dataset. First, use the getIc function to calculate the Ic score based on the biomodule previously identified. We use the plotIc function to draw a line plot of the Ic score.
IC <- getIc(df,samplesL,CTS)
par(mar = c(10,5,0,2))
plotIc(IC,las = 2)
Then use the two functions to evaluate two types of empirical significance, respectively. The first function simulation_Ic calculates random Ic-scores by shuffling features (transcripts). Showing in the plot is Ic-score of the identification (red) against its corresponding size-controlled random scores (grey).
simuIC <- simulation_Ic(length(CTS),samplesL,df)
par(mar = c(10,5,0,2))
plot_Ic_Simulation(IC,simuIC,las = 2)
Another function plot_simulation_sample calculates random Ic-scores by shuffling samples and visualizes the results. Showing in the plot is observed Ic-score (red vertical line) comparing to the density of random scores (grey), at the tipping point identified.
simulated_Ic = plot_simulation_sample(df,length(samplesL[['lymphoma_aggressive']]),IC[['lymphoma_aggressive']],CTS)
<a name=“>Go to top
###Transcript Annotation and Biotype### ###################################################
Quick Start
The R function getBiotype is used to group transcripts of interest into 11 biotypes based on GENCODE annotation (Fig 2a). When a query transcript overlaps at least half with a GENCODE transcript on the same strand, this query transcript will inherit the biotype of the GENCODE transcript.
In the previous study conducted, five out of the 11 biotypes showed high protein-coding potential while the others did not (Fig 2b) [4]. We thus concluded these seven other biotypes, including protein-coding antisense RNAs, to be lncRNAs. The remaining coding biotypes in this study included canonic protein coding (CPC), ‘PC_mixed’, and ‘PC_intron’.
First start by loading the required libraries: “GenomeInfoDb,” “BioTIP,” “GenomicRanges,” “IRanges” and “BioTIP”. Next load the datasets: “gencode”, “ILEF”, “intron” and “cod”. Using these datasets, excute BioTIP functions getBiotypes and getReadthrough as follows. These steps assume you installed the “BioTIP” package. If you did not install the package, use the install.packages("BioTIP")
to install in R.
Fig 2. A getBiotypes workflow and protein-coding potential in real data analysis [4]. (a) Workflow of an in-house R function (getBiotypes) to query transcripts of interests and classify into biotypes. (b) Pie-chart of eleven types of transcripts assembled from polyadenylated RNA(TARGET). (c) Empirical cumulative distribution plot comparing the transcripts across all 11 biotypes. The protein-coding potential was estimated with the Coding Potential Assessment Tool (CPAT). Line color codes biotypes. The more a line towards the right-bottom corner, the lower protein-coding potential it has.
library(BioTIP)
data(gencode)
head(gencode)
#> GRanges object with 6 ranges and 1 metadata column:
#> seqnames ranges strand | biotype
#> <Rle> <IRanges> <Rle> | <factor>
#> [1] chr21 9683191-9683272 + | miRNA
#> [2] chr21 9683191-9683272 + | miRNA
#> [3] chr21 9683191-9683272 + | miRNA
#> [4] chr21 9825832-9826011 + | miRNA
#> [5] chr21 9825832-9826011 + | miRNA
#> [6] chr21 9825832-9826011 + | miRNA
#> -------
#> seqinfo: 25 sequences from an unspecified genome; no seqlengths
These illustrations above assumes you have installed “BioTIP” package. If you did not install the package already, use the install.packages("BioTIP")
to install in R.
Genomic Data Source
High quality human genome sequence data can be obtained from various sources. To demonstrate this package, we obtained a comprehensive gene annotation of human GRCh37 from GENCODE. For our illustrations, human GRCh37 data will be used. A standard file structure, similar to general transfer format (gtf) format, is required for this package. This gtf file organizes genomic data in rows and columns (fields). Each row contains information about specific samples. The columns are tab separated headers of the data frame. There are eight fixed columns with specific headers. An example of gtf format is shown below. For details of the gtf file format visit this link.
The table above contains a chr21 dataset which was extracted from a full genome dataset. An extraction method for filtering chr21 from gencode
file is described below.
Extracting Summary Data
Before any further analysis, we need to summarize the content of the raw gtf data. There are two ways to get genome biotypes: a) “transcript_type” b) “gene_type”. Due to our interst in coding and noncoding regions, the transcript_type
method was used to extract the regions of interest using python script shown below. Note that the "PATH_FILE"
refers to the path where the downloded gtf file is located. For instance, if the gtf file is located on your desktop
, replace the "PATH_FILE"
Cc "Users/user/Desktop/gtf"
.
Python codes:
gtf = ("Your/PATH/TO/THE/FILE")
outF = open("gtf_summary_transbiotype.txt","w")
def getquote(str,f,target):
targetLen = len(target)+2
strInd = str.find(target)
st = strInd + len(target)+2
ed = st + str[st:].find("";")
#print(st,ed)
f.write("\t"+str[st:ed]) if strInd!= -1 else f.write("\t"+"NA.")
with open(gtf, "r") as f:
for line in f:
if line[0] != "#":
chromosome = line.split("\t")[0]
st = line.split("\t")[3]
ed = line.split("\t")[4]
strand = line.split("\t")[6]
type = line.split("\t")[2]
outF.write(chromosome+"\t"+st+"\t"+ed+"\t"+strand+"\t"+type)
c = "transcript_id"
g = "gene_name"
t = "transcript_type"
getquote(line,outF,c)
getquote(line,outF,g)
getquote(line,outF,t)
outF.write("\n")
outF.close()
Loading Data
In order to load your data from a local drive, use the following format. Note that the "PATH_FILE"
refers to the location of the summary data from the above section. For more details on how to load datasets click here.
data <- read.delim(“PATH_FILE”, comment.char = “#”)
Internal BioTIP package data is included in the data folder. The data can be loaded into R working console using data()
function. Here we show an example of how to load a dataset gencode
from the data directory. A quick view of the data can be achieved using head(gencode)
.
library(BioTIP)
library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#>
#> clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#> clusterExport, clusterMap, parApply, parCapply, parLapply,
#> parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect,
#> is.unsorted, lapply, mapply, match, mget, order, paste, pmax,
#> pmax.int, pmin, pmin.int, rank, rbind, rownames, sapply,
#> setdiff, sort, table, tapply, union, unique, unsplit, which,
#> which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#>
#> expand.grid
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
data(gencode)
head(gencode)
#> GRanges object with 6 ranges and 1 metadata column:
#> seqnames ranges strand | biotype
#> <Rle> <IRanges> <Rle> | <factor>
#> [1] chr21 9683191-9683272 + | miRNA
#> [2] chr21 9683191-9683272 + | miRNA
#> [3] chr21 9683191-9683272 + | miRNA
#> [4] chr21 9825832-9826011 + | miRNA
#> [5] chr21 9825832-9826011 + | miRNA
#> [6] chr21 9825832-9826011 + | miRNA
#> -------
#> seqinfo: 25 sequences from an unspecified genome; no seqlengths
Prepare GRanges Object
Here we show an extraction of “gencode” dataset using R commands. Note to replace PATH_FILE
with file direcotry path. gtf
refers to the full genome file. A subset function was used to filter chr21 dataset as follows.
chr21 <- subset(gencode, seqnames == "chr21")
#“genecode” = whole genome gtf
> gtf = read.table("PATH_FILE")
> gtf = subset(gtf, biotype == "transcript")
> colnames(gtf) = c("chr","start","end","strand","biotype")
> gr = GRanges(gtf)
query <- GRanges(c("chr1:2-10:+","chr1:6-10:-"), Row.names = c("trans1","trans2"), score = c(1,2))
head(query)
#> GRanges object with 2 ranges and 2 metadata columns:
#> seqnames ranges strand | Row.names score
#> <Rle> <IRanges> <Rle> | <character> <numeric>
#> [1] chr1 2-10 + | trans1 1
#> [2] chr1 6-10 - | trans2 2
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
library(BioTIP)
gr <- GRanges(c("chr1:1-5:+","chr1:2-3:+"),biotype = c("lincRNA","CPC"))
head(gr)
#> GRanges object with 2 ranges and 1 metadata column:
#> seqnames ranges strand | biotype
#> <Rle> <IRanges> <Rle> | <character>
#> [1] chr1 1-5 + | lincRNA
#> [2] chr1 2-3 + | CPC
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Intron coordinates
intron <- GRanges("chr1:6-8:+")
intron <- GRanges("chr1:6-8:+")
head(intron)
#> GRanges object with 1 range and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 6-8 +
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Filtering non-coding regions using products from example 1, 2 and 3
intron_trncp <- getBiotypes(query, gr, intron)
intron_trncp
#> seqnames start end width strand Row.names score type.fullOverlap
#> 1 chr1 2 10 9 + trans1 1 de novo
#> 2 chr1 6 10 5 - trans2 2 de novo
#> type.partialOverlap type.50Overlap hasIntron type.toPlot
#> 1 lincRNA, CPC lincRNA, CPC yes lincRNA
#> 2 de novo de novo no de novo
# Filtering Intron and Exons
Here we show how to obtain protein coding and non-coding from our datasets. The coding transcripts are an expressed section of the genome that is responsible for protein formation. Meanwhile the non-coding transcripts are vital in the formation regulatory elements such promoters, enhancers and silencers.
library(BioTIP)
data("intron")
data("ILEF")
data("gencode")
gencode_gr = GRanges(gencode)
ILEF_gr = GRanges(ILEF)
cod_gr = GRanges(cod)
intron_gr = GRanges(intron)
non_coding <- getBiotypes(ILEF_gr, gencode_gr, intron_gr)
#> Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use
#> suppressWarnings() to suppress this warning.)
dim(non_coding)
#> [1] 300 11
head(non_coding[,1:3])
#> seqnames start end
#> 1 chr21 15608524 15710335
#> 2 chr21 43619799 43717938
#> 3 chr21 28208595 28217692
#> 4 chr21 28279059 28339668
#> 5 chr21 46493768 46646483
#> 6 chr21 45285030 45407475
coding <- getBiotypes(ILEF_gr, gencode_gr)
dim(coding)
#> [1] 300 11
head(coding[,1:3])
#> seqnames start end
#> 1 chr21 15608524 15710335
#> 2 chr21 43619799 43717938
#> 3 chr21 28208595 28217692
#> 4 chr21 28279059 28339668
#> 5 chr21 46493768 46646483
#> 6 chr21 45285030 45407475
# Samples with overlapping coding regions.
library(BioTIP)
data(ILEF)
data(cod)
ILEF_gr = GRanges(ILEF)
cod_gr = GRanges(cod)
rdthrough <- getReadthrough(ILEF_gr, cod_gr)
head(rdthrough)
#> seqnames start end width strand Row.names readthrough
#> 1 chr21 15608524 15710335 101812 + ABCC13 0
#> 2 chr21 43619799 43717938 98140 + ABCG1 1
#> 3 chr21 28208595 28217692 9098 - ADAMTS1 1
#> 4 chr21 28279059 28339668 60610 - ADAMTS5 1
#> 5 chr21 46493768 46646483 152716 + ADARB1 1
#> 6 chr21 45285030 45407475 122446 + AGPAT3 1
Acknowledgements
The development of this package would not be possible without continous help and feedback from individuals and institutions including: The Bioconductor Core Team, Dr. Xianan H Yang, Dr. Tzintzuni Garcia, and National Institutes of Health R21LM012619.
sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0 IRanges_2.20.0
#> [4] S4Vectors_0.24.0 BiocGenerics_0.32.0 cluster_2.1.0
#> [7] stringr_1.4.0 BioTIP_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] igraph_1.2.4.1 Rcpp_1.0.2 XVector_0.26.0
#> [4] knitr_1.25 magrittr_1.5 zlibbioc_1.32.0
#> [7] mnormt_1.5-5 lattice_0.20-38 rlang_0.4.1
#> [10] highr_0.8 tools_3.6.1 grid_3.6.1
#> [13] nlme_3.1-141 xfun_0.10 psych_1.8.12
#> [16] htmltools_0.4.0 yaml_2.2.0 digest_0.6.22
#> [19] GenomeInfoDbData_1.2.2 bitops_1.0-6 RCurl_1.95-4.12
#> [22] evaluate_0.14 rmarkdown_1.16 stringi_1.4.3
#> [25] compiler_3.6.1 foreign_0.8-72 pkgconfig_2.0.3
<a name=“>Go to Top
References