1 Introduction
2 Finding Open Reading Frames
- 2.1 Example of finding ORFs in on 5’ UTR of hg19
- 2.2 Getting fasta sequences of ORFs
3 New GRanges and GRangesList utilities for ORFs
4 When to use which ORFfinding function
5 CageSeq data for 5’ UTR re-annotation
6 RiboSeq footprints automatic shift detection and shifting
7 Gene identity functions for ORFs or genes
8 Calculating Kozak sequence score for ORFs
9 Using ORFik in your package or scripts
10 Coverage plots made easy with ORFik
- 10.1 Multiple data sets in one plot

1 Introduction

Welcome to the ORFik package. This vignette will walk you through our detailed package usage with examples. ORFik is an R package containing various functions for analysis of RiboSeq, RNASeq and CageSeq data.

ORFik currently supports:

Finding Open Reading Frames (very fast) in the genome of interest or on the set of transcripts/sequences.
Automatic estimations of RiboSeq footprint shift.
Utilities for metaplots of RiboSeq coverage over gene START and STOP codons allowing to spot the shift.
Shifting functions for the RiboSeq data.
Finding new Transcription Start Sites with the use of CageSeq data.
Various measurements of gene identity e.g. FLOSS, coverage, ORFscore, entropy that are recreated based on many scientific publications.
Utility functions to extend GenomicRanges for faster grouping, splitting, tiling etc.
Several standardized plots for coverage and metacoverage of NGS data, including smart grouping functions for easier prototyping.

2 Finding Open Reading Frames

In molecular genetics, an Open Reading Frame (ORF) is the part of a reading frame that has the ability to be translated. It does not mean that every ORF is being translated or is functional, but to be able to find novel genes we must be able to firstly identify potential ORFs.

To find all Open Reading Frames (ORFs) and possibly map them to genomic coordinates ORFik gives you three main functions:

findORFs - find ORFs in sequences of interest,
findMapORFs - find ORFs and map them to their respective genomic coordinates
findORFsFasta - find ORFs in Fasta file or BSGenome (supports circular genomes!)

2.1 Example of finding ORFs in on 5’ UTR of hg19

library(ORFik)
library(GenomicFeatures)

After loading libraries, load example data from GenomicFeatures. We load gtf file as txdb. We will extract the 5’ leaders to find all upstream open reading frames.

txdbFile <- system.file("extdata", "hg19_knownGene_sample.sqlite", 
                        package = "GenomicFeatures")
txdb <- loadTxdb(txdbFile)
fiveUTRs <- fiveUTRsByTranscript(txdb, use.names = TRUE)
fiveUTRs[1]

## GRangesList object of length 1:
## $uc001bum.2
## GRanges object with 1 range and 3 metadata columns:
##       seqnames            ranges strand |   exon_id   exon_name exon_rank
##          <Rle>         <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr1 32671236-32671282      + |         1        <NA>         1
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

As we can see we have extracted 5’ UTRs for hg19 annotations. Now we can load BSgenome version of human genome (hg19). If you don’t have this package installed you will not see the result from the code below. You might have to install BSgenome.Hsapiens.UCSC.hg19 and run the code for yourself as we don’t install this package together with ORFik.

if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {

  # Extract sequences of fiveUTRs.
  # Either you import fasta file of ranges, or you have some BSgenome.
  tx_seqs <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19::Hsapiens, 
                                   fiveUTRs) 
  
  # Find all ORFs on those transcripts and get their genomic coordinates
  fiveUTR_ORFs <- findMapORFs(fiveUTRs, tx_seqs, groupByTx = FALSE)
  fiveUTR_ORFs
}

## GRangesList object of length 839:
## $uc010ogz.1_1
## GRanges object with 2 ranges and 1 metadata column:
##              seqnames            ranges strand |        names
##                 <Rle>         <IRanges>  <Rle> |  <character>
##   uc010ogz.1     chr1 32671314-32671324      + | uc010ogz.1_1
##   uc010ogz.1     chr1 32671755-32671902      + | uc010ogz.1_1
##   -------
##   seqinfo: 93 sequences from an unspecified genome; no seqlengths
## 
## $uc010ogz.1_2
## GRanges object with 1 range and 1 metadata column:
##              seqnames            ranges strand |        names
##                 <Rle>         <IRanges>  <Rle> |  <character>
##   uc010ogz.1     chr1 32672038-32672076      + | uc010ogz.1_2
##   -------
##   seqinfo: 93 sequences from an unspecified genome; no seqlengths
## 
## $uc010ogz.1_3
## GRanges object with 2 ranges and 1 metadata column:
##              seqnames            ranges strand |        names
##                 <Rle>         <IRanges>  <Rle> |  <character>
##   uc010ogz.1     chr1 32671237-32671324      + | uc010ogz.1_3
##   uc010ogz.1     chr1 32671755-32671807      + | uc010ogz.1_3
##   -------
##   seqinfo: 93 sequences from an unspecified genome; no seqlengths
## 
## ...
## <836 more elements>

In the example above you can see that fiveUTR_ORFs are grouped by transcript, the first group is from transcript “uc010ogz.1”. Meta-column names contains name of the transcript and identifier of the ORF separated by “_“. When ORF is separated into two exons you can see it twice, like the first ORF with name”uc010ogz.1_1“. The first ORF will always be the one most upstream for”+" strand, and the least upstream for “-” strand.

2.2 Getting fasta sequences of ORFs

Now lets see how easy it is to get fasta sequences from the ranges

if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  orf_seqs <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19::Hsapiens,
                                    fiveUTR_ORFs[1])
  # To save as .fasta do:
  # writeXStringSet(orf_seqs, filepath = "uorfs.fasta")
  orf_seqs[1]
}

##   A DNAStringSet instance of length 1
##     width seq                                               names               
## [1]   159 CTGCATTGCAGGCCTGCGTCCGG...GCCGCGATTCCTCCCAGAGGTAG uc010ogz.1_1

You can see orf 1 named (uc010ogz.1_1) has a CTG start codon, a TAG stop codon and 159/3 = 53 codons. We will now look on ORFik functions to get startcodons and stopcodon etc.

3 New GRanges and GRangesList utilities for ORFs

ORFik contains functions that can be utilized to speed up your coding. Check documentations for these functions: sortPerGroup, unlistGrl, strandBool, tile1.

3.1 Grouping ORFs

There are 2 main ways of grouping ORFs. Sometimes you want all ORFs grouped by which transcript they came from, or you might want each ORF as a group in the GRangesList. To do this more easily you can use the function groupGRangesBy.

if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # The ORFs are now grouped by transcripts, but we want them grouped by ORFs:
  # we use the orfs exon column called ($names) to group, it is made by ORFik.
  unlisted_ranges <- unlistGrl(fiveUTR_ORFs)
  test_ranges <- groupGRangesBy(unlisted_ranges, unlisted_ranges$names)
  print("Grouped by ORF")
  print(test_ranges[1:2])
  # the orfs are now grouped by orfs. If we want to go back to transcripts we do:
  unlisted_ranges <- unlistGrl(test_ranges)
  test_ranges <- groupGRangesBy(unlisted_ranges) # <- defualt is tx grouping by names
  print("Grouped by Transcript")
  print(test_ranges)
}

## [1] "Grouped by ORF"
## GRangesList object of length 2:
## $uc010ogz.1_1
## GRanges object with 2 ranges and 1 metadata column:
##              seqnames            ranges strand |        names
##                 <Rle>         <IRanges>  <Rle> |  <character>
##   uc010ogz.1     chr1 32671314-32671324      + | uc010ogz.1_1
##   uc010ogz.1     chr1 32671755-32671902      + | uc010ogz.1_1
##   -------
##   seqinfo: 93 sequences from an unspecified genome; no seqlengths
## 
## $uc010ogz.1_2
## GRanges object with 1 range and 1 metadata column:
##              seqnames            ranges strand |        names
##                 <Rle>         <IRanges>  <Rle> |  <character>
##   uc010ogz.1     chr1 32672038-32672076      + | uc010ogz.1_2
##   -------
##   seqinfo: 93 sequences from an unspecified genome; no seqlengths
## 
## [1] "Grouped by Transcript"
## GRangesList object of length 127:
## $uc010ogz.1
## GRanges object with 10 ranges and 1 metadata column:
##              seqnames            ranges strand |        names
##                 <Rle>         <IRanges>  <Rle> |  <character>
##   uc010ogz.1     chr1 32671314-32671324      + | uc010ogz.1_1
##   uc010ogz.1     chr1 32671755-32671902      + | uc010ogz.1_1
##   uc010ogz.1     chr1 32672038-32672076      + | uc010ogz.1_2
##   uc010ogz.1     chr1 32671237-32671324      + | uc010ogz.1_3
##   uc010ogz.1     chr1 32671755-32671807      + | uc010ogz.1_3
##   uc010ogz.1     chr1 32671934-32672044      + | uc010ogz.1_4
##   uc010ogz.1     chr1 32672048-32672152      + | uc010ogz.1_5
##   uc010ogz.1     chr1 32671274-32671324      + | uc010ogz.1_6
##   uc010ogz.1     chr1 32671755-32671913      + | uc010ogz.1_6
##   uc010ogz.1     chr1 32672034-32672192      + | uc010ogz.1_7
##   -------
##   seqinfo: 93 sequences from an unspecified genome; no seqlengths
## 
## ...
## <126 more elements>

3.2 Filtering example

Lets say you found some ORFs, and you want to filter out some of them. ORFik provides several functions for filtering. A problem with the original GenomicRanges container, is that filtering on GRanges objects are much easier than on GRangesList objects, ORFik tries to fix this.

In this example we will filter out all orfs as following:

1. First group GRangesList by ORFs
1. width < 60
1. number of exons < 2
1. strand is negative

if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # lets use the fiveUTR_ORFs
  #1. Group by ORFs, if ORFs are grouped by transcripts it would make no sense.
  unlisted_ranges <- unlistGrl(fiveUTR_ORFs)
  ORFs <- groupGRangesBy(unlisted_ranges, unlisted_ranges$names)
  print(length(ORFs))
  #2. Remove widths < 60
  ORFs <- ORFs[widthPerGroup(ORFs) >= 60]
  print(length(ORFs))
  #3. Keep only ORFs with at least 2 exons
  ORFs <- ORFs[numExonsPerGroup(ORFs) > 1]
  print(length(ORFs))
  
  #4. Keep only positive ORFs
  ORFs <- ORFs[strandPerGroup(ORFs) == "+"]
  # all remaining ORFs where on positive strand, so no change
  length(ORFs)
}

## [1] 839
## [1] 426
## [1] 120

## [1] 120

3.3 ORF interest regions

Specific part of the ORF are usually of interest, like start and stop codons. Here we run an example to show what ORFik can do for you.

if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # let's use the ORFs from the previous examples
  #1. Find the start and stop sites as GRanges
  startSites(fiveUTR_ORFs, asGR = TRUE, keep.names = TRUE, is.sorted = TRUE)
  stopSites(fiveUTR_ORFs, asGR = TRUE, keep.names = TRUE, is.sorted = TRUE)
  
  #2. Lets find the start and stop codons,
  # this takes care of potential 1 base exons etc. 
  starts <- startCodons(fiveUTR_ORFs, is.sorted = TRUE)
  stops <- stopCodons(fiveUTR_ORFs, is.sorted = TRUE)
  print("Start codon ranges:")
  print(starts[1:2])
  
  #3. Lets get the bases of the start and stop codons from the fasta file
  # It's very important to check that ORFs are sorted here, set is.sorted to 
  # FALSE if you are not certain if the exons are sorted.
  txSeqsFromFa(starts, BSgenome.Hsapiens.UCSC.hg19::Hsapiens, is.sorted = TRUE)
  print("Stop codons")
  txSeqsFromFa(stops, BSgenome.Hsapiens.UCSC.hg19::Hsapiens, is.sorted = TRUE)
  }

## [1] "Start codon ranges:"
## GRangesList object of length 2:
## $uc010ogz.1_1
## GRanges object with 1 range and 1 metadata column:
##              seqnames            ranges strand |        names
##                 <Rle>         <IRanges>  <Rle> |  <character>
##   uc010ogz.1     chr1 32671314-32671316      + | uc010ogz.1_1
##   -------
##   seqinfo: 93 sequences from an unspecified genome; no seqlengths
## 
## $uc010ogz.1_2
## GRanges object with 1 range and 1 metadata column:
##              seqnames            ranges strand |        names
##                 <Rle>         <IRanges>  <Rle> |  <character>
##   uc010ogz.1     chr1 32672038-32672040      + | uc010ogz.1_2
##   -------
##   seqinfo: 93 sequences from an unspecified genome; no seqlengths
## 
## [1] "Stop codons"

##   A DNAStringSet instance of length 839
##       width seq                                             names               
##   [1]     3 TAG                                             uc010ogz.1_1
##   [2]     3 TAG                                             uc010ogz.1_2
##   [3]     3 TGA                                             uc010ogz.1_3
##   [4]     3 TAG                                             uc010ogz.1_4
##   [5]     3 TAG                                             uc010ogz.1_5
##   ...   ... ...
## [835]     3 TAG                                             uc011jox.1_3
## [836]     3 TAG                                             uc011jox.1_4
## [837]     3 TGA                                             uc011jox.1_5
## [838]     3 TGA                                             uc011jox.1_6
## [839]     3 TAG                                             uc011jox.1_7

Many more operations are also supported for manipulation of ORFs

4 When to use which ORFfinding function

ORFik supports multiple ORF finding functions, here we describe their specific use.

Which function you will use depend on which organism the data is from, and specific parameters, like circular or non circular genomes, will you use a transcriptome etc.

There are 4 standard ways of finding ORFs:

1. You have some fasta file of the genome only. (For prokaryotes/circular genomes)
1. You have some fasta file of the genome and a spliced transcriptome annotation. (For eucaryotes with splicing)
1. You have a fasta file of transcripts (eucaryotes or prokaryotes)
1. You have a vector of transcripts preloaded in R.

Let’s start with the simplest case, a vector of preloaded transcripts.

Lets say you have some transcripts and want to find all ORFs on them. findORFs() will give you only 5’ to 3’ direction, so if you want both directions, you can do (for strands in both direction):

  library(Biostrings)
  library(S4Vectors)
  # strand with ORFs in both directions
  seqs <- DNAStringSet("ATGAAATGAAGTAAATCAAAACAT")
  ######################>######################< (< > is direction of ORF)
  
  # positive strands
  pos <- findORFs(seqs, startCodon = "ATG", minimumLength = 0)
  # negative strands
  neg <- findORFs(reverseComplement(seqs),
                  startCodon = "ATG", minimumLength = 0)
  # make GRanges since we want strand information
  pos <- GRanges(pos, strand = "+")
  neg <- GRanges(neg, strand = "-")
  # as GRanges
  res <- c(pos, neg)
  # or merge together and make GRangesList
  res <- split(res, seq.int(1, length(pos) + length(neg)))
  res

## GRangesList object of length 3:
## $`1`
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        1       1-9      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $`2`
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        1      6-14      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $`3`
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        1       1-9      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Remember that these results are in transcript coordinates, sometimes you need to convert them to Genomic coordinates.

4.1 Finding ORFs in spliced transcripts

If you have a genome and a spliced transcriptome annotation, you must use findMapORFs(). It takes care of the potential problem from the last example, that we really want our result in genomic coordinates in the end.

4.2 Prokaryote/Circular Genomes and fasta transcriptomes.

Use findORFsFasta(is.circular = TRUE). Note that findORFsFasta automaticly finds (-) strand ORFs. Since that is normally used for genomes.

4.3 Filter on strand

If you have fasta transcriptomes, you dont want the (-) strand. Since all transcripts are in the direction in the fasta file. If you get both (+/-) strand and only want (+) ORFs, do:

  res[strandBool(res)] # Keep only + stranded ORFs

## GRangesList object of length 2:
## $`1`
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        1       1-9      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $`2`
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]        1      6-14      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

See individual functions for more examples.

5 CageSeq data for 5’ UTR re-annotation

In the prerevious example we used the refence annotation of the 5’ UTRs from the package GenomicFeatures. Here we will use advantage of CageSeq data to set new Transcription Start Sites (TSS) and re-annotate 5’ UTRs. This is useful to improve tissue specific transcripts.

# path to example CageSeq data from hg19 heart sample
cageData <- system.file("extdata", "cage-seq-heart.bed.bgz", 
                        package = "ORFik")
# get new Transcription Start Sites using CageSeq dataset
newFiveUTRs <- reassignTSSbyCage(fiveUTRs, cageData)
newFiveUTRs

## GRangesList object of length 150:
## $uc001bum.2
## GRanges object with 1 range and 3 metadata columns:
##              seqnames            ranges strand |   exon_id   exon_name
##                 <Rle>         <IRanges>  <Rle> | <integer> <character>
##   uc001bum.2     chr1 32671236-32671282      + |         1        <NA>
##              exon_rank
##              <integer>
##   uc001bum.2         1
##   -------
##   seqinfo: 93 sequences from an unspecified genome; no seqlengths
## 
## $uc009vua.2
## GRanges object with 1 range and 3 metadata columns:
##              seqnames            ranges strand |   exon_id   exon_name
##                 <Rle>         <IRanges>  <Rle> | <integer> <character>
##   uc009vua.2     chr1 32671236-32671282      + |         2        <NA>
##              exon_rank
##              <integer>
##   uc009vua.2         1
##   -------
##   seqinfo: 93 sequences from an unspecified genome; no seqlengths
## 
## $uc010ogz.1
## GRanges object with 2 ranges and 3 metadata columns:
##              seqnames            ranges strand |   exon_id   exon_name
##                 <Rle>         <IRanges>  <Rle> | <integer> <character>
##   uc010ogz.1     chr1 32671236-32671324      + |         1        <NA>
##   uc010ogz.1     chr1 32671755-32672223      + |         4        <NA>
##              exon_rank
##              <integer>
##   uc010ogz.1         1
##   uc010ogz.1         2
##   -------
##   seqinfo: 93 sequences from an unspecified genome; no seqlengths
## 
## ...
## <147 more elements>

You will now see that most of the transcription start sites have changed. Depending on the species, regular annotations might be incomplete or not specific enough for your purposes.

NOTE: IF you want to edit the whole txdb / gtf file, use reassignTxDbByCage. And save this to get the new gtf with reannotated leaders by CAGE.

6 RiboSeq footprints automatic shift detection and shifting

In RiboSeq data ribosomal footprints are restricted to their p-site positions and shifted with respect to the shifts visible over the start and stop codons. ORFik has multiple functions for processing of RiboSeq data. We will go through an example processing of RiboSeq data below.

Load example raw RiboSeq footprints (unshifted).

bam_file <- system.file("extdata", "ribo-seq.bam", package = "ORFik")
footprints <- readBam(bam_file)

Investigate what footprint lengths are present in our data.

table(readWidths(footprints))

## 
##   28   29   30 
## 5547 5576 5526

Lets look at how the reads distribute around the CDS per read length.

For that we need to prepare the transcriptome annotation.

gtf_file <- system.file("extdata", "annotations.gtf", package = "ORFik")
txdb <- loadTxdb(gtf_file)
tx <- exonsBy(txdb, by = "tx", use.names = TRUE)
cds <- cdsBy(txdb, by = "tx", use.names = TRUE)
trailers <- threeUTRsByTranscript(txdb, use.names = TRUE)
cds[1]

## GRangesList object of length 1:
## $ENSDART00000032996
## GRanges object with 4 ranges and 3 metadata columns:
##       seqnames            ranges strand |    cds_id    cds_name exon_rank
##          <Rle>         <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr8 24067789-24067843      + |         1        <NA>         2
##   [2]     chr8 24068170-24068247      + |         2        <NA>         3
##   [3]     chr8 24068353-24068449      + |         4        <NA>         4
##   [4]     chr8 24068538-24068661      + |         6        <NA>         5
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Restrict footprints to their 5’ starts (after shifting it will be a p-site).

footprintsGR <- convertToOneBasedRanges(footprints, addSizeColumn = TRUE)
footprintsGR

## GRanges object with 16649 ranges and 1 metadata column:
##           seqnames    ranges strand |      size
##              <Rle> <IRanges>  <Rle> | <integer>
##       [1]    chr23  17599156      - |        28
##       [2]    chr23  17599156      - |        28
##       [3]    chr23  17599156      - |        28
##       [4]    chr23  17599156      - |        28
##       [5]    chr23  17599156      - |        28
##       ...      ...       ...    ... .       ...
##   [16645]     chr8  24068894      + |        29
##   [16646]     chr8  24068907      + |        28
##   [16647]     chr8  24068919      + |        30
##   [16648]     chr8  24068919      + |        30
##   [16649]     chr8  24068939      + |        30
##   -------
##   seqinfo: 1133 sequences from an unspecified genome

In the figure below we see why we need to p-shift, see that per length the start of the read are in different positions relative to the CDS start site. The reads create a ladder going downwards, left to right. (see the blue steps)

  hitMap <- windowPerReadLength(cds, tx,  footprintsGR, pShifted = FALSE)
  coverageHeatMap(hitMap, scoring = "transcriptNormalized")

Now lets see how we can p-shift the reads, we will go into detail how this is done. If you just want to run the function, without too much details, skip down to after the 2 comming bar plots.

For the sake of this example we will focus only on most abundant length of 29.

footprints <- footprints[readWidths(footprints) == 29]
footprintsGR <- footprintsGR[readWidths(footprintsGR) == 29]
footprints

## GAlignments object with 5576 alignments and 0 metadata columns:
##          seqnames strand       cigar    qwidth     start       end     width
##             <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##      [1]    chr23      -         29M        29  17599129  17599157        29
##      [2]    chr23      -         29M        29  17599129  17599157        29
##      [3]    chr23      -         29M        29  17599129  17599157        29
##      [4]    chr23      -         29M        29  17599129  17599157        29
##      [5]    chr23      -         29M        29  17599129  17599157        29
##      ...      ...    ...         ...       ...       ...       ...       ...
##   [5572]     chr8      +         29M        29  24068755  24068783        29
##   [5573]     chr8      +         29M        29  24068755  24068783        29
##   [5574]     chr8      +         29M        29  24068769  24068797        29
##   [5575]     chr8      +         29M        29  24068802  24068830        29
##   [5576]     chr8      +         29M        29  24068894  24068922        29
##              njunc
##          <integer>
##      [1]         0
##      [2]         0
##      [3]         0
##      [4]         0
##      [5]         0
##      ...       ...
##   [5572]         0
##   [5573]         0
##   [5574]         0
##   [5575]         0
##   [5576]         0
##   -------
##   seqinfo: 1133 sequences from an unspecified genome

Filter the cds annotation to only those that have some minimum trailer and leader lengths, as well as cds. Then get start and stop codons with extra window of 30bp around them.

txNames <- filterTranscripts(txdb) # <- get only transcripts that pass filter
tx <- tx[txNames]; cds <- cds[txNames]; trailers <- trailers[txNames];
windowsStart <- startRegion(cds[txNames], tx, TRUE, upstream = 30, 29)
windowsStop <- startRegion(trailers, tx, TRUE, upstream = 30, 29)
windowsStart

## GRangesList object of length 2:
## $ENSDART00000000070
## GRanges object with 1 range and 0 metadata columns:
##                      seqnames            ranges strand
##                         <Rle>         <IRanges>  <Rle>
##   ENSDART00000000070    chr24 22711351-22711410      -
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
## 
## $ENSDART00000032996
## GRanges object with 1 range and 0 metadata columns:
##                      seqnames            ranges strand
##                         <Rle>         <IRanges>  <Rle>
##   ENSDART00000032996     chr8 24067759-24067818      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Calculate meta-coverage over start and stop windowed regions.

hitMapStart <- metaWindow(footprintsGR, windowsStart, withFrames = TRUE)
hitMapStop <- metaWindow(footprintsGR, windowsStop, withFrames = TRUE)

Plot start/stop windows for length 29.

  pSitePlot(hitMapStart)

  pSitePlot(hitMapStop, region = "stop")

From these shifts ORFik uses a fourior transform to detect signal change needed to scale all read lengths of Ribo-seq to the start of the meta-cds.

We can also use automatic detection of RiboSeq shifts using the code below. As we can see reasonable conclusion from the plots would be to shift length 29 by 12, it is in agreement with the automatic detection of the offsets.

shifts <- detectRibosomeShifts(footprints, txdb, stop = TRUE)
shifts

##    fraction offsets_start offsets_stop
## 1:       29           -12          -12

Fortunately ORFik has function that can be used to shift footprints using desired shifts. See documentation for more details.

shiftedFootprints <- shiftFootprints(footprints, shifts)

## Sorting shifted footprints...

shiftedFootprints

## GRanges object with 5576 ranges and 1 metadata column:
##          seqnames    ranges strand |      size
##             <Rle> <IRanges>  <Rle> | <integer>
##      [1]     chr8  24066297      + |        29
##      [2]     chr8  24066297      + |        29
##      [3]     chr8  24066297      + |        29
##      [4]     chr8  24066297      + |        29
##      [5]     chr8  24066330      + |        29
##      ...      ...       ...    ... .       ...
##   [5572]    chr24  22711491      - |        29
##   [5573]    chr24  22711503      - |        29
##   [5574]    chr24  22711503      - |        29
##   [5575]    chr24  22711503      - |        29
##   [5576]    chr24  22711503      - |        29
##   -------
##   seqinfo: 1133 sequences from an unspecified genome; no seqlengths

7 Gene identity functions for ORFs or genes

ORFik contains functions of gene identity that can be used to predict which ORFs are potentially coding and functional.

There are 2 main categories:

Sequence features (kozak, gc-content, etc.)
Read features (reads as: Ribo-seq, RNA-seq, TCP-seq, shape-seq etc)
FLOSS floss
coverage coverage
ORFscore orfScore
entropy entropy
translational effiency translationalEff
inside outside score insideOutsideScore
distance between orfs and cds’ distToCds
other

All of the features are implemented based on scientific article published in peer reviewed journal. ORFik supports seemingles calculation of all available features. See example below.

 if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  library(GenomicFeatures)

  # Extract sequences of fiveUTRs.
  fiveUTRs <- fiveUTRs[1:10]
  faFile <- BSgenome.Hsapiens.UCSC.hg19::Hsapiens
  tx_seqs <- extractTranscriptSeqs(faFile, fiveUTRs)

  # Find all ORFs on those transcripts and get their genomic coordinates
  fiveUTR_ORFs <- findMapORFs(fiveUTRs, tx_seqs)
  unlistedORFs <- unlistGrl(fiveUTR_ORFs)
  # group GRanges by ORFs instead of Transcripts, use 4 first ORFs
  fiveUTR_ORFs <- groupGRangesBy(unlistedORFs, unlistedORFs$names)[1:4]

  # make some toy ribo seq and rna seq data
  starts <- unlist(ORFik:::firstExonPerGroup(fiveUTR_ORFs), use.names = FALSE)
  RFP <- promoters(starts, upstream = 0, downstream = 1)
  score(RFP) <- rep(29, length(RFP)) # the original read widths

  # set RNA seq to duplicate transcripts
  RNA <- unlist(exonsBy(txdb, by = "tx", use.names = TRUE), use.names = TRUE)
  # transcript database
  txdb <- loadTxdb(txdbFile)
  dt <- computeFeatures(fiveUTR_ORFs, RFP, RNA, txdb, faFile, 
                        orfFeatures =  TRUE)
  dt
}

##    countRFP  te fpkmRFP fpkmRNA floss entropyRFP disengagementScores       RRS
## 1:        1 Inf 1572327       0     0  0.0000000           0.6666667  7.610063
## 2:        1 Inf 6410256       0     0  0.0000000           2.0000000 31.025641
## 3:        2 Inf 3546099       0     0  0.1800313           1.0000000 12.872340
## 4:        2 Inf 4504505       0     0  0.1919587           3.0000000 16.351351
##         RSS fractionLengths ORFScores ioScore     kozak distORFCDS inFrameCDS
## 1: 26.50000      0.07022968  5.882643    0.40 0.3390461        322          0
## 2:  6.50000      0.01722615  5.882643    0.40 0.1949422        148          0
## 3: 15.66667      0.06227915 -4.906891    0.75 0.0000000        417          2
## 4: 12.33333      0.04902827 -4.906891    0.75 0.7079892        180          2
##    isOverlappingCds rankInTx
## 1:            FALSE        1
## 2:            FALSE        2
## 3:            FALSE        3
## 4:            FALSE        4

You will now get a data.table with one column per score, the columns are named after the different scores, you can now go further with prediction, or making plots.

8 Calculating Kozak sequence score for ORFs

Instead of getting all features, we can also extract single features.

To understand how strong the binding affinitity of an ORF promoter region might be, we can use kozak sequence score. The kozak functions supports several species. In the first example we use human kozak sequence, then we make a self defined kozak sequence.

  # In this example we will find kozak score of cds' 

  if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {

    cds <- cdsBy(txdb, by = "tx", use.names = TRUE)[1:10]
    tx <- exonsBy(txdb, by = "tx", use.names = TRUE)[names(cds)]
    faFile <- BSgenome.Hsapiens.UCSC.hg19::Hsapiens

    kozakSequenceScore(cds, tx, faFile, species = "human")

    # A few species are pre supported, if not, make your own input pfm.

    # here is an example where the human pfm is sent in again, even though
    # it is already supported.

    pfm <- t(matrix(as.integer(c(29,26,28,26,22,35,62,39,28,24,27,17,
                                 21,26,24,16,28,32,5,23,35,12,42,21,
                                 25,24,22,33,22,19,28,17,27,47,16,34,
                                 25,24,26,25,28,14,5,21,10,17,15,28)),
                    ncol = 4))

   kozakSequenceScore(cds, tx, faFile, species = pfm)

  }

##  [1] 0.5531961 0.5531961 0.6652532 0.6925763 0.6370870 0.6370870 0.4854677
##  [8] 0.4706279 0.6381237 0.6529909

9 Using ORFik in your package or scripts

The focus of ORFik for development is to be a swiss army knife for transcriptomics. If you need functions for splicing, getting windows of exons per transcript, periodic windows of exons, speicific parts of exons etc, ORFik can help you with this.

Let’s do an example where ORFik shines. Objective: We have three transcripts, we also have a library of Ribo-seq. This library was treated with cyclohexamide, so we know Ribo-seq reads can stack up close to the stop codon of the CDS. Lets say we only want to keep transcripts, where the cds stop region (defined as last 9 bases of cds), has maximum 33% of the reads. To only keep transcripts with a good spread of reads over the CDS. How would you make this filter ?

  # First make som toy example
  cds <- GRanges("chr1", IRanges(c(1, 10, 20, 30, 40, 50, 60, 70, 80),
                                 c(5, 15, 25, 35, 45, 55, 65, 75, 85)),
                 "+")
  names(cds) <- c(rep("tx1", 3), rep("tx2", 3), rep("tx3", 3))
  cds <- groupGRangesBy(cds)
  ribo <- GRanges("chr1", c(1, rep.int(23, 4), 30, 34, 34, 43, 60, 64, 71, 74),
                  "+")
  # We could do a simplification and use the ORFik entropy function
  entropy(cds, ribo) # <- spread of reads

## [1] 0.3270264 0.5802792 0.7737056

We see that ORF 1, has a low(bad) entropy, but we do not know where the reads are stacked up. So lets make a new filter by using more ORFiks utility functions:

tile <- tile1(cds, FALSE, FALSE) # tile them to 1 based positions
tails <- tails(tile, 9) # get 9 last bases per cds
stopOverlap <- countOverlaps(tails, ribo)
allOverlap <- countOverlaps(cds, ribo)
fractions <- (stopOverlap + 1) / (allOverlap + 1) # pseudocount 1
cdsToRemove <- fractions > 1 / 2 # filter with pseudocounts (1+1)/(3+1) 
cdsToRemove

##   tx1   tx2   tx3 
##  TRUE FALSE FALSE

We now easily made a stop codon filter for our coding sequences.

10 Coverage plots made easy with ORFik

In investigation of ORFs or other interest regions, ORFik can help you make some coverage plots from reads of Ribo-seq, RNA-seq, CAGE-seq, TCP-seq etc.

Lets make 3 plots of Ribo-seq focused on CDS regions.

if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # Load data as shown before and pshift the Ribo-seq
  # Get the annotation
  txdb <- loadTxdb(gtf_file)
  # Lets take all valid transcripts, with size restrictions:
  # leader > 100 bases, cds > 100 bases, trailer > 100 bases
  txNames <- filterTranscripts(txdb, 100, 100, 100) # valid transcripts
  leaders = fiveUTRsByTranscript(txdb, use.names = TRUE)[txNames]
  cds <- cdsBy(txdb, "tx", use.names = TRUE)[txNames]
  trailers = threeUTRsByTranscript(txdb, use.names = TRUE)[txNames]
  tx <- exonsBy(txdb, by = "tx", use.names = TRUE)
  # Ribo-seq
  bam_file <- system.file("extdata", "ribo-seq.bam", package = "ORFik")
  reads <- readGAlignments(bam_file)
  shiftedReads <- shiftFootprints(reads, detectRibosomeShifts(reads, txdb))  
}

if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  library(data.table)
  # Create meta coverage per part of transcript
  leaderCov <- metaWindow(shiftedReads, leaders, scoring = NULL, 
                          feature = "leaders")

  cdsCov <- metaWindow(shiftedReads, cds, scoring = NULL, 
                       feature = "cds")
  
  trailerCov <- metaWindow(shiftedReads, trailers, scoring = NULL, 
                           feature = "trailers")
  # bind together
  dt <- rbindlist(list(leaderCov, cdsCov, trailerCov))
  # Now set info column
  dt[, `:=` (fraction = "Ribo-seq")]
  # NOTE: All of this is done in one line in function: windowPerTranscript
  
  # zscore gives shape, a good starting plot
  windowCoveragePlot(dt, scoring = "zscore", title = "Ribo-seq metaplot") 
}

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:GenomicAlignments':
## 
##     first, last, second

## The following object is masked from 'package:SummarizedExperiment':
## 
##     shift

## The following object is masked from 'package:GenomicRanges':
## 
##     shift

## The following object is masked from 'package:IRanges':
## 
##     shift

## The following objects are masked from 'package:S4Vectors':
## 
##     first, second

Z-score is good at showing overall shape. You see from the windows each region; leader, cds and trailer is scaled to 100. Lets use a median scoring to find median counts per meta window per positions.

if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  windowCoveragePlot(dt, scoring = "median", title = "Ribo-seq metaplot") 
}

We see a big spike close to start of CDS, called the TIS. The median counts by transcript is close to 50 here. Lets look at the TIS region using the pshifting plot, seperated into the 3 frames.

if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # size 100 window: 50 upstream, 49 downstream of TIS
  windowsStart <- startRegion(cds, tx, TRUE, upstream = 50, 49)
  hitMapStart <- metaWindow(shiftedReads, windowsStart, withFrames = TRUE)
  pSitePlot(hitMapStart, length = "meta coverage")
}

Since these reads are p-shifted it is not that unexpected that the maximum number of reads are on the 0 position. We also see a clear pattern in the Ribo-seq.

To see how the different read lengths distribute over the region, we make a heatmap. Where the colors represent the zscore of counts per position.

if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # size 25 window (default): 5 upstream, 20 downstream of TIS
  hitMap <- windowPerReadLength(cds, tx,  shiftedReads)
  coverageHeatMap(hitMap, addFracPlot = TRUE)
}

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]

In the heatmap you can see that read length 30 has the strongest peak on the TIS, while read length 28 has some reads in the leaders (the minus positions).

10.1 Multiple data sets in one plot

Often you have multiple data sets you want to compare (like ribo-seq).

ORFik has an extensive syntax for automatic grouping of data sets in plots.

The protocol is: 1. Load all data sets 2. Create a merged coverage data.table 3. Pass it into the plot you want.

Here is an easy example:

if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # Load more files like above (Here I make sampled data from earlier Ribo-seq)
  dt2 <- copy(dt)
  dt2[, `:=` (fraction = "Ribo-seq2")]
  dt2$score <- dt2$score + sample(seq(-40, 40), nrow(dt2), replace = TRUE)
  
  dtl <- rbindlist(list(dt, dt2))
  windowCoveragePlot(dtl, scoring = "median", title = "Ribo-seq metaplots") 
}

You see that the fraction column is what seperates the rows. You can have unlimited datasets joined in this way.

Our hope is that by using ORFik, we can simplify your analysis when you focus on ORFs / transcript features and especially in combination with sequence libraries like RNA-seq and Ribo-seq etc.

ORFik Overview

6 February 2020

Package

Contents