1 Overview

Spaced Words Projection (SWeeP) is a method for representing biological sequences as compact vectors. The sequences are scanned by a reading mask that corresponds to a given k-mer. This reading generates a high-dimensional vector (HDV), which is reduced by random projection to a low-dimensional vector (LDV) while preserving the relative distances between the high-dimensional vectors (and therefore the information they carry).
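
As a rough illustration of the reading step described above, the following base R sketch slides a mask along a toy DNA sequence, collects the words it reads, and stores their counts as an HDV. The helper name masked_hdv, the toy sequence, and the 0/1 encoding of the mask are assumptions made for this example; it is not the rSWeeP implementation.

```r
# Toy sketch only: masked_hdv is a hypothetical helper, not an rSWeeP function.
masked_hdv <- function(s, mask, alphabet = c("A", "C", "G", "T")) {
  chars <- strsplit(s, "")[[1]]
  keep  <- which(mask == 1)                 # positions read by the mask
  n     <- length(chars) - length(mask) + 1 # number of mask placements
  stopifnot(n >= 1)
  # slide the mask along the sequence and collect the words it reads
  words <- vapply(seq_len(n), function(i) {
    paste(chars[i - 1 + keep], collapse = "")
  }, character(1))
  # every possible word is one coordinate of the HDV
  grid <- expand.grid(rep(list(alphabet), length(keep)), stringsAsFactors = FALSE)
  all_words <- unname(apply(grid, 1, paste, collapse = ""))
  counts <- table(factor(words, levels = all_words))
  setNames(as.numeric(counts), all_words)
}

hdv <- masked_hdv("ACGTAC", mask = c(1, 0, 1))
length(hdv)    # 4^2 = 16 coordinates for two read positions
hdv[hdv > 0]   # words observed in the toy sequence
```

Under this counting scheme, an amino acid alphabet of 20 letters read at four positions would give 20^4 = 160000 coordinates, which suggests why a reduction to a low-dimensional vector is useful.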

The SWeeP function is the original implementation of the method as in Pierri et al. (2019), which uses a random orthonormal basis generated by the orthBase function. The SWeePlite function is optimized for processing with larger masks (larger k-mer) and for larger volumes of data (long sequences or large numbers of sequences).
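
The idea of projecting with a random orthonormal basis can be sketched in base R as follows. This is a minimal sketch on simulated data: the basis is built here with a QR decomposition of a Gaussian matrix, which is an assumption of this example and not necessarily how orthBase constructs its matrix, and the dimensions D, d and n are arbitrary toy values.

```r
# Sketch under stated assumptions: QR-based basis and simulated data,
# not the orthBase/SWeeP implementation.
set.seed(123)
D <- 2000   # dimension of the toy HDVs
d <- 200    # dimension of the toy LDVs
n <- 12     # number of toy samples

# three groups of samples, so that pairwise distances vary
centers <- matrix(rnorm(3 * D, sd = 5), nrow = 3)
hdv <- centers[rep(1:3, each = 4), ] + matrix(rnorm(n * D), nrow = n)

# D x d matrix with orthonormal columns, taken from a QR decomposition
basis <- qr.Q(qr(matrix(rnorm(D * d), nrow = D)))
ldv <- hdv %*% basis   # each row is now a d-dimensional vector

# compare pairwise distances before and after the projection
r <- cor(as.vector(dist(hdv)), as.vector(dist(ldv)))
print(r)
```

A correlation close to 1 indicates that the relative distances between samples are largely retained after the reduction, even though the absolute distances shrink.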

These functions are suitable for making high-quality comparisons between sequences, allowing analyses that would otherwise not be possible because of the computational limitations of traditional techniques. The MATLAB version of the method is available at sWeeP (PIERRI, 2019). The main speed gain of this tool comes from its constant processing time: the response time grows linearly with the number of inputs, whereas in other methods it grows exponentially.

Tutorials and more information are available at https://aibialab.github.io/rSWeeP.

1.1 Functions

The package has four functions:

  • SWeeP: the conventional SWeeP function, which requires an orthonormal basis that is either generated or loaded from a file.
  • orthBase: function to generate the orthonormal matrix to be used by SWeeP in the projection.
  • SWeePlite: SWeeP function with a built-in orthonormal basis. It is slower but uses less RAM.
  • extractHDV: extracts only the High Dimensional Vector (HDV) from biological sequences.

The SWeeP and SWeePlite functions are responsible for vectorizing the sequences using the referenced SWeeP method (PIERRI, 2019).

2 Quick Start

The page https://aibialab.github.io/rSWeeP provides basic tutorials on using and parameterizing the rSWeeP package. Below you can find a quickstart of the package.

Consider a set of 14 mitochondrial proteomes (translated CDSs) in FASTA format, stored in the folder whose address is given by path. These sequences can be vectorized as below:

library(rSWeeP)
## Loading required package: foreach
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
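# folder with the example FASTA files shipped with the package
path = paste (system.file("examples/aaMitochondrial/",package = "rSWeeP"),'/', sep = '')
# vectorize the amino acid sequences: mask of 4 contiguous positions, 1000 output coordinates
sw = SWeePlite(path,seqtype='AA',mask=c(4),psz=1000)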
## Starting projection. Please wait.
## starting sequence  1 of 14 - complete
## starting sequence  2 of 14 - complete
## starting sequence  3 of 14 - complete
## starting sequence  4 of 14 - complete
## starting sequence  5 of 14 - complete
## starting sequence  6 of 14 - complete
## starting sequence  7 of 14 - complete
## starting sequence  8 of 14 - complete
## starting sequence  9 of 14 - complete
## starting sequence  10 of 14 - complete
## starting sequence  11 of 14 - complete
## starting sequence  12 of 14 - complete
## starting sequence  13 of 14 - complete
## starting sequence  14 of 14 - complete

In sw$proj you will find the SWeeP vectors in matrix format, with 14 rows (one per sample) and 1000 columns (coordinates). Other processing information, which may be important in subsequent steps, is stored in sw$info.
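
Before moving on, it can be useful to confirm that the projection matrix and the stored information agree. The sketch below uses a hypothetical helper, check_projection, and a small mock object that only mimics the proj, info$headers and info$ProjectionSize fields shown in this section; with a real result, the helper would be called on sw instead.

```r
# Hypothetical helper (not part of rSWeeP): compares the matrix dimensions
# with the headers and projection size stored alongside it.
check_projection <- function(sw) {
  stopifnot(is.matrix(sw$proj))
  c(rows_match = nrow(sw$proj) == length(sw$info$headers),
    cols_match = ncol(sw$proj) == sw$info$ProjectionSize)
}

# mock object with the same field names as the output shown in this section
set.seed(1)
mock <- list(
  proj = matrix(rnorm(3 * 5), nrow = 3),
  info = list(headers = c("sample_a", "sample_b", "sample_c"),
              ProjectionSize = 5)
)

res <- check_projection(mock)
print(res)
```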

sw$info
## $headers
##  [1] "01_Pan_troglodytes"               "02_Capra_aegagrus"               
##  [3] "03_Homo_sapiens"                  "04_Bos_taurus"                   
##  [5] "05_Ara_ararauna"                  "06_Mus_musculus"                 
##  [7] "07_Brotogeris_cyanoptera"         "08_Homo_sapiens_neanderthalensis"
##  [9] "09_Gazella_gazella"               "10_Rattus_norvegicus"            
## [11] "11_Pan_paniscus"                  "12_Psittacara_rubritorquis"      
## [13] "13_Apodemus_sylvaticus"           "14_Rhazya_stricta"               
## 
## $ProjectionSize
## [1] 1000
## 
## $bin
## [1] "counting (FALSE)"
## 
## $mask
## [1] 1 1 1 1
## 
## $SequenceType
## [1] "AA"
## 
## $extension
## [1] ""
## 
## $version
## [1] '1.17.2'
## 
## $norm
## [1] "none"
## 
## $timeElapsed
## [1] 7.765

We can obtain the phylogenetic relationship between the vectorized organisms using the Neighbour Joining (NJ) method.

library(ape)
## 
## Attaching package: 'ape'
## The following object is masked from 'package:Biostrings':
## 
##     complement
# get the distance matrix
mdist = dist(sw$proj,method='euclidean')

# use the NJ algorithm to build the tree
tr = nj(mdist)
# root the tree in the plant sample
tr = root(tr,outgroup='14_Rhazya_stricta')

# plot
plot(tr)
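
The same steps can be tried on simulated data, which may help when checking the workflow without the example files. The sketch below assumes a toy matrix with row names standing in for sw$proj, and it also exports the tree as a Newick string with write.tree from ape; the sample names and dimensions are made up for illustration.

```r
library(ape)

# toy matrix standing in for sw$proj: 6 samples, 20 coordinates
set.seed(42)
toy <- matrix(rnorm(6 * 20), nrow = 6,
              dimnames = list(paste0("sample_", 1:6), NULL))
toy[6, ] <- toy[6, ] + 5          # make the last sample distant (outgroup)

# distance matrix, NJ tree, and rooting on the distant sample
tr_toy <- nj(dist(toy, method = "euclidean"))
tr_toy <- root(tr_toy, outgroup = "sample_6")

# with the default file = "", write.tree returns the Newick string
newick <- write.tree(tr_toy)
print(newick)
```

The tip labels are taken from the row names of the matrix passed to dist, which is why the outgroup can be referred to by name.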

To visualize the vectorized data graphically, we provide the metadata with the classes at the family taxonomic level.

pathmetadata <- system.file(package = "rSWeeP" , "examples" , "metadata_mitochondrial.csv")
mt = read.csv(pathmetadata,header=TRUE)

We obtain the PCA and visualize the first components.

pca_output <- prcomp (sw$proj , scale = FALSE)
par(mfrow=c(1,2))
plot(pca_output$x[,1],pca_output$x[,2],xlab = 'PC-1' , ylab = 'PC-2' , pch =20 , col = mt$id)
legend("bottomright",unique(mt$family),col=as.character(c(1:length(unique(mt$family)))),pch=20)
plot(pca_output$x[,3],pca_output$x[,4],xlab = 'PC-3' , ylab = 'PC-4' , pch =20 , col = mt$id)
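
To judge how much of the variation the plotted components capture, the proportion of variance can be computed from the prcomp output. The sketch below uses a simulated matrix and a mock metadata table with a family column (both assumptions for illustration), and it derives the colour codes from a factor so that points and legend labels share the same mapping.

```r
# simulated stand-ins for sw$proj and the metadata table
set.seed(7)
toy  <- matrix(rnorm(8 * 30), nrow = 8)
meta <- data.frame(family = rep(c("famA", "famB"), each = 4))

pca <- prcomp(toy, scale. = FALSE)

# proportion of the total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:4], 3))

# one integer colour code per sample, one legend entry per factor level
fam  <- factor(meta$family)
cols <- as.integer(fam)
legend_labels <- levels(fam)
legend_cols   <- seq_along(legend_labels)
```

The vectors cols, legend_labels and legend_cols could then be passed to the col argument of plot and to legend, respectively.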

3 Session information

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ape_5.8             rSWeeP_1.17.2       Biostrings_2.73.1  
##  [4] GenomeInfoDb_1.41.1 XVector_0.45.0      IRanges_2.39.2     
##  [7] S4Vectors_0.43.2    BiocGenerics_0.51.0 doParallel_1.0.17  
## [10] iterators_1.0.14    foreach_1.5.2       BiocStyle_2.33.1   
## 
## loaded via a namespace (and not attached):
##  [1] jsonlite_1.8.8          highr_0.11              compiler_4.4.1         
##  [4] BiocManager_1.30.23     crayon_1.5.3            tinytex_0.52           
##  [7] Rcpp_1.0.13             magick_2.8.4            jquerylib_0.1.4        
## [10] yaml_2.3.10             fastmap_1.2.0           lattice_0.22-6         
## [13] R6_2.5.1                knitr_1.48              bookdown_0.40          
## [16] GenomeInfoDbData_1.2.12 bslib_0.8.0             rlang_1.1.4            
## [19] cachem_1.1.0            stringi_1.8.4           xfun_0.46              
## [22] sass_0.4.9              cli_3.6.3               magrittr_2.0.3         
## [25] zlibbioc_1.51.1         grid_4.4.1              digest_0.6.36          
## [28] nlme_3.1-165            lifecycle_1.0.4         evaluate_0.24.0        
## [31] codetools_0.2-20        rmarkdown_2.27          httr_1.4.7             
## [34] tools_4.4.1             htmltools_0.5.8.1       UCSC.utils_1.1.0

4 References
