BiocNeighbors 1.23.0
The BiocNeighbors package provides two algorithms for approximate nearest-neighbor searches: Annoy (Approximate Nearest Neighbors Oh Yeah) and HNSW (hierarchical navigable small worlds).
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 9029 8027 2917 3095 1741 993 46 9150 6089 9519
## [2,] 1547 4800 6860 2301 5008 6184 3439 506 7117 5209
## [3,] 9602 6522 6694 2036 4976 6993 660 467 3733 1447
## [4,] 7477 4276 1108 9859 5371 9573 6651 768 152 8384
## [5,] 6578 2381 3130 969 8100 9944 8742 2739 7167 2018
## [6,] 9823 4219 8196 1573 2633 9070 6223 1259 2742 317
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9087563 0.9343576 0.9347140 1.0349810 1.0577810 1.0673149 1.1129901
## [2,] 0.8973757 0.9503361 0.9542246 0.9850744 1.0054473 1.0234799 1.0377213
## [3,] 0.9285746 0.9432272 0.9434503 0.9592357 0.9711123 0.9745569 0.9797416
## [4,] 0.7483439 0.9228959 0.9531658 0.9778813 1.0250983 1.0392928 1.0474914
## [5,] 0.9112917 0.9257460 0.9629084 0.9707109 0.9809361 1.0048245 1.0293660
## [6,] 0.9647995 0.9920351 1.0091248 1.0317351 1.0323133 1.0497558 1.0500126
## [,8] [,9] [,10]
## [1,] 1.1133852 1.1186256 1.1307123
## [2,] 1.0397594 1.0435059 1.0454148
## [3,] 0.9853271 0.9886528 0.9957427
## [4,] 1.0547801 1.0624410 1.0645828
## [5,] 1.0814184 1.0824766 1.0932744
## [6,] 1.0612317 1.0615163 1.0632228
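Because Annoy is approximate, it can be useful to check how often it recovers the true neighbors. The sketch below compares its results against an exact search with KmknnParam() on a smaller simulated matrix. The matrix size and the names small, annoy.out, exact.out and recall are illustrative choices, not part of the package.

```r
library(BiocNeighbors)

# Smaller simulated data set so that the exact search finishes quickly.
small <- matrix(runif(2000 * 20), ncol=20)

annoy.out <- findKNN(small, k=10, BNPARAM=AnnoyParam())
exact.out <- findKNN(small, k=10, BNPARAM=KmknnParam())

# For each point, the fraction of exact neighbors that Annoy also reported.
recall <- vapply(seq_len(nrow(small)), function(i) {
    mean(exact.out$index[i,] %in% annoy.out$index[i,])
}, numeric(1))

summary(recall)
```

A recall close to 1 suggests that the approximation is adequate for the data at hand; lower values may justify tuning the algorithm's parameters or reverting to an exact method.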
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 3663 6964 7208 8609 2757
## [2,] 5938 4834 5184 4502 1547
## [3,] 7779 2441 8523 5783 1266
## [4,] 1266 650 6017 5571 3787
## [5,] 663 3275 2695 3148 2146
## [6,] 4350 7500 102 6155 1944
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8507525 0.8554255 0.9079689 0.9630135 0.9758449
## [2,] 0.7649354 0.9067363 0.9191303 0.9256787 0.9549022
## [3,] 0.8744318 0.8982485 0.9080981 0.9415722 0.9811068
## [4,] 0.9711310 0.9856623 1.0184327 1.0205659 1.0536278
## [5,] 0.8746908 0.8758898 1.0192007 1.0312262 1.0363630
## [6,] 1.0116752 1.0203203 1.0655837 1.0705638 1.0816413
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
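As a minimal sketch of the HNSW equivalent, the calls below mirror the Annoy examples with only the BNPARAM argument changed. The simulated matrices pts and targets are placeholders introduced here for illustration.

```r
library(BiocNeighbors)

# Simulated reference and query points (placeholder data).
pts <- matrix(runif(1000 * 20), ncol=20)
targets <- matrix(runif(100 * 20), ncol=20)

# Same calls as for Annoy; only BNPARAM differs.
hout <- findKNN(pts, k=10, BNPARAM=HnswParam())
hqout <- queryKNN(pts, targets, k=5, BNPARAM=HnswParam())

# One row per point (or query point), one column per neighbor.
dim(hout$index)
dim(hqout$index)
```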
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the index once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
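The remaining options can be sketched in the same way. The example below assumes that subset, get.distance and BPPARAM behave for AnnoyParam() as they do for the exact methods. The matrix pts is placeholder data, and SerialParam() stands in for whichever BiocParallel backend suits the machine.

```r
library(BiocNeighbors)
library(BiocParallel)

pts <- matrix(runif(1000 * 20), ncol=20)

# Only report neighbors for the first 10 points.
sub.out <- findKNN(pts, k=5, subset=1:10, BNPARAM=AnnoyParam())
dim(sub.out$index)

# Skip the distances when only the neighbor identities are needed.
idx.out <- findKNN(pts, k=5, get.distance=FALSE, BNPARAM=AnnoyParam())
names(idx.out)

# Supply a BiocParallel backend; replace SerialParam() with, e.g.,
# MulticoreParam() or SnowParam() to use multiple workers.
par.out <- findKNN(pts, k=5, BPPARAM=SerialParam(), BNPARAM=AnnoyParam())
```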
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
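A short sketch of the Manhattan option follows, using placeholder data in pts. As a sanity check, the reported distance to the first neighbor of the first point is compared with a manual sum of absolute differences; the two values should agree up to floating-point precision.

```r
library(BiocNeighbors)

pts <- matrix(runif(1000 * 20), ncol=20)

# Same search, but ranking neighbors by Manhattan (L1) distance.
mout <- findKNN(pts, k=5, BNPARAM=AnnoyParam(distance="Manhattan"))
hmout <- findKNN(pts, k=5, BNPARAM=HnswParam(distance="Manhattan"))

# Manual L1 distance between point 1 and its closest reported neighbor.
nn <- mout$index[1, 1]
manual <- sum(abs(pts[1,] - pts[nn,]))

manual
mout$distance[1, 1]
```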
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes. (On HPC file systems, you can set the TMPDIR environment variable to a location that is more amenable to concurrent access.)
AnnoyIndex_path(pre)
## [1] "/tmp/Rtmp0PrimQ/fileee1c1cf2cbe2.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
The saved file can then be used to construct an index object directly with the relevant constructor, e.g., AnnoyIndex() or HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
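The following is a hedged sketch of building an index at a chosen path and cleaning it up manually. It assumes that buildIndex() forwards an fname argument to the underlying Annoy index builder; check the documentation of your installed version before relying on it. The file name my_annoy_index.idx is hypothetical, and tempdir() is used here only so that the example does not leave files behind.

```r
library(BiocNeighbors)

pts <- matrix(runif(1000 * 20), ncol=20)

# Hypothetical index location; in practice, choose a directory that
# outlives the R session.
idx.path <- file.path(tempdir(), "my_annoy_index.idx")

# Assumption: 'fname' is passed through to the Annoy index builder.
pre2 <- buildIndex(pts, BNPARAM=AnnoyParam(), fname=idx.path)
AnnoyIndex_path(pre2)

# The pre-built index is used as before.
out3 <- findKNN(BNINDEX=pre2, k=5)

# Clean-up is now manual, as the file is no longer managed for us.
unlink(AnnoyIndex_path(pre2))
```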
sessionInfo()
## R version 4.4.0 Patched (2024-04-24 r86482)
## Platform: aarch64-apple-darwin20
## Running under: macOS Ventura 13.6.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.23.0 knitr_1.46 BiocStyle_2.33.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.3 xfun_0.43
## [4] jsonlite_1.8.8 S4Vectors_0.43.0 htmltools_0.5.8.1
## [7] stats4_4.4.0 sass_0.4.9 rmarkdown_2.26
## [10] grid_4.4.0 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.39 BiocManager_1.30.22 compiler_4.4.0
## [19] codetools_0.2-20 Rcpp_1.0.12 BiocParallel_1.39.0
## [22] lattice_0.22-6 digest_0.6.35 R6_2.5.1
## [25] parallel_4.4.0 bslib_0.7.0 Matrix_1.7-0
## [28] tools_4.4.0 BiocGenerics_0.51.0 cachem_1.0.8