BiocNeighbors 1.8.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including the Annoy and HNSW algorithms demonstrated in this document.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 8584 3942 1711 2359 4200 6331 561 4329 5708 3803
## [2,] 7924 5792 8419 3740 920 5103 5852 7777 7011 5669
## [3,] 9756 2517 4308 6564 822 8126 4162 497 8632 2656
## [4,] 2063 1679 5260 9437 3510 3585 7202 9278 716 5452
## [5,] 6947 5654 9142 4882 8038 2318 7935 8963 396 8464
## [6,] 3856 1723 2288 6001 7079 9658 8187 1679 4700 9871
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9785542 0.9937828 0.9974682 1.0311086 1.0537837 1.0600865 1.0702299
## [2,] 0.9445717 0.9981147 1.0318105 1.0446899 1.0486889 1.0545481 1.0545697
## [3,] 0.8897178 0.9314029 0.9426289 1.0321267 1.0487431 1.0595485 1.0612212
## [4,] 0.8848239 0.9776480 0.9812904 0.9933047 1.0061167 1.0202012 1.0442240
## [5,] 0.8590254 1.0091568 1.0164609 1.0269836 1.0304595 1.0316750 1.0341054
## [6,] 0.8624543 0.8903279 0.9003656 0.9090142 0.9092635 0.9480212 0.9584342
## [,8] [,9] [,10]
## [1,] 1.0708619 1.0742493 1.0764540
## [2,] 1.0652611 1.0881760 1.0938987
## [3,] 1.0615385 1.0625939 1.0665069
## [4,] 1.0445368 1.0677571 1.0711876
## [5,] 1.0761930 1.0767903 1.0863526
## [6,] 0.9694126 0.9704188 0.9780809
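Because the search is approximate, the reported neighbors may differ from the true nearest neighbors. The sketch below is one possible way to gauge this, not part of the original workflow: it compares the Annoy results against an exact search with KmknnParam() on a smaller simulated dataset and computes the average fraction of exact neighbors that Annoy recovers. The dataset size, the seed and the variable names are arbitrary choices made for illustration.

```r
library(BiocNeighbors)

# Smaller simulated dataset (sizes chosen only to keep the example fast).
set.seed(100)
small <- matrix(runif(2000 * 20), ncol = 20)

# Exact search as the reference, approximate search as the candidate.
exact <- findKNN(small, k = 10, BNPARAM = KmknnParam())
approx <- findKNN(small, k = 10, BNPARAM = AnnoyParam())

# For each point, the fraction of its exact neighbors found by Annoy.
recall <- vapply(seq_len(nrow(small)), function(i) {
    mean(approx$index[i, ] %in% exact$index[i, ])
}, numeric(1))

mean(recall)
```

A value close to 1 suggests that the approximation loses little for this dataset; lower values may warrant tuning the parameters of AnnoyParam().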
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 9618 115 9893 3345 540
## [2,] 6628 9246 8889 9068 9166
## [3,] 4498 791 2105 113 2994
## [4,] 5805 5270 7929 179 9037
## [5,] 4913 4837 5209 4451 4154
## [6,] 3726 6298 8561 5285 1615
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9300495 0.9845788 1.0282300 1.0416820 1.0553656
## [2,] 0.9036013 0.9143441 0.9365810 0.9747946 0.9785217
## [3,] 0.8599398 0.9458985 0.9963076 1.0066876 1.0082071
## [4,] 0.9643821 0.9886798 1.0625366 1.0658094 1.0951591
## [5,] 0.8982796 0.9959829 1.0505439 1.0590628 1.0673671
## [6,] 0.9704359 0.9718943 1.0704929 1.0744665 1.1003007
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
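As a minimal sketch of the HNSW equivalent, the calls below repeat both types of search with HnswParam() on freshly simulated data. The dataset sizes, the seed and the variable names are illustrative assumptions; only the BNPARAM argument differs from the Annoy calls.

```r
library(BiocNeighbors)

# Simulated reference and query datasets (sizes are arbitrary).
set.seed(1)
ref <- matrix(runif(1000 * 20), ncol = 20)
qry <- matrix(runif(100 * 20), ncol = 20)

# Neighbors of each reference point within the reference dataset.
hout <- findKNN(ref, k = 10, BNPARAM = HnswParam())

# Neighbors in the reference dataset for each query point.
hq <- queryKNN(ref, qry, k = 5, BNPARAM = HnswParam())

dim(hout$index)
dim(hq$distance)
```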
Most of the options described for the exact methods are also applicable here. For example:
subset to identify neighbors for a subset of points.
get.distance to avoid retrieving distances when unnecessary.
BPPARAM to parallelize the calculations across multiple workers.
BNINDEX to build the index once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
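The subset and get.distance options can be combined with a pre-built index in the same way. The following is a small sketch on simulated data; the choice of the first 10 points as the subset and the variable names are arbitrary assumptions made for illustration.

```r
library(BiocNeighbors)

set.seed(2)
pts <- matrix(runif(1000 * 20), ncol = 20)

# Build the Annoy index once and re-use it below.
idx <- buildIndex(pts, BNPARAM = AnnoyParam())

# Search only for the first 10 points and skip the distance matrix.
sub <- findKNN(BNINDEX = idx, k = 5, subset = 1:10, get.distance = FALSE)

dim(sub$index)
names(sub)
```

Skipping the distances reduces the size of the returned object when only the neighbor identities are needed.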
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
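As a brief sketch of the Manhattan option, the code below runs an Annoy search with distance="Manhattan" on simulated data and recomputes one reported distance by hand as a sanity check. The data, the seed and the variable names are assumptions for illustration; a small tolerance is used in the comparison as a hedge against reduced floating-point precision in the index.

```r
library(BiocNeighbors)

set.seed(3)
pts <- matrix(runif(1000 * 20), ncol = 20)

# Approximate search under the Manhattan (L1) distance.
man <- findKNN(pts, k = 5, BNPARAM = AnnoyParam(distance = "Manhattan"))

# Recompute the distance from point 1 to its first reported neighbor.
j <- man$index[1, 1]
manual <- sum(abs(pts[1, ] - pts[j, ]))

all.equal(man$distance[1, 1], manual, tolerance = 1e-4)
```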
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
On HPC file systems, you can change the temporary directory (e.g., via the TMPDIR environment variable) to a location that is more amenable to concurrent access.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpbwZrPR/file5410fe6e783.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
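One way to control where the index file is written is the directory argument of AnnoyParam(), sketched below on simulated data. The directory name is hypothetical, and a subdirectory of tempdir() merely stands in for a persistent location so that the example leaves nothing behind. The file is located with AnnoyIndex_path() and then removed manually once the searches are done.

```r
library(BiocNeighbors)

set.seed(4)
pts <- matrix(runif(1000 * 20), ncol = 20)

# Hypothetical directory standing in for a user-chosen persistent location.
idx.dir <- file.path(tempdir(), "my-annoy-indices")
dir.create(idx.dir, showWarnings = FALSE)

# Write the index file into the chosen directory.
idx <- buildIndex(pts, BNPARAM = AnnoyParam(directory = idx.dir))
idx.file <- AnnoyIndex_path(idx)
existed <- file.exists(idx.file)

# Use the index as usual.
out <- findKNN(BNINDEX = idx, k = 5)

# Clean up the index file once it is no longer needed.
unlink(idx.file)
```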
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.8.0 knitr_1.30 BiocStyle_2.18.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 bookdown_0.21 lattice_0.20-41
## [4] digest_0.6.27 grid_4.0.3 stats4_4.0.3
## [7] magrittr_1.5 evaluate_0.14 rlang_0.4.8
## [10] stringi_1.5.3 S4Vectors_0.28.0 Matrix_1.2-18
## [13] rmarkdown_2.5 BiocParallel_1.24.0 tools_4.0.3
## [16] stringr_1.4.0 parallel_4.0.3 xfun_0.18
## [19] yaml_2.2.1 compiler_4.0.3 BiocGenerics_0.36.0
## [22] BiocManager_1.30.10 htmltools_0.5.0