The k-nearest-neighbors (kNN) graph is a popular and powerful data structure that is used in various areas of Data Science, but the high computational cost of obtaining it hinders its use on large datasets. Approximate solutions have been described in the literature using diverse techniques, among which Locality-sensitive Hashing (LSH) is a promising alternative that still has unsolved problems. We present Variable Resolution Locality-sensitive Hashing, an algorithm that addresses these problems to obtain an approximate kNN graph at a significantly reduced computational cost. Its usability is greatly enhanced by its capacity to automatically find adequate hyperparameter values, a common hindrance to LSH-based methods. Moreover, we provide an implementation in the distributed computing framework Apache Spark that takes advantage of the structure of the algorithm to efficiently distribute the computational load across multiple machines, enabling practitioners to apply this solution to very large datasets. Experimental results show that our method offers significant improvements over the state-of-the-art in the field and shows very good scalability as more machines are added to the computation.
Keywords: Machine Learningm, Distributed programming, kNN, Locality-sensitive Hashing