Lossless Compression of Vector IDs Boosts Approximate Nearest Neighbor Search
2025-01-23

Researchers introduce a lossless compression scheme for vector IDs that targets the high storage cost of indexes in approximate nearest neighbor (ANN) search. The method exploits the fact that many index structures do not depend on the order in which IDs are stored, and it encodes the IDs with asymmetric numeral systems or wavelet trees. It compresses vector IDs by up to 7x without affecting accuracy or search runtime, which translates to a 30% reduction in index size for billion-scale datasets. The same approach can also losslessly compress quantized vector codes by exploiting sub-optimalities in the original quantization algorithm.
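
To give a feel for why ignoring ID order saves space, here is a minimal Python sketch. It is not the researchers' method: instead of asymmetric numeral systems or wavelet trees, it swaps in a much simpler technique (sort the IDs, then store the gaps as variable-length integers). The function names, the universe size of 1,000,000 and the list length of 1,000 are all assumptions chosen for illustration. The sketch also compares the cost of storing the IDs as an ordered list with the information-theoretic minimum for an unordered set.

```python
import math
import random


def encode_id_set(ids):
    """Encode distinct non-negative IDs as sorted gaps in LEB128-style varints.

    Illustrative stand-in only; the summarized work uses ANS or wavelet trees.
    """
    out = bytearray()
    prev = -1
    for x in sorted(ids):          # order is discarded: only the set survives
        gap = x - prev - 1         # >= 0 because the IDs are distinct
        prev = x
        while gap >= 0x80:         # emit 7 bits per byte, low bits first
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_id_set(data):
    """Invert encode_id_set; returns the IDs in ascending order."""
    ids = []
    prev = -1
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:            # continuation bit set: more bytes follow
            shift += 7
        else:
            prev = prev + gap + 1
            ids.append(prev)
            gap = 0
            shift = 0
    return ids


def set_entropy_bits(universe, n):
    """log2 of C(universe, n): minimum bits to identify an unordered set of n IDs."""
    log_binom = (math.lgamma(universe + 1)
                 - math.lgamma(n + 1)
                 - math.lgamma(universe - n + 1))
    return log_binom / math.log(2)


if __name__ == "__main__":
    universe, n = 1_000_000, 1_000  # hypothetical sizes, not from the source
    ids = random.sample(range(universe), n)
    encoded = encode_id_set(ids)
    assert decode_id_set(encoded) == sorted(ids)  # lossless as a set
    print("raw 32-bit IDs (bytes):  ", 4 * n)
    print("sorted gaps (bytes):     ", len(encoded))
    print("ordered-list bound (bits):", round(n * math.log2(universe)))
    print("unordered-set bound (bits):", round(set_entropy_bits(universe, n)))
```

The gap between the two bounds is roughly log2(n!) bits, which is the information carried by the ordering itself. A search index that never reads that ordering can drop it without losing anything, and that is the slack an order-invariant coder can reclaim.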