PgDog: Open-Source Sharding for pgvector

2025-03-26

Scaling pgvector beyond a million embeddings becomes challenging because index builds are slow. This post introduces PgDog, an open-source project that shards the pgvector index. Building on the clustering that IVFFlat already performs, PgDog distributes partitions of the vector space across multiple machines. Centroids are calculated with scikit-learn, and each query vector is routed to the shards whose centroids are closest to it, which the post reports improves search speed and recall. The implementation details cover centroid calculation, a custom sharding function, and SQL parsing with pg_query. Experiments demonstrate PgDog's effectiveness, and the post describes optimizations such as parallel cross-shard queries and refined centroid allocation. Future work includes support for more distance algorithms and SIMD instructions for faster calculations.
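To make the routing step concrete, here is a minimal sketch of centroid-based shard selection. It is not PgDog's actual code. It assumes the centroids have already been computed (for example with scikit-learn's KMeans), uses L2 distance, and places centroid i on shard i modulo the shard count. The names `nearest_centroids` and `shards_for_query`, the `probes` parameter, and the modulo placement are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroids(query: Vector, centroids: Sequence[Vector], probes: int = 1) -> List[int]:
    """Return the indices of the `probes` centroids closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: l2_distance(query, centroids[i]))
    return order[:probes]


def shards_for_query(query: Vector, centroids: Sequence[Vector], num_shards: int, probes: int = 1) -> List[int]:
    """Map the closest centroids to shard numbers.

    Assumption for this sketch: centroid i is stored on shard i % num_shards.
    """
    shards = {i % num_shards for i in nearest_centroids(query, centroids, probes)}
    return sorted(shards)


# Hypothetical precomputed centroids for a 2-dimensional embedding space.
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0], [10.0, 0.0]]

print(shards_for_query([9.0, 9.5], centroids, num_shards=2))
print(shards_for_query([1.0, 2.0], centroids, num_shards=3, probes=2))
```

Probing more than one centroid trades extra cross-shard queries for a better chance of finding true nearest neighbors that sit near a partition boundary, which is the same trade-off IVFFlat makes within a single index.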
