PgDog: Open-Source Sharding for pgvector

2025-03-26

Scaling pgvector beyond a million embeddings becomes challenging because index builds get slow. This post introduces PgDog, an open-source project that shards the pgvector index. IVFFlat already clusters the vector space around centroids, so PgDog uses those clusters as partitions and distributes them across multiple machines. The centroids are calculated with scikit-learn, and each query vector is routed to the shards whose centroids are closest to it, which the post reports improves both search speed and recall. The implementation details cover centroid calculation, a custom sharding function, and SQL parsing with pg_query. Experiments demonstrate the approach's effectiveness, and the post covers optimizations such as parallel cross-shard queries and refined centroid allocation. Future work includes support for more distance algorithms and SIMD instructions for faster distance calculations.
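
The centroid-based routing described above can be illustrated with a minimal sketch. It is not PgDog's actual sharding function: the centroids are assumed to be precomputed (for example with scikit-learn's KMeans), the names `route_to_shards` and `probes` are hypothetical, and the modulo mapping of centroids to shards is an assumption made only for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def route_to_shards(query, centroids, num_shards, probes=1):
    """Return the sorted shard numbers that should receive this query.

    Assumption: centroid i is stored on shard i % num_shards. A real
    deployment may allocate centroids to shards differently.
    """
    # Rank centroid indices from nearest to farthest from the query.
    ranked = sorted(
        range(len(centroids)),
        key=lambda i: l2_distance(query, centroids[i]),
    )
    # Keep the nearest `probes` centroids and collect their shards.
    return sorted({i % num_shards for i in ranked[:probes]})


# Hypothetical 2-D centroids; real embeddings have many more dimensions.
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0], [10.0, 0.0]]

print(route_to_shards([9.0, 9.5], centroids, num_shards=2))            # [1]
print(route_to_shards([9.0, 9.5], centroids, num_shards=2, probes=2))  # [0, 1]
```

Probing more than one centroid trades speed for recall: a query near a cluster boundary may have true neighbors in an adjacent cluster, which can live on another shard.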

Development sharding

Postgres Sharding: A Thrilling Tale of Scaling to 6x

2025-03-14

A company handling 100,000 users per second ran into the limits of PostgreSQL's write capacity. Instead of migrating to NoSQL, the engineering team chose to shard the database. They split it into 6 instances and kept the data in sync with logical replication. The work involved writing Ruby and Python code to handle sharding keys, plus custom tools to address sequence issues. The successful 6x expansion led to the creation of PgDog, an open-source project for automated Postgres sharding. The story highlights the ingenuity and determination of the engineers, and the scalability of PostgreSQL.
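
As a rough illustration of what handling a sharding key in application code can look like, here is a minimal sketch. It is not the team's actual code: the hash choice (SHA-256), the `shard_for` name, and the modulo mapping are all assumptions. Only the shard count of 6 comes from the story.

```python
import hashlib

NUM_SHARDS = 6  # the story describes splitting into 6 instances


def shard_for(key):
    """Map a sharding key to a shard number in [0, NUM_SHARDS).

    Uses a stable hash so every process and every restart agrees on
    the mapping. Python's built-in hash() is randomized per process
    for strings, so it would not be safe here.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Hypothetical keys: the same key always lands on the same shard.
for user_id in ("user-1", "user-2", "user-3"):
    print(user_id, "-> shard", shard_for(user_id))
```

With a modulo mapping like this, changing the shard count remaps most keys, which is one reason resharding usually needs a data-copy mechanism such as the logical replication mentioned above.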

Development database sharding