SemHash: Blazing Fast Semantic Text Deduplication

2025-01-12

SemHash is a lightweight, flexible tool for deduplicating datasets by semantic similarity. It combines fast embedding generation from Model2Vec with efficient approximate nearest neighbor (ANN) similarity search through Vicinity. SemHash supports both single-dataset and multi-dataset deduplication, and it handles simple datasets such as lists of texts as well as more complex ones such as multi-column QA datasets. It also includes functions for inspecting deduplication results, which makes it easier to understand and refine your data cleaning process. In our benchmarks, SemHash is extremely fast and scales to datasets with millions of records.
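To make the idea concrete, here is a minimal sketch of single-dataset semantic deduplication: embed each text, compare it to the records already kept, and drop it if the similarity exceeds a threshold. This is not SemHash's API or implementation. The names `embed`, `cosine`, and `self_deduplicate` are hypothetical, character-trigram counts stand in for Model2Vec embeddings, a brute-force scan stands in for Vicinity's ANN search, and the 0.8 threshold is an arbitrary choice for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: character-trigram counts (a stand-in for a real embedding model)."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def self_deduplicate(texts, threshold=0.8):
    """Keep the first occurrence of each group of near-duplicate texts.

    Returns (kept_texts, duplicates), where each duplicate is a tuple of
    (dropped_text, matched_kept_text, similarity) so results can be inspected.
    """
    kept = []        # (text, vector) pairs retained so far
    duplicates = []  # records dropped, with the record they matched
    for text in texts:
        vector = embed(text)
        # Brute-force nearest neighbor among kept records (ANN would replace this scan).
        best_score, best_match = 0.0, None
        for kept_text, kept_vector in kept:
            score = cosine(vector, kept_vector)
            if score > best_score:
                best_score, best_match = score, kept_text
        if best_match is not None and best_score >= threshold:
            duplicates.append((text, best_match, best_score))
        else:
            kept.append((text, vector))
    return [text for text, _ in kept], duplicates


texts = [
    "The cat sat on the mat.",
    "The cat sat on the mat!",
    "Stock prices fell sharply today.",
]
kept, duplicates = self_deduplicate(texts)
print(kept)
for dropped, matched, score in duplicates:
    print(f"dropped {dropped!r} (similar to {matched!r}, score {score:.2f})")
```

The brute-force scan compares every record against every kept record, so its cost grows quadratically with dataset size. Replacing that scan with an ANN index is what makes this approach practical at the scale of millions of records.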