Building a Web Search Engine from Scratch: 3 Billion Embeddings and 2 Months of Hustle
2025-08-13
The author recounts a two-month effort to build a web search engine from scratch on top of 3 billion SBERT embeddings. Motivated by the shortcomings of existing search engines (too much SEO spam, too little high-quality content), the project aimed to improve search relevance and the handling of complex queries. The post walks through the pipeline: crawling, text normalization, chunking, semantic context handling, embedding generation, storage (using RocksDB and HNSW), and retrieval. The resulting engine reports 500 ms query latency and handles complex natural-language queries, surfacing high-quality results.
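
To make the chunk, embed, store, and retrieve flow described above concrete, here is a minimal sketch in Python. It is not the author's implementation, and every name in it (`split_sentences`, `toy_embed`, `build_index`, `search`, `DIM`) is hypothetical. It also swaps in simpler techniques: a hashed bag-of-words vector stands in for SBERT, a plain in-memory list stands in for RocksDB, and an exact brute-force cosine scan stands in for the approximate HNSW index.

```python
import hashlib
import math
import re

DIM = 256  # toy dimensionality chosen for this sketch (an assumption)


def split_sentences(text):
    """Naive chunking: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_embed(text):
    """Stand-in for an SBERT model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def build_index(pages):
    """Turn {url: text} into a list of (url, chunk, vector) records.

    The list is a stand-in for a persistent store such as RocksDB.
    """
    index = []
    for url, text in pages.items():
        for chunk in split_sentences(text):
            index.append((url, chunk, toy_embed(chunk)))
    return index


def search(index, query, k=3):
    """Exact cosine search over all chunks, best match first.

    Vectors are unit length, so the dot product is the cosine similarity.
    An HNSW index would approximate this ranking without a full scan.
    """
    q = toy_embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), url, chunk)
        for url, chunk, vec in index
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

Calling `search(build_index(pages), "some query")` returns `(score, url, chunk)` tuples. The brute-force scan is linear in the number of chunks, which is why an approximate index such as HNSW becomes necessary at the scale of billions of vectors.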