Search Engine Crawler Optimization: The Long Tail of 0.1%

2025-03-27

A search engine's crawler consistently struggled to finish its runs, spending days on the last few domains. A recent migration to the slop crawl data format reduced memory usage by roughly 80%, which freed room for more concurrent crawling tasks. With that change, 99.9% of domains completed within 4 days, but the remaining 0.1% took another week.

The long tail stems from two factors: website size follows a Pareto distribution, with large sites (especially academic ones spanning many subdomains and documents) dominating the tail, and the crawler caps the number of concurrent tasks per domain. The initial random queue ordering meant large sites often started late in the run. Sorting the queue by subdomain count fixed that, but produced a surge of simultaneous requests to blog hosts, which serve many subdomains from one server. Adding jitter to request delays and adjusting the sort to prioritize only sites with more than 8 subdomains partially solved the problem, but the inherent limitations of the batch crawling model still call for further optimization.
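The combined fix can be sketched roughly as follows. Only the 8-subdomain threshold comes from the post; the names, data structure, and jitter range are illustrative assumptions:

```python
import random

# Threshold from the post: only sites with more than 8 subdomains
# are treated as "large" and moved to the front of the queue.
SUBDOMAIN_THRESHOLD = 8

def order_crawl_queue(specs, rng=None):
    """Order (domain, subdomain_count) pairs so that large sites
    start early, while the rest are shuffled to avoid clustering
    many requests on the same host (e.g. blog platforms)."""
    rng = rng or random.Random()
    big = [s for s in specs if s[1] > SUBDOMAIN_THRESHOLD]
    rest = [s for s in specs if s[1] <= SUBDOMAIN_THRESHOLD]
    big.sort(key=lambda s: s[1], reverse=True)  # largest sites first
    rng.shuffle(rest)                           # spread load across hosts
    return big + rest

def jittered_delay(base_seconds, rng=None):
    """Randomize the inter-request delay so concurrent tasks hitting
    the same host don't fire their requests in lockstep."""
    rng = rng or random.Random()
    return base_seconds * (0.5 + rng.random())  # 0.5x .. 1.5x base
```

This preserves the benefit of the subdomain-count sort (large sites begin on day one) without its downside (synchronized bursts against shared hosts), since small sites keep a randomized order and every request carries jitter.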

Tags: development, crawler optimization