Search Engine Crawler Optimization: The Long Tail of 0.1%

2025-03-27

A search engine's crawler consistently struggled to finish its runs, spending days on the last few domains. A recent migration to slop crawl data cut memory usage by 80%, which allowed more crawling tasks to run at once. As a result, 99.9% of the crawl completed in 4 days, but the remaining 0.1% took a week. The cause is twofold. Website sizes follow a Pareto distribution, so a few very large sites (especially academic ones with numerous subdomains and documents) account for much of the work, and the crawler limits the number of concurrent tasks per domain, so those sites cannot simply be given more parallelism. With the initial random ordering, large sites often started late. Sorting by subdomain count instead sent a surge of requests to blog hosts. Adding jitter to the request delay and changing the sort order to prioritize sites with more than 8 subdomains partially solved the problem. However, the batch crawling model has inherent limitations that call for further optimization.
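The ordering and jitter described above could look roughly like the following sketch. It is not the crawler's actual code: the `Site` record, the function names, the shuffle within each group, and the 50% jitter fraction are all assumptions made for illustration. Only the "more than 8 subdomains" threshold comes from the summary.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    domain: str           # registered domain, e.g. "example.edu"
    subdomain_count: int  # known subdomains (hypothetical field)


def crawl_order(sites, threshold=8, rng=None):
    """Put sites with more than `threshold` subdomains first.

    Each group is shuffled (an assumption) rather than strictly sorted
    by subdomain count, so hosts with many subdomains do not all start
    at the very same moment.
    """
    if rng is None:
        rng = random.Random()
    large = [s for s in sites if s.subdomain_count > threshold]
    small = [s for s in sites if s.subdomain_count <= threshold]
    rng.shuffle(large)
    rng.shuffle(small)
    return large + small


def jittered_delay(base_seconds, jitter_fraction=0.5, rng=None):
    """Return the base delay plus a random extra of up to
    `jitter_fraction` of it, so requests drift out of lockstep."""
    if rng is None:
        rng = random.Random()
    return base_seconds + rng.uniform(0.0, base_seconds * jitter_fraction)
```

A caller would iterate over `crawl_order(sites)` when scheduling the batch and sleep for `jittered_delay(base)` between requests to the same domain. Starting the large sites early does not remove the per-domain concurrency limit; it only gives the longest jobs the most wall-clock time.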

Development crawler optimization

AI Startup Guide: Become a Worse Netizen

2025-03-22

This satirical piece details the extreme measures an AI startup takes to obtain training data. Ignoring robots.txt and forging user agents, they ruthlessly crawl forms and Git repositories, and even hijack their neighbor's Wi-Fi. They avoid connection pooling, refuse to close connections, and deliberately drop packets, all in the name of speed and data acquisition. The story humorously highlights the reckless disregard for rules and ethics that some AI startups show in their pursuit of success, a disregard that ultimately damages their reputation.

Startup