Webtagr - Technology News Summarizer

Search Engine Crawler Optimization: The Long Tail of 0.1%

2025-03-27

A search engine's crawler consistently struggled to finish its runs, spending days on the last few domains. A recent migration to slop crawl data reduced memory usage by 80%, which made it possible to run more crawling tasks at once. With that change, 99.9% of the crawl completed in 4 days, but the remaining 0.1% took a week. The issue stems from two factors: website sizes follow a Pareto distribution, with a few very large websites (especially academic ones with numerous subdomains and documents), and the crawler limits the number of concurrent tasks per domain. The initial random ordering caused large sites to start late. Sorting by subdomain count instead led to a surge of requests to blog hosts. Adding jitter to request delays and adjusting the sort order to prioritize sites with more than 8 subdomains partially solved the problem. However, inherent limitations of the batch crawling model mean further optimization is still needed.
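
The ordering and jitter ideas described above can be sketched roughly as follows. This is an illustrative sketch, not the crawler's actual code: the `CrawlJob` type, its field names, the function names, and the jitter fraction are all assumptions made for the example. Only the threshold of 8 subdomains comes from the summary. Shuffling the remaining smaller sites is likewise an assumption, chosen here because a random order spreads out requests to shared hosts.

```python
import random
from dataclasses import dataclass

# From the summary: sites with more than 8 subdomains are prioritized.
SUBDOMAIN_THRESHOLD = 8


@dataclass(frozen=True)
class CrawlJob:
    """Hypothetical description of one domain to crawl."""
    domain: str
    subdomain_count: int


def order_jobs(jobs, rng):
    """Return jobs with large sites first, then the rest in random order.

    Large sites (more than SUBDOMAIN_THRESHOLD subdomains) are sorted by
    subdomain count, largest first, so the slowest crawls start early.
    Smaller sites are shuffled (an assumption of this sketch) so that
    domains on the same host are less likely to be crawled back to back.
    """
    large = [j for j in jobs if j.subdomain_count > SUBDOMAIN_THRESHOLD]
    small = [j for j in jobs if j.subdomain_count <= SUBDOMAIN_THRESHOLD]
    large.sort(key=lambda j: j.subdomain_count, reverse=True)
    rng.shuffle(small)
    return large + small


def jittered_delay(base_seconds, rng, max_jitter_fraction=0.5):
    """Return the base delay plus a random extra of up to max_jitter_fraction.

    The 0.5 default is an arbitrary illustrative value.
    """
    return base_seconds * (1.0 + rng.uniform(0.0, max_jitter_fraction))


if __name__ == "__main__":
    rng = random.Random(42)
    jobs = [
        CrawlJob("blog-a.example", 1),
        CrawlJob("university.example", 40),
        CrawlJob("blog-b.example", 2),
        CrawlJob("institute.example", 12),
    ]
    for job in order_jobs(jobs, rng):
        print(job.domain, round(jittered_delay(1.0, rng), 2))
```

Starting the largest sites first matters because the per-domain concurrency limit puts a floor on how quickly any single large site can finish, so a late start on such a site extends the whole run.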

Marginalia Search Project Awarded Second NLNet Grant

2025-03-25

The Marginalia Search project has received a second grant from NLNet! This funding will support the majority of the project roadmap for 2025. Full-time development has been underway since Summer 2023, and this grant secures further development time and extends the project's timeline significantly. More details to follow.

AI Startup Guide: Become a Worse Netizen

2025-03-22

This satirical piece details the extreme measures an AI startup takes to obtain training data. Ignoring robots.txt and forging user-agents, they ruthlessly crawl forms and Git repositories, and even hijack their neighbor's Wi-Fi. They avoid connection pooling, refuse to close connections, and deliberately drop packets, all in the name of speed and data acquisition. The story humorously highlights the reckless disregard for rules and ethics that some AI startups show in their pursuit of success, which ultimately damages their reputation.