Marginalia Search Index: A Significant Performance Boost

2025-08-17

The Marginalia search engine has undergone a significant index redesign to better leverage modern hardware. By employing memory-mapped B-trees and deterministic block-based skip lists, along with careful tuning of block sizes and I/O strategies, search speeds have been dramatically improved. The post details the new data structures and performance optimizations, explores the idiosyncrasies of NVMe SSD read performance, and shows how block size and I/O mode can be adjusted to get the most out of the drive.

Development

Marginalia Search Engine Upgrades: Online Status and Ownership Change Detection

2025-06-19

The Marginalia Search Engine team implemented a new system, 'ping-process,' to detect server online status and significant website changes, including ownership transfers and parking. Primarily using HTTP HEAD requests and DNS queries, the system analyzes certificate details, security posture, and server headers to identify changes. Data is stored in 'snapshot' and 'event' tables, the former holding current information and the latter historical events. The system overcame scheduling and certificate validation challenges, showing early success in identifying parked domains. Future plans include refining the ownership change detection model and integrating it into crawler strategies for improved efficiency.
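
The snapshot/event split described above can be illustrated with a minimal sketch. It assumes a snapshot is a plain dictionary of observed values and that an event is recorded whenever a watched field differs between two snapshots. The field names (`online`, `cert_fingerprint`, `server_header`) and the function name are hypothetical, not taken from the actual ping-process schema.

```python
def diff_snapshot(previous, current,
                  watched=("online", "cert_fingerprint", "server_header")):
    """Compare two snapshots and return one event per changed field.

    The field names are illustrative assumptions, not the real schema.
    """
    events = []
    for field in watched:
        old, new = previous.get(field), current.get(field)
        if old != new:
            # The snapshot keeps only the latest value; the event keeps the history.
            events.append({"field": field, "old": old, "new": new})
    return events


before = {"online": True, "cert_fingerprint": "aa11", "server_header": "nginx"}
after = {"online": True, "cert_fingerprint": "bb22", "server_header": "nginx"}
changes = diff_snapshot(before, after)
```

In this sketch a changed certificate fingerprint yields a single event, which a downstream model could weigh alongside other signals when judging whether a domain has changed hands.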


AI Overload: A Day in the Dystopian Future?

2025-05-23

From an AI alarm clock to a gym with excessive security measures and a car constantly boasting about its features, the protagonist's day is overwhelmed by pervasive AI. This seemingly convenient future is filled with suffocating annoyances and privacy violations, prompting reflection on the overdevelopment of AI technology and the lack of human interaction.

Tech

Search Engine Adds PDF Indexing: Conquering the Challenges of Text Extraction

2025-05-13

The search engine recently gained the ability to index PDFs, a feat more complex than it seems. PDFs aren't text-based; they're graphical, representing text as glyph coordinates that may be rotated, overlapping, or disordered. This article details improvements to PDFBox's PDFTextStripper class. By statistically analyzing font sizes and line spacing, the modified stripper more reliably identifies semantic structure such as headings and paragraphs. This makes the extracted text more accurate and better suited to indexing.
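
As a rough illustration of the font-size statistics idea, the sketch below treats the most common font size on a page as body text and labels noticeably larger lines as headings. This is not the PDFBox code from the article; the `classify_lines` name, the `(text, font_size)` input shape, and the 1.2 ratio are all assumptions made for the example.

```python
from collections import Counter


def classify_lines(lines, heading_ratio=1.2):
    """Label each (text, font_size) pair as 'heading' or 'paragraph'.

    heading_ratio is an assumed threshold, not a value from the article.
    """
    if not lines:
        return []
    # Assume the most frequent font size (the mode) is the body-text size.
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    return [
        (text, "heading" if size >= body_size * heading_ratio else "paragraph")
        for text, size in lines
    ]


page = [("Introduction", 18.0), ("First line.", 10.0),
        ("Second line.", 10.0), ("Third line.", 10.0)]
labelled = classify_lines(page)
```

A real extractor would also have to consider line spacing and handle rotated or overlapping glyphs, which this sketch ignores.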

Development PDF indexing

Search Engine Crawler Optimization: The Long Tail of 0.1%

2025-03-27

A search engine's crawler consistently struggled to finish its task, spending days on the final few domains. A recent migration to slop crawl data reduced memory usage by 80%, which allowed more crawling tasks to run. As a result, the crawl reached 99.9% completion in 4 days, but the remaining 0.1% took a week. The issue has two causes: website sizes follow a Pareto distribution, so a few sites (especially academic ones with numerous subdomains and documents) are very large, and the crawler limits the number of concurrent tasks per domain. Initial random ordering caused large sites to start late. Sorting by subdomain count led to a surge of requests to blog hosts. Adding request delay jitter and adjusting the sort order to prioritize sites with more than 8 subdomains partially solved the problem. However, inherent limitations of the batch crawling model mean further optimization is required.
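
The ordering and jitter adjustments can be sketched as follows. This is a minimal illustration, not the crawler's actual code: it assumes domains arrive as a mapping from name to subdomain count, that each group is shuffled internally, and that jitter is a uniform scaling of a base delay. Only the threshold of 8 subdomains comes from the summary; the function names and the `spread` parameter are hypothetical.

```python
import random

# From the summary: sites with more than 8 subdomains are prioritized.
SUBDOMAIN_THRESHOLD = 8


def crawl_order(domains, rng):
    """Return domain names with large sites first, each group shuffled.

    domains maps a domain name to its subdomain count.
    """
    large = [d for d, n in domains.items() if n > SUBDOMAIN_THRESHOLD]
    small = [d for d, n in domains.items() if n <= SUBDOMAIN_THRESHOLD]
    # Shuffling within each group is an assumption; it avoids a fully sorted
    # order in which similar hosts are requested back to back.
    rng.shuffle(large)
    rng.shuffle(small)
    return large + small


def jittered_delay(base_seconds, rng, spread=0.5):
    """Scale the base delay by a random factor in [1 - spread, 1 + spread]."""
    return base_seconds * rng.uniform(1.0 - spread, 1.0 + spread)


rng = random.Random(0)
order = crawl_order({"a.edu": 20, "b.com": 1, "c.org": 9, "d.net": 8}, rng)
delay = jittered_delay(1.0, rng)
```

Starting the largest sites early gives them the whole crawl window, while the jitter keeps concurrent tasks from hitting the same host in lockstep.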

Development crawler optimization

AI Startup Guide: Become a Worse Netizen

2025-03-22

This satirical piece details the extreme measures an AI startup takes to obtain training data. Ignoring robots.txt and forging user-agents, they ruthlessly crawl forms, Git repositories, and even hijack their neighbor's Wi-Fi. They avoid connection pooling, refuse to close connections, and deliberately drop packets, all in the name of speed and data acquisition. The story humorously highlights the reckless disregard for rules and ethics exhibited by some AI startups in their pursuit of success, ultimately resulting in reputational damage.

Startup