Finite State Machines as Data Structures: Indexing Billions of URLs

2025-08-14

This article explores finite state machines (FSMs) as data structures for representing ordered sets and maps, and shows how efficiently Rust's `fst` crate can build indexes with them. It walks through FSM construction, covering both trie and finite state acceptor (FSA) construction, and demonstrates indexing over 1.6 billion URLs from the July 2015 Common Crawl Archive. It also discusses memory mapping, intersecting an automaton with a regular expression, fuzzy searching by Levenshtein distance, and streaming set operations. The author builds finite state transducers (FSTs) and benchmarks them against general-purpose compression schemes (gzip, xz) on several datasets of varying size and character.
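
To make the trie half of that construction concrete, here is a minimal standard-library sketch of a trie used as an ordered set. It is not the `fst` crate's API, and the names `TrieNode`, `TrieSet`, and `keys_with_prefix` are hypothetical. A trie shares only common prefixes; a minimized FSA additionally shares common suffixes, which this sketch does not attempt.

```rust
use std::collections::BTreeMap;

/// One node of a byte-keyed trie. Children live in a BTreeMap so that
/// traversal visits keys in lexicographic (byte) order.
#[derive(Default)]
struct TrieNode {
    children: BTreeMap<u8, TrieNode>,
    /// True if the path from the root to this node spells a stored key.
    is_final: bool,
}

/// An ordered set of strings backed by a trie (hypothetical type).
#[derive(Default)]
struct TrieSet {
    root: TrieNode,
    len: usize,
}

impl TrieSet {
    /// Inserts `key`; returns true if it was not already present.
    fn insert(&mut self, key: &str) -> bool {
        let mut node = &mut self.root;
        for &b in key.as_bytes() {
            node = node.children.entry(b).or_default();
        }
        let newly_added = !node.is_final;
        node.is_final = true;
        if newly_added {
            self.len += 1;
        }
        newly_added
    }

    /// Membership test: follow one transition per byte of the key.
    fn contains(&self, key: &str) -> bool {
        let mut node = &self.root;
        for b in key.as_bytes() {
            match node.children.get(b) {
                Some(child) => node = child,
                None => return false,
            }
        }
        node.is_final
    }

    /// Returns every stored key starting with `prefix`, in sorted order.
    fn keys_with_prefix(&self, prefix: &str) -> Vec<String> {
        let mut node = &self.root;
        for b in prefix.as_bytes() {
            match node.children.get(b) {
                Some(child) => node = child,
                None => return Vec::new(),
            }
        }
        let mut out = Vec::new();
        let mut buf = prefix.as_bytes().to_vec();
        collect(node, &mut buf, &mut out);
        out
    }
}

/// Depth-first walk that emits the key at every final node.
fn collect(node: &TrieNode, buf: &mut Vec<u8>, out: &mut Vec<String>) {
    if node.is_final {
        // Final nodes only exist at the end of inserted &str keys,
        // so the buffer holds valid UTF-8 here.
        out.push(String::from_utf8_lossy(buf).into_owned());
    }
    for (&b, child) in &node.children {
        buf.push(b);
        collect(child, buf, out);
        buf.pop();
    }
}
```

Lookup cost depends on the length of the key rather than on the number of keys stored, and URLs with a common host share a single path through the structure.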

Development Indexing

Rust's `panic` and `unwrap()`: When and How to Use Them?

2025-05-21

This article examines the use of `panic` and `unwrap()` in Rust. The author argues that `panic` should not be used for general error handling but reserved as a signal of a bug in the program. `unwrap()` is acceptable in tests, example code, and prototypes, but should be used cautiously in production because it can crash the program. The author explains runtime invariants in detail and why it is sometimes impossible or undesirable to turn every runtime invariant into a compile-time one. Finally, the author recommends `expect()` over `unwrap()` where possible and discusses whether linting against `unwrap()` is a good idea.
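
As a small illustration of that distinction, the sketch below returns a `Result` for input that can legitimately be wrong and uses `expect()` only where a failure would mean a bug. The functions `parse_port`, `default_port`, and `port_or_default` are hypothetical names invented for this example and do not come from the article.

```rust
use std::num::ParseIntError;

/// User-supplied input may be invalid, which is an expected runtime
/// condition, so the failure is returned to the caller as an error.
fn parse_port(input: &str) -> Result<u16, ParseIntError> {
    input.trim().parse::<u16>()
}

/// The literal below is known to be a valid u16. If parsing it ever
/// failed, that would be a bug in this program, so panicking is a
/// reasonable response. `expect()` records the violated invariant in
/// the panic message, where `unwrap()` would give no context.
fn default_port() -> u16 {
    "8080"
        .parse::<u16>()
        .expect("hard-coded default port must parse as u16")
}

/// Uses the caller's value if one was given, otherwise the default.
/// Errors from bad input propagate; nothing here panics on bad input.
fn port_or_default(input: Option<&str>) -> Result<u16, ParseIntError> {
    match input {
        Some(s) => parse_port(s),
        None => Ok(default_port()),
    }
}
```

Calling `port_or_default(Some("not-a-port"))` yields an `Err` the caller can report, whereas the `expect()` in `default_port` can only fire if the source code itself is wrong.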

Development