Finite State Machines as Data Structures: Indexing Billions of URLs
This article explores using finite state machines (FSMs) as data structures for representing ordered sets and maps, using Rust's fst crate to show how efficiently they can serve as indexes. It walks through how these machines are constructed, first as tries and then as finite state acceptors (FSAs), and demonstrates indexing over 1.6 billion URLs from the July 2015 Common Crawl Archive. It also discusses memory mapping, intersecting an automaton with a regular expression, fuzzy searching by Levenshtein distance, and streaming set operations. Finally, the author builds finite state transducers (FSTs) from multiple datasets of varying sizes and characteristics and benchmarks them against the compression schemes gzip and xz.