494x Faster Word Counting with SIMD and Threads

2025-08-17

This article walks through the author's optimization of a word-counting program, which ends up roughly 494x faster than where it began. The starting point is a naive Python implementation that takes 89.6 seconds. Moving the work into CPython's `re` module cuts that to 13.7 seconds, a scalar loop written in C brings it down to 1.205 seconds, and a final version that combines SIMD instructions with multithreading finishes in 181 milliseconds. Each step is explained in turn: leveraging C extensions, writing an efficient C loop, and using multiple CPU cores. Multithreading yielded smaller gains than the author expected, but the final version still processes input at 5.52 GiB/s. The author invites readers to suggest further optimizations.
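To make the first two steps concrete, here is a minimal sketch of what a naive Python word counter and an `re`-based one might look like. It is not the author's code: the function names, the byte-oriented input, and the definition of a word as a maximal run of non-whitespace ASCII bytes are all assumptions made for illustration.

```python
import re

# Assumption: a "word" is a maximal run of bytes that are not ASCII whitespace.
WHITESPACE = b" \t\n\r\v\f"

def count_words_naive(data: bytes) -> int:
    """Count words by inspecting one byte at a time in the interpreter."""
    count = 0
    in_word = False
    for byte in data:  # iterating over bytes yields ints
        if byte in WHITESPACE:
            in_word = False
        elif not in_word:
            # A non-whitespace byte after whitespace starts a new word.
            in_word = True
            count += 1
    return count

# Same definition of a word, but the scanning happens inside the regex engine.
_WORD = re.compile(rb"[^ \t\n\r\v\f]+")

def count_words_re(data: bytes) -> int:
    """Count words by letting `re` find each run of non-whitespace bytes."""
    return sum(1 for _ in _WORD.finditer(data))

sample = b"the quick  brown\nfox\t"
print(count_words_naive(sample), count_words_re(sample))  # prints "4 4"
```

Both functions return the same count. The second one does far less work per byte in interpreted Python, which is the general idea behind the first speedup described above.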

Development