BQN Matrix Multiplication Performance Optimization: Cache Blocking and Divide and Conquer

2025-06-27

This article explores how to speed up large matrix multiplication in BQN. The author first partitions the matrices into square blocks to make effective use of the cache, which yields a speedup of about 6x. The author then introduces Strassen's algorithm, which is based on a divide-and-conquer strategy, and shows experimentally that it reaches up to a 9x speedup on large matrices. The article also compares how different block sizes and nested tiling strategies affect performance, and concludes that a pure, single-threaded BQN implementation has essentially reached its performance limit.
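To make the two techniques concrete, here is a minimal sketch in Python rather than BQN. It is not the author's code: the function names, the block size `bs`, and the recursion `cutoff` are all hypothetical choices made for illustration, and pure-Python lists will not reproduce the speedups reported above. The sketch only shows the structure of each method: blocking reorders the loops so that work proceeds tile by tile, and Strassen's algorithm replaces the eight half-size products of the ordinary recursive scheme with seven.

```python
def matmul(a, b):
    """Naive product of two matrices stored as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def blocked_matmul(a, b, bs=2):
    """Same product, accumulated one bs x bs tile at a time.

    bs is an illustrative block size, not a value from the article.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, bs):              # tile row of the result
        for j0 in range(0, m, bs):          # tile column of the result
            for p0 in range(0, k, bs):      # tile along the shared dimension
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + bs, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


def add(x, y):
    return [[p + q for p, q in zip(r, s)] for r, s in zip(x, y)]


def sub(x, y):
    return [[p - q for p, q in zip(r, s)] for r, s in zip(x, y)]


def split(x):
    """Split a square matrix of even size into four quadrants."""
    h = len(x) // 2
    return ([r[:h] for r in x[:h]], [r[h:] for r in x[:h]],
            [r[:h] for r in x[h:]], [r[h:] for r in x[h:]])


def strassen(a, b, cutoff=2):
    """Strassen's algorithm for square matrices.

    Falls back to the naive product when the size is at most cutoff
    (a hypothetical threshold) or is odd, so no padding is needed.
    """
    n = len(a)
    if n <= cutoff or n % 2:
        return matmul(a, b)
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # Seven recursive products instead of eight.
    m1 = strassen(add(a11, a22), add(b11, b22), cutoff)
    m2 = strassen(add(a21, a22), b11, cutoff)
    m3 = strassen(a11, sub(b12, b22), cutoff)
    m4 = strassen(a22, sub(b21, b11), cutoff)
    m5 = strassen(add(a11, a12), b22, cutoff)
    m6 = strassen(sub(a21, a11), add(b11, b12), cutoff)
    m7 = strassen(sub(a12, a22), add(b21, b22), cutoff)
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    # Stitch the quadrants back into one matrix.
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

Both variants return the same result as `matmul` on integer inputs. In a real implementation the cutoff matters a great deal, because below some size the extra additions in Strassen's scheme cost more than the product they save.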

Development