Blazing Fast Fibonacci on the GPU with Thrust

2025-06-27

This blog post demonstrates how to compute Fibonacci numbers very quickly on the GPU with the NVIDIA Thrust library. It starts by explaining the scan algorithm, then shows how to use Thrust's scan operations for simple addition and multiplication, and extends this to matrix operations. Finally, it shows how to calculate Fibonacci numbers efficiently by running a scan over matrix multiplications, using modular arithmetic to avoid integer overflow. The author calculates F_99999999 (mod 9837) in just 17 milliseconds on an NVIDIA GeForce RTX 3060 Mobile GPU.
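
To make the idea concrete, here is a minimal CPU sketch of the scan-over-matrices approach. It is not the post's code: it uses the C++17 standard library's `std::inclusive_scan` in place of Thrust, and the names `Mat2`, `MulMod`, and `fib_mod_scan` are hypothetical. It relies on the identity that the k-th power of the matrix [[1, 1], [1, 0]] holds F_k in its off-diagonal entry, so an inclusive scan with matrix multiplication over n copies of that matrix yields F_1 through F_n.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical 2x2 matrix type, row-major: [a b; c d].
struct Mat2 {
    std::uint64_t a, b, c, d;
};

// Associative scan operator: 2x2 matrix product with every entry reduced mod m.
struct MulMod {
    std::uint64_t m;
    Mat2 operator()(const Mat2& x, const Mat2& y) const {
        return {(x.a * y.a + x.b * y.c) % m, (x.a * y.b + x.b * y.d) % m,
                (x.c * y.a + x.d * y.c) % m, (x.c * y.b + x.d * y.d) % m};
    }
};

// Returns {F_1 mod m, ..., F_n mod m} using an inclusive scan over matrices.
std::vector<std::uint64_t> fib_mod_scan(std::size_t n, std::uint64_t m) {
    // Keep m below 2^31 so the products and their sum fit in 64 bits.
    assert(m > 0 && m < (1ull << 31));

    const Mat2 q{1 % m, 1 % m, 1 % m, 0};  // Q = [1 1; 1 0], reduced mod m
    std::vector<Mat2> powers(n, q);        // n copies of Q

    // After the scan, powers[i] == Q^(i+1) mod m.
    std::inclusive_scan(powers.begin(), powers.end(), powers.begin(), MulMod{m});

    std::vector<std::uint64_t> fib(n);
    for (std::size_t i = 0; i < n; ++i) {
        fib[i] = powers[i].b;  // off-diagonal entry of Q^(i+1) is F_(i+1)
    }
    return fib;
}
```

Calling `fib_mod_scan(10, 9837)` returns ten values whose last element is F_10 = 55. Because matrix multiplication is associative, the same operator can in principle be handed to a parallel scan such as `thrust::inclusive_scan`, which is what makes the approach suitable for the GPU.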

Highly Efficient Matrix Transpose in Mojo: Beating CUDA?

2025-06-06

This blog post details how to implement a highly efficient matrix transpose kernel on the Hopper architecture using Mojo. The author walks through the optimizations step by step, starting from a naive approach and ending with a kernel that achieves 2775.49 GB/s of bandwidth, which is competitive with, and potentially exceeds, equivalent CUDA implementations. The techniques covered include TMA (Tensor Memory Accelerator) descriptors, shared memory staging, data swizzling, and thread coarsening. The post covers the implementation details and performance gain of each technique and provides complete code examples.
