GPU-Accelerated RNNs: A CUDA Implementation of minGRU and minLSTM

2025-09-21

This blog post details a final project for Caltech's CS179: GPU Programming that verifies the claims of Feng et al.'s paper, "Were RNNs All We Needed?" The project implemented simplified minGRU and minLSTM models together with a custom CUDA parallel scan that computes the recurrence across the sequence dimension in parallel. Results showed significant GPU speedups for long sequences, validating the paper's core finding: once the gates no longer depend on the previous hidden state, the RNN recurrence becomes a linear scan that can be parallelized. For short sequences, however, CUDA kernel launch overhead negated some of the performance gains. Kernel profiling identified the final projection layer as the primary bottleneck, suggesting a further optimization: fusing the per-timestep projections into a single cuBLAS GEMM call. Both ideas are sketched below.
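The key observation behind the scan is that the minGRU update h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t is a first-order linear recurrence h_t = a_t * h_{t-1} + b_t with a_t = 1 - z_t and b_t = z_t * h̃_t, and composing two such updates is associative, so all timesteps can be computed with a parallel scan. What follows is a minimal illustrative sketch in that spirit, not the project's actual kernel: a single-block Hillis-Steele scan assuming one hidden unit, a sequence short enough to fit in one thread block, and h_0 = 0.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Inclusive scan over (a_t, b_t) pairs of the recurrence h_t = a_t * h_{t-1} + b_t.
// Launch with exactly n threads in one block and 2*n*sizeof(float) shared memory.
__global__ void linear_scan(const float* a, const float* b, float* h, int n) {
    extern __shared__ float s[];
    float* sa = s;        // running coefficient of h_0 over a composed prefix
    float* sb = s + n;    // running additive offset over a composed prefix
    int t = threadIdx.x;  // one thread per timestep
    sa[t] = a[t];
    sb[t] = b[t];
    __syncthreads();
    // Hillis-Steele steps under the associative composition
    // (a1, b1) then (a2, b2) -> (a1*a2, a2*b1 + b2),
    // since a2*(a1*h + b1) + b2 = (a1*a2)*h + (a2*b1 + b2).
    for (int off = 1; off < n; off <<= 1) {
        float pa = 1.0f, pb = 0.0f;  // identity element for threads with no prefix
        if (t >= off) { pa = sa[t - off]; pb = sb[t - off]; }
        float ca = sa[t], cb = sb[t];
        __syncthreads();
        sa[t] = ca * pa;             // compose prefix first, then own segment
        sb[t] = ca * pb + cb;
        __syncthreads();
    }
    h[t] = sb[t];  // with h_0 = 0 the coefficient drops out; h_t is the offset
}

int main() {
    const int n = 8;
    float a_h[n], b_h[n], h_h[n];
    for (int i = 0; i < n; ++i) { a_h[i] = 0.5f; b_h[i] = 1.0f; }  // toy gate values
    float *a_d, *b_d, *h_d;
    cudaMalloc(&a_d, n * sizeof(float));
    cudaMalloc(&b_d, n * sizeof(float));
    cudaMalloc(&h_d, n * sizeof(float));
    cudaMemcpy(a_d, a_h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, n * sizeof(float), cudaMemcpyHostToDevice);
    linear_scan<<<1, n, 2 * n * sizeof(float)>>>(a_d, b_d, h_d, n);
    cudaMemcpy(h_h, h_d, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("h[%d] = %f\n", i, h_h[i]);  // 1, 1.5, 1.75, ...
    cudaFree(a_d); cudaFree(b_d); cudaFree(h_d);
    return 0;
}
```

This computes all n hidden states in O(log n) parallel steps instead of n sequential ones; a production kernel would additionally batch over hidden units and batch entries and handle sequences longer than one block.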
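The projection-layer fix mentioned above is the standard batching trick: rather than launching one projection per timestep, stack all T hidden states into a (T x H) matrix and multiply by the (H x V) output weight once. Here is a hedged sketch of what that single call could look like; the function and parameter names are illustrative, and since cuBLAS stores matrices column-major, the row-major product Y = H * W is expressed as its transpose Y^T = W^T * H^T.

```cuda
#include <cublas_v2.h>

// Single-GEMM output projection over all timesteps (illustrative sketch).
// d_H: (T x H) hidden states, row-major; d_W: (H x V) weights, row-major;
// d_Y: (T x V) logits, row-major.
void project_all_timesteps(cublasHandle_t handle,
                           const float* d_H, const float* d_W, float* d_Y,
                           int T, int H, int V) {
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major view: (V x T) = (V x H) * (H x T), no explicit transposes.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                V, T, H,
                &alpha,
                d_W, V,   // W^T in column-major view, leading dimension V
                d_H, H,   // H^T in column-major view, leading dimension H
                &beta,
                d_Y, V);  // Y^T in column-major view, leading dimension V
}
```

One large GEMM amortizes kernel launch overhead across all timesteps and lets cuBLAS pick a tuned kernel for the full problem size, which is exactly where the profiling pointed.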

Developing parallel algorithms