GPU-Accelerated RNNs: A CUDA Implementation of minGRU and minLSTM

2025-09-21

This blog post details a final project for Caltech's CS179: GPU Programming that verifies the claims of Feng et al.'s paper, "Were RNNs All We Needed?" The project implemented simplified minGRU and minLSTM models together with a custom CUDA parallel scan that computes the recurrence across the sequence dimension in parallel. Results showed significant GPU speedups for long sequences, validating the paper's core finding: once the gates no longer depend on the previous hidden state, the RNN recurrence becomes a linear scan that can be parallelized. For short sequences, however, CUDA kernel launch overhead negated some of the performance gains. Kernel profiling identified the final projection layer as the primary bottleneck, suggesting a further optimization: fusing the per-timestep projections into a single cuBLAS GEMM call. Both ideas are sketched below.
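The key observation behind the scan is that the minGRU update h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t is a first-order linear recurrence h_t = a_t * h_{t-1} + b_t with a_t = 1 - z_t and b_t = z_t * h̃_t, and composing two such updates is associative, so all timesteps can be computed with a parallel scan. What follows is a minimal illustrative sketch in that spirit, not the project's actual kernel: a single-block Hillis-Steele scan assuming one hidden unit, a sequence short enough to fit in one thread block, and h_0 = 0.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Inclusive scan over (a_t, b_t) pairs of the recurrence h_t = a_t * h_{t-1} + b_t.
// Launch with exactly n threads in one block and 2*n*sizeof(float) shared memory.
__global__ void linear_scan(const float* a, const float* b, float* h, int n) {
    extern __shared__ float s[];
    float* sa = s;        // running coefficient of h_0 over a composed prefix
    float* sb = s + n;    // running additive offset over a composed prefix
    int t = threadIdx.x;  // one thread per timestep
    sa[t] = a[t];
    sb[t] = b[t];
    __syncthreads();
    // Hillis-Steele steps under the associative composition
    // (a1, b1) then (a2, b2) -> (a1*a2, a2*b1 + b2),
    // since a2*(a1*h + b1) + b2 = (a1*a2)*h + (a2*b1 + b2).
    for (int off = 1; off < n; off <<= 1) {
        float pa = 1.0f, pb = 0.0f;  // identity element for threads with no prefix
        if (t >= off) { pa = sa[t - off]; pb = sb[t - off]; }
        float ca = sa[t], cb = sb[t];
        __syncthreads();
        sa[t] = ca * pa;             // compose prefix first, then own segment
        sb[t] = ca * pb + cb;
        __syncthreads();
    }
    h[t] = sb[t];  // with h_0 = 0 the coefficient drops out; h_t is the offset
}

int main() {
    const int n = 8;
    float a_h[n], b_h[n], h_h[n];
    for (int i = 0; i < n; ++i) { a_h[i] = 0.5f; b_h[i] = 1.0f; }  // toy gate values
    float *a_d, *b_d, *h_d;
    cudaMalloc(&a_d, n * sizeof(float));
    cudaMalloc(&b_d, n * sizeof(float));
    cudaMalloc(&h_d, n * sizeof(float));
    cudaMemcpy(a_d, a_h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, n * sizeof(float), cudaMemcpyHostToDevice);
    linear_scan<<<1, n, 2 * n * sizeof(float)>>>(a_d, b_d, h_d, n);
    cudaMemcpy(h_h, h_d, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("h[%d] = %f\n", i, h_h[i]);  // 1, 1.5, 1.75, ...
    cudaFree(a_d); cudaFree(b_d); cudaFree(h_d);
    return 0;
}
```

This computes all n hidden states in O(log n) parallel steps instead of n sequential ones; a production kernel would additionally batch over hidden units and batch entries and handle sequences longer than one block.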
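The projection-layer fix mentioned above is the standard batching trick: rather than launching one projection per timestep, stack all T hidden states into a (T x H) matrix and multiply by the (H x V) output weight once. Here is a hedged sketch of what that single call could look like; the function and parameter names are illustrative, and since cuBLAS stores matrices column-major, the row-major product Y = H * W is expressed as its transpose Y^T = W^T * H^T.

```cuda
#include <cublas_v2.h>

// Single-GEMM output projection over all timesteps (illustrative sketch).
// d_H: (T x H) hidden states, row-major; d_W: (H x V) weights, row-major;
// d_Y: (T x V) logits, row-major.
void project_all_timesteps(cublasHandle_t handle,
                           const float* d_H, const float* d_W, float* d_Y,
                           int T, int H, int V) {
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major view: (V x T) = (V x H) * (H x T), no explicit transposes.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                V, T, H,
                &alpha,
                d_W, V,   // W^T in column-major view, leading dimension V
                d_H, H,   // H^T in column-major view, leading dimension H
                &beta,
                d_Y, V);  // Y^T in column-major view, leading dimension V
}
```

One large GEMM amortizes kernel launch overhead across all timesteps and lets cuBLAS pick a tuned kernel for the full problem size, which is exactly where the profiling pointed.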

Developing parallel algorithms