zymtrace: Frictionless GPU Profiling to Unlock Full Potential

zymtrace is a lightweight, production-grade continuous GPU profiler that traces performance bottlenecks (kernel stalls, memory contention, scheduling delays) directly back to their source in PyTorch code, CUDA kernels, native functions, or scheduler threads. Unlike existing solutions, zymtrace provides whole-system visibility by correlating GPU traces with the CPU code paths that triggered them. This lets AI/ML engineers optimize CUDA kernel launches, determine optimal batch sizes, and diagnose low GPU utilization, which improves GPU performance and reduces costs.