Deep Dive: GPU vs. TPU Architectures for LLMs

2025-08-20

This article provides a detailed comparison of GPU and TPU architectures, focusing on their core compute units, memory hierarchies, and networking capabilities. Using the H100 and B200 GPUs as examples, it meticulously dissects the internal workings of modern GPUs, including Streaming Multiprocessors (SMs), CUDA Cores, Tensor Cores, and the interplay between various memory levels (SMEM, L2 Cache, HBM). The article also contrasts GPU and TPU performance in collective communication (e.g., AllReduce, AllGather), analyzing the impact of different parallelism strategies (data parallelism, tensor parallelism, pipeline parallelism, expert parallelism) on large language model training efficiency. Finally, it summarizes strategies for scaling LLMs on GPUs, illustrated with DeepSeek v3 and LLaMA-3 examples.
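To give a feel for the collective-communication costs the article analyzes, here is a minimal back-of-the-envelope sketch of a bandwidth-bound ring AllReduce, as used for gradient synchronization in data parallelism. The device count, link bandwidth, and model size below are illustrative assumptions, not figures taken from the article.

```python
# Back-of-the-envelope estimate of one gradient AllReduce in data-parallel
# training. Assumes a bandwidth-bound ring AllReduce, where each device
# sends/receives roughly 2 * (N - 1) / N times the gradient payload.
# The hardware numbers in the example are illustrative placeholders.

def ring_allreduce_seconds(param_count: float,
                           bytes_per_param: float,
                           num_devices: int,
                           link_bandwidth_gbps: float) -> float:
    """Estimated wall-clock time for one full-gradient ring AllReduce."""
    payload_bytes = param_count * bytes_per_param
    traffic_per_device = 2 * (num_devices - 1) / num_devices * payload_bytes
    bandwidth_bytes_per_s = link_bandwidth_gbps * 1e9 / 8  # Gbit/s -> byte/s
    return traffic_per_device / bandwidth_bytes_per_s

if __name__ == "__main__":
    # Example: 70B parameters, bf16 gradients (2 bytes), 8 devices,
    # 400 Gbit/s of usable per-device bandwidth.
    t = ring_allreduce_seconds(70e9, 2, 8, 400)
    print(f"~{t:.2f} s per full-gradient AllReduce")
```

The same traffic formula explains why tensor parallelism, which AllReduces activations every layer, is usually confined to fast intra-node links, while data parallelism's once-per-step gradient AllReduce tolerates slower inter-node networks.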

Read more
AI

The Alchemy of Efficient LLM Training: Beyond Compute Limits

2025-02-04

This article delves into the efficient training of large language models (LLMs) at massive scale. The author argues that even with tens of thousands of accelerators, relatively simple principles can significantly improve model performance. Topics covered include assessing model performance, choosing parallelism schemes at different scales, estimating the cost and time of training large Transformer models, and designing algorithms that exploit specific hardware advantages. Through in-depth explanations of TPU and GPU architectures and a detailed analysis of the Transformer architecture, readers will gain a better understanding of scaling bottlenecks and be better equipped to design efficient models and algorithms.
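As a concrete illustration of the cost-and-time estimation the article covers, here is a minimal sketch based on the standard dense-Transformer rule of thumb (training FLOPs ≈ 6 × parameters × tokens). The chip count, peak throughput, and utilization in the example are illustrative assumptions, not numbers from the article.

```python
# Rough cost model for dense Transformer training using the common
# FLOPs ~= 6 * parameters * tokens approximation. Hardware figures in
# the example (peak FLOP/s, MFU, chip count) are illustrative assumptions.

def training_days(params: float,
                  tokens: float,
                  num_chips: int,
                  peak_flops_per_chip: float,
                  mfu: float) -> float:
    """Estimated wall-clock days to train a dense Transformer."""
    total_flops = 6 * params * tokens
    effective_throughput = num_chips * peak_flops_per_chip * mfu
    return total_flops / effective_throughput / 86_400  # seconds -> days

if __name__ == "__main__":
    # Example: 70B parameters on 15T tokens, 8,192 chips at ~1e15 FLOP/s
    # peak (bf16), 40% model FLOPs utilization -> roughly three weeks.
    print(f"~{training_days(70e9, 15e12, 8192, 1e15, 0.40):.1f} days")
```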

Read more