ViTs vs. CNNs: Speed Benchmarks Shatter Resolution Myths

2025-05-04

This article challenges the common belief that Vision Transformers (ViTs) are inefficient for high-resolution image processing. Through rigorous benchmarking across various GPUs, the author compares the inference speed, FLOPs, and memory usage of ViTs and Convolutional Neural Networks (CNNs). Results show ViTs perform well at resolutions up to and including 1024x1024 pixels, often outperforming CNNs on modern hardware in both speed and memory efficiency. The author also argues against overemphasizing high resolution, suggesting that lower resolutions are often sufficient for many tasks. Finally, the article introduces local attention mechanisms, which further improve ViT efficiency at higher resolutions.
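To see why high resolutions are often assumed to be a problem for ViTs, it helps to look at how attention FLOPs scale with image size. The sketch below is illustrative only (the patch size, embedding dimension, and layer count are assumed ViT-Base-style defaults, not figures from the article): the projection cost grows linearly with token count, while the attention cost grows quadratically.

```python
def vit_attention_flops(img_size: int, patch: int = 16, dim: int = 768, layers: int = 12) -> int:
    """Rough per-image FLOP count for the attention blocks of a ViT.

    Assumes a square image, ViT-Base-style defaults, and counts only
    the Q/K/V/output projections plus the attention matmuls.
    """
    n = (img_size // patch) ** 2          # number of patch tokens
    proj = 4 * n * dim * dim              # Q, K, V, and output projections: linear in n
    attn = 2 * n * n * dim                # QK^T and attention-weighted values: quadratic in n
    return layers * (proj + attn)

for res in (224, 512, 1024):
    print(f"{res}x{res}: {vit_attention_flops(res) / 1e9:.1f} GFLOPs (attention blocks)")
```

Doubling the resolution quadruples the token count, so the quadratic attention term grows 16x while the projections grow only 4x. This is the scaling that local attention mechanisms target: restricting attention to windows keeps the quadratic term bounded, which is why they help most at the higher resolutions the article benchmarks.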

AI