FastVLM: Blazing Fast Vision Encoding for Vision Language Models

2025-05-13

FastVLM introduces a novel hybrid vision encoder that emits fewer tokens and substantially cuts encoding time for high-resolution images. Compared with LLaVA-OneVision-0.5B, the smallest variant achieves an 85x faster Time-to-First-Token (TTFT) with a 3.4x smaller vision encoder. Larger variants, paired with the Qwen2-7B LLM, outperform recent models such as Cambrian-1-8B while achieving a 7.9x faster TTFT. A demo iOS app showcases its mobile performance, and the project provides detailed instructions for inference, including on Apple Silicon and other Apple devices.