Training the Strongest Model on a MacBook Pro in 5 Minutes: A Challenge

2025-08-14

The author challenges himself to train the strongest possible language model on a MacBook Pro in just five minutes. The experiments culminated in a ~1.8M-parameter GPT-style transformer trained on ~20M TinyStories tokens, reaching a perplexity of ~9.6. Optimizations centered on maximizing tokens per second: running on MPS and skipping gradient accumulation. Dataset choice mattered as much as architecture, with TinyStories' simple, coherent language giving the best results. Transformers outperformed LSTMs and diffusion models, and the optimal model size for the five-minute training window was around 2M parameters, consistent with Chinchilla scaling laws.
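To make the setup concrete, here is a minimal sketch of the kind of run described: a ~2M-parameter GPT-style model trained on the MPS device with one optimizer step per batch (no gradient accumulation) inside a hard five-minute wall-clock budget. This is not the author's code; the hyperparameters, model layout, and random placeholder batches are illustrative assumptions.

```python
# Sketch only: tiny GPT-style LM trained on Apple-silicon MPS for five minutes,
# one optimizer step per batch (no gradient accumulation) to keep tokens/sec high.
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Illustrative sizes: roughly ~2M parameters with this config (assumption).
VOCAB, D_MODEL, N_LAYERS, N_HEADS, SEQ_LEN, BATCH = 4096, 128, 4, 4, 256, 32

class TinyGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(SEQ_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS, dim_feedforward=4 * D_MODEL,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, idx):
        seq_len = idx.size(1)
        pos = torch.arange(seq_len, device=idx.device)
        x = self.tok(idx) + self.pos(pos)
        # Causal mask so each position only attends to earlier tokens.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=idx.device),
            diagonal=1)
        x = self.blocks(x, mask=mask)
        return self.head(x)

model = TinyGPT().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")

deadline = time.time() + 5 * 60  # hard five-minute wall-clock budget
tokens_seen = 0
while time.time() < deadline:
    # Placeholder batch; a real run would stream pre-tokenized TinyStories here.
    batch = torch.randint(0, VOCAB, (BATCH, SEQ_LEN + 1), device=device)
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
    opt.zero_grad(set_to_none=True)
    loss.backward()
    opt.step()  # one step per batch: no gradient accumulation
    tokens_seen += inputs.numel()

print(f"tokens/sec ≈ {tokens_seen / (5 * 60):,.0f}, final loss {loss.item():.2f}")
```

The deadline-driven loop reflects the post's framing: with a fixed time budget, throughput (tokens per second) and model size are the levers, not epochs or total dataset size.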

AI