TorchFT: Fault-Tolerant LLM Training Under Extreme Failure Rates
2025-06-27
Researchers used TorchFT and TorchTitan to train a model in a real-world environment under extreme synthetic failure rates, demonstrating the reliability and correctness of fault-tolerant training. Even with 1,200 injected failures and no checkpoint restores, the training loss remained stable. TorchFT coordinates recovery in real time through a global Lighthouse server and a per-replica-group Manager, and it implements several fault-tolerant algorithms, including fault-tolerant HSDP and LocalSGD/DiLoCo. The experiments show that TorchFT continues to train the model effectively even at failure rates far beyond what production clusters see, underscoring its robustness across a wide range of failure scenarios.
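To make the Lighthouse/Manager split concrete, below is a minimal sketch of how a single replica group can wrap an ordinary PyTorch training loop with TorchFT. The import names follow the torchft quickstart (Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo), but exact signatures may differ between versions; the toy model, data, and loop length are illustrative assumptions, and a real run would use a TorchTitan model with a separately launched Lighthouse process.

```python
# Sketch of per-replica-group wiring with torchft; names per the torchft
# quickstart, exact APIs may vary by version.
import torch
import torch.nn as nn
import torch.optim as optim
from torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo

# Hypothetical toy model; a real setup would use a TorchTitan LLM instead.
model = nn.Linear(16, 2)

def load_state_dict(state_dict):
    # Restore weights received live from a healthy peer after a failure,
    # instead of reloading from a checkpoint.
    model.load_state_dict(state_dict["model"])

def state_dict():
    # Snapshot sent to recovering replicas by the Manager.
    return {"model": model.state_dict()}

# The Manager registers with the global Lighthouse server (its address is
# typically supplied via the TORCHFT_LIGHTHOUSE environment variable) and
# participates in a quorum on every training step.
manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=load_state_dict,
    state_dict=state_dict,
)

# Fault-tolerant data parallelism: gradient communication is routed through
# the Manager, so a failed replica group is dropped from the quorum rather
# than stalling the collective.
model = DistributedDataParallel(manager, model)

# The wrapped optimizer only commits the step if the quorum confirms that no
# participant failed mid-step, keeping surviving replicas consistent.
optimizer = Optimizer(manager, optim.AdamW(model.parameters()))

for _ in range(10):  # stand-in training loop with random data
    x = torch.randn(8, 16)
    optimizer.zero_grad()
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
```

The key design point the sketch illustrates is that fault tolerance lives at the per-step quorum rather than in periodic checkpoints: when a replica group dies, the remaining groups keep training, and the recovered group rejoins by pulling live state through the Manager.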