No More Adam: Learning Rate Scaling at Initialization is All You Need
2024-12-18
Researchers introduce SGD-SaI, an optimizer that enhances stochastic gradient descent with momentum. SGD-SaI addresses imbalances in training dynamics by scaling the learning rate of each parameter group at initialization according to its gradient signal-to-noise ratio, then training without any per-step adaptive state. Because it drops AdamW's second-moment estimates, it is significantly more memory-efficient, yet it matches or surpasses AdamW's performance across a range of Transformer-based tasks, including ImageNet classification and LLM pretraining. Its robustness and practicality across diverse applications make it a compelling alternative.
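To make the idea concrete, here is a minimal PyTorch sketch of learning-rate scaling at initialization. The per-tensor SNR proxy (|mean| / std of the gradient) and the rule "scale the base learning rate by that SNR" are illustrative assumptions for this sketch, not necessarily the paper's exact g-SNR formula or scaling rule; the model, data, and hyperparameters are placeholders.

```python
# Sketch: scale per-group learning rates once at initialization,
# then train with plain SGD + momentum (no Adam-style second moments).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
criterion = nn.CrossEntropyLoss()

# One forward/backward pass at initialization to obtain gradients.
x = torch.randn(128, 32)
y = torch.randint(0, 10, (128,))
criterion(model(x), y).backward()

base_lr, eps = 1e-2, 1e-8
param_groups = []
for name, p in model.named_parameters():
    g = p.grad
    # Illustrative gradient signal-to-noise proxy for this tensor
    # (assumption: |mean| / std stands in for the paper's g-SNR).
    snr = g.abs().mean() / (g.std() + eps)
    param_groups.append({"params": [p], "lr": base_lr * snr.item()})

model.zero_grad()

# From here on, training uses ordinary SGD with momentum; the per-group
# learning rates stay fixed at their initialization-time values.
optimizer = torch.optim.SGD(param_groups, momentum=0.9)
```

The key design point this illustrates is that all adaptivity is front-loaded into a one-time, per-group rescaling, so the optimizer carries only momentum state during training.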
AI