ISL Colloquium

Beyond AdamW: Developments in Training Algorithms for Deep Learning

Michael Shi
Research Scientist, Meta
Thursday, February 26, 2026 at 4:00 PM • Packard 202

Abstract

As AI systems scale, one avenue for improving the compute efficiency of pre-training is through fundamental developments in optimization algorithms. In this talk, I will discuss our two recent works in this area.

First, we dissect the relationship between two popular matrix-based optimizers, Shampoo and Muon. We argue that Shampoo can be understood as an adapted Muon algorithm, analogous to Adam’s relationship with Signum. Through extensive controlled experiments, we demonstrate that Shampoo achieves higher token efficiency than Muon, mirroring Adam’s advantage over Signum. Consistent with this interpretation, we propose a new perspective on Shampoo: rather than an approximation of full-matrix Adam, it can be understood as adapting the semi-orthogonality constraint in spectral descent to the time-averaged, stochastic setting.
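To make the two update rules concrete, here is a minimal sketch (not the speakers' implementation) of the nonlinear operators at the heart of the comparison: Signum applies an element-wise sign to the momentum, while Muon-style spectral descent projects the momentum matrix onto the nearest semi-orthogonal matrix. The function names and the use of an exact SVD (rather than Muon's Newton–Schulz iteration) are illustrative assumptions.

```python
import numpy as np

def signum_update(M):
    # Signum / sign descent: element-wise sign of the momentum matrix.
    return np.sign(M)

def spectral_update(M):
    # Spectral descent: replace M by its nearest semi-orthogonal matrix,
    # i.e. set every singular value to 1 (U V^T from the SVD). Muon
    # approximates this with a Newton-Schulz iteration; exact SVD is used
    # here only for clarity.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))  # toy momentum matrix

O = spectral_update(M)
# Semi-orthogonality check for a tall matrix: O^T O = I.
print(np.allclose(O.T @ O, np.eye(3), atol=1e-8))  # True
```

Both operators discard the magnitude information in the momentum (per-entry for sign, per-singular-value for semi-orthogonalization), which is what makes their stochastic behavior analogous.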

Second, we propose non-Euclidean gradient noise scales for stochastic sign and spectral descent based on their underlying geometry. These metrics bound the bias induced by nonlinear operators (such as sign or semi-orthogonalization) and reveal critical inflection points in the per-step improvement as a function of batch size for sign- and spectral-based methods. We empirically validate our adaptive batch size approach by matching the validation loss of constant-batch baselines while reducing the total number of training steps for both Signum and Muon on Llama 3 models.
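The intuition behind the bias of such nonlinear operators can be illustrated with a toy experiment (this is an assumption-laden sketch, not the paper's noise-scale metric): averaging more samples before applying the sign nonlinearity makes the result agree more often with the sign of the noiseless gradient, so per-step improvement depends on batch size in a way plain SGD's does not.

```python
import numpy as np

# Toy illustration of sign-operator bias vs. batch size: how often does
# sign(mean of B noisy gradient samples) match sign(true gradient)?
rng = np.random.default_rng(1)
g = rng.standard_normal(500)   # synthetic "true" gradient coordinates
noise_scale = 2.0              # per-sample gradient noise (assumption)

def sign_agreement(batch_size, trials=100):
    agree = 0.0
    for _ in range(trials):
        noise = noise_scale * rng.standard_normal((batch_size, g.size))
        g_hat = (g + noise).mean(axis=0)  # mini-batch gradient estimate
        agree += np.mean(np.sign(g_hat) == np.sign(g))
    return agree / trials

small = sign_agreement(4)
large = sign_agreement(256)
print(small < large)  # larger batches -> sign closer to the noiseless one
```

The same qualitative effect holds for semi-orthogonalization of a noisy momentum matrix, which is what motivates choosing the batch size adaptively rather than holding it constant.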

Bio

Michael is a Research Scientist on the Kernels and Optimizations team within Meta Superintelligence Labs. He obtained his B.S. degree in Applied Mathematics from the University of California, Los Angeles, and his Ph.D. in Industrial Engineering and Management Sciences from Northwestern University under the supervision of Prof. Jorge Nocedal. His team recently won the MLCommons AlgoPerf Training Algorithms competition (external tuning track). He previously received the Walter P. Murphy Fellowship at Northwestern and the NSF Graduate Research Fellowship Honorable Mention in 2016 and 2017, and was a top ICML reviewer in 2019. His current research interests are in the design and implementation of scalable, distributed training algorithms and systems for deep learning. He previously contributed to the areas of stochastic optimization, noisy optimization, and derivative-free optimization, as well as to recommender systems and embedding compression.