ISL Colloquium

Muon Does Not Converge on Convex Lipschitz Functions

Tetiana Parshakova
Flatiron Research Fellow, Flatiron Institute
Thursday, April 30, 2026 at 4:00 PM • Packard 202

NOTE: This is the first talk scheduled for the week of April 27. There is a second talk scheduled for May 1.

Abstract

Muon and its variants have shown strong empirical performance across a variety of deep learning tasks. Existing convergence analyses of Muon rely on smoothness assumptions, yet arguably the most successful function class for developing deep learning methods such as AdaGrad and Adam has been the class of convex Lipschitz functions. This talk asks whether the classical convex Lipschitz setting is a useful model for understanding Muon.
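
For readers unfamiliar with the method, Muon's update can be summarized as heavy-ball momentum followed by an orthogonalization of the momentum matrix. The sketch below is a minimal illustration, not the authors' implementation: the SVD-based orthogonalization stands in for the Newton-Schulz iteration used in practice, and the function names and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def orthogonalize(m: np.ndarray) -> np.ndarray:
    """Replace the singular values of m with 1 (the polar factor).

    Practical Muon implementations approximate this map with a few
    Newton-Schulz iterations rather than a full SVD.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def muon_step(w, grad, buf, lr=0.02, beta=0.95):
    """One Muon-style update on a weight matrix w (illustrative values)."""
    buf = beta * buf + grad           # heavy-ball momentum buffer
    w = w - lr * orthogonalize(buf)   # step along the orthogonalized momentum
    return w, buf
```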

The answer is no. We show that Muon does not converge on the class of convex Lipschitz functions. We also show that error feedback restores convergence, not only for Muon but more generally for non-Euclidean subgradient methods with momentum. However, this fix moves Muon towards Euclidean subgradient descent with momentum, and in experiments it weakens rather than improves the method.
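
As a rough illustration of the error-feedback fix, one standard pattern keeps a residual of whatever the non-linear orthogonalization step discarded and injects it back into the next update. The sketch below (reusing `orthogonalize` from above) is a generic error-feedback scheme under an assumed placement of the residual; the exact algorithm analyzed in the talk may differ.

```python
def muon_ef_step(w, grad, buf, err, lr=0.02, beta=0.95):
    """Muon-style step with generic error feedback (assumed form)."""
    buf = beta * buf + grad               # heavy-ball momentum buffer
    corrected = buf + err                 # inject the accumulated residual
    direction = orthogonalize(corrected)  # non-Euclidean (orthogonalized) step
    err = corrected - direction           # store what the projection dropped
    w = w - lr * direction
    return w, buf, err
```

Note that if `orthogonalize` is replaced by the identity map, the residual vanishes and the scheme reduces to Euclidean momentum descent, which is consistent with the abstract's remark that the fix pulls Muon towards the Euclidean method.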

Our conclusion is that convex Lipschitz theory, while predictive for adaptive metric methods, is the wrong model for Muon. This suggests that Muon’s success must come from structure absent from this model, most plausibly smoothness.

This is joint work with Ahmed Khaled, Guillaume Garrigos, Michael Crawshaw, and Robert Gower.

Bio

Tetiana Parshakova is a Flatiron Research Fellow at the Center for Computational Mathematics. Her research develops efficient algorithms for large-scale computational problems using tools from convex optimization, statistics, and numerical linear algebra. Before joining Flatiron, she was a postdoctoral scientist at Amazon SCOT. She earned a Ph.D. in Computational Mathematics from Stanford University and holds a B.S. in Industrial Design and an M.S. in Electrical Engineering from KAIST. At Flatiron, she focuses on developing and analyzing new optimization techniques for training large language models.