ISL Colloquium


Optimization, Robustness and Attention in Deep Learning: Insights from Random and NTK Feature Models

Marco Mondelli – Assistant Professor, IST Austria

Thu, 11-Apr-2024 / 4:00pm / Packard 202

Slides

Abstract

A recent line of work has analyzed the properties of deep learning models through the lens of Random Features (RF) and the Neural Tangent Kernel (NTK). In this talk, I will show how concentration bounds on RF and NTK maps provide insights on (i) the optimization of the network via gradient descent, (ii) its adversarial robustness, and (iii) the success of attention-based architectures, such as transformers.

I will start by proving tight bounds on the smallest eigenvalue of the NTK for deep neural networks with minimum over-parameterization. This implies that the network optimized by gradient descent interpolates the training dataset (i.e., reaches zero training loss) as soon as the number of parameters is information-theoretically optimal.

Next, I will focus on the robustness of the interpolating solution. A thought-provoking paper by Bubeck and Sellke has proposed a “universal law of robustness”: smoothly interpolating the data necessarily requires many more parameters than simple memorization. By providing sharp bounds on RF and NTK models, I will show that, while random features are never robust (regardless of the over-parameterization), NTK features saturate the universal law of robustness, thus addressing a conjecture by Bubeck, Li and Nagaraj.

Finally, I will consider attention-based architectures, showing that random attention features are sensitive to the change of a single word in the context, as expected from a model suitable for NLP tasks. In contrast, the sensitivity of plain random features decays with the length of the context. This property translates into generalization bounds: due to their low word sensitivity, random features provably cannot learn to distinguish between two sentences that differ only in a single word. In contrast, thanks to their high word sensitivity, random attention features enjoy stronger generalization guarantees.
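As a rough illustration of the first result, the sketch below builds the empirical NTK Gram matrix of a two-layer ReLU network at initialization and compares its smallest eigenvalue with that of the plain linear kernel. It is a toy numerical check, not the theorem from the talk: the sizes N, d, k are arbitrary choices, and the NTK is restricted to the hidden-layer weights for simplicity. A smallest eigenvalue bounded away from zero is what allows gradient descent, in the linearized (NTK) regime, to drive the training loss to zero and interpolate the data.

import numpy as np

rng = np.random.default_rng(1)

# Toy sizes: more samples than input dimensions, but far fewer samples than parameters.
N, d, k = 100, 30, 500                            # samples, input dimension, hidden width

X = rng.standard_normal((N, d)) / np.sqrt(d)      # inputs with roughly unit norm
W = rng.standard_normal((k, d))                   # hidden-layer weights at initialization
a = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)  # fixed output weights

# Jacobian of f(x) = sum_j a_j * relu(w_j . x) with respect to the hidden weights:
# df/dW_j = a_j * 1{w_j . x > 0} * x  (the NTK restricted to the first layer).
act = (X @ W.T > 0).astype(float)                 # (N, k) ReLU activation pattern
J = (act * a)[:, :, None] * X[:, None, :]         # (N, k, d) per-sample gradients
J = J.reshape(N, k * d)

K_ntk = J @ J.T                                   # empirical NTK Gram matrix, N x N
K_lin = X @ X.T                                   # plain linear kernel, rank <= d < N

# The NTK Gram matrix is generically full rank (positive smallest eigenvalue) once the
# number of parameters k*d exceeds N, so the linearized dynamics can fit any labels;
# the linear kernel is singular here because N > d.
print("lambda_min, NTK kernel:    %.4f" % np.linalg.eigvalsh(K_ntk)[0])
print("lambda_min, linear kernel: %.1e (numerically zero)" % np.linalg.eigvalsh(K_lin)[0])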
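In the same spirit, the following sketch contrasts the word sensitivity of a plain random-features map with that of random attention features. Everything here is an illustrative assumption rather than the construction used in the talk: Gaussian token embeddings, a flattened ReLU random-features map, a single random softmax-attention layer, and a hand-crafted single-word perturbation whose key is aligned with one query so that the softmax concentrates on it.

import numpy as np

rng = np.random.default_rng(0)

d, n, k = 64, 128, 512            # embedding dimension, context length, random features

X = rng.standard_normal((n, d))   # a context of n token embeddings, norm ~ sqrt(d) each

# Random-features map: flatten the whole context and apply a fixed random ReLU layer.
W_rf = rng.standard_normal((k, n * d)) / np.sqrt(n * d)
def rf_features(X):
    return np.maximum(W_rf @ X.reshape(-1), 0.0)

# Random attention features: softmax(X Wq (X Wk)^T / sqrt(d)) X, flattened.
Wq = rng.standard_normal((d, d)) / np.sqrt(d)
Wk = rng.standard_normal((d, d)) / np.sqrt(d)
def attn_features(X):
    S = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return (A @ X).reshape(-1)

# Change a single word (token 0). The new embedding is crafted so that its key aligns
# with the query of token 1, which makes that row's softmax concentrate on the new word.
q1 = (X @ Wq)[1]
x_new = Wk @ q1
x_new *= np.sqrt(d) / np.linalg.norm(x_new)   # keep the usual ~sqrt(d) embedding norm
Xp = X.copy()
Xp[0] = x_new

def rel_change(feat):
    return np.linalg.norm(feat(Xp) - feat(X)) / np.linalg.norm(feat(X))

# The random-features change decays as the context grows (roughly 1/sqrt(n)), while the
# attention-features change remains of order one: a single word can redirect the softmax.
print("context length n = %d" % n)
print("random features,    relative change: %.3f" % rel_change(rf_features))
print("attention features, relative change: %.3f" % rel_change(attn_features))

Increasing n in this sketch shrinks the first number while leaving the second roughly unchanged, which is the qualitative gap behind the generalization statements above.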

Bio

Marco Mondelli received the B.S. and M.S. degrees in Telecommunications Engineering from the University of Pisa, Italy, in 2010 and 2012, respectively. In 2016, he obtained his Ph.D. degree in Computer and Communication Sciences from the École Polytechnique Fédérale de Lausanne (EPFL), Switzerland. He is currently an Assistant Professor at the Institute of Science and Technology Austria (ISTA). Prior to that, he was a Postdoctoral Scholar in the Department of Electrical Engineering at Stanford University, USA, from February 2017 to August 2019. He was also a Research Fellow with the Simons Institute for the Theory of Computing, UC Berkeley, USA, for the program on Foundations of Data Science from August to December 2018. His research interests include data science, machine learning, information theory, and modern coding theory. He is the recipient of a number of fellowships and awards, including the Jack K. Wolf ISIT Student Paper Award in 2015, the STOC Best Paper Award in 2016, the EPFL Doctorate Award in 2018, the Simons-Berkeley Research Fellowship in 2018, the Lopez-Loreta Prize in 2019, and the Information Theory Society Best Paper Award in 2021.