Statistical Inference in an Interactive Learning Paradigm
NOTE: This is the second talk scheduled for the week of April 27. The start time is 3 PM rather than the usual time of 4 PM. There is still a talk scheduled for April 30.
Abstract
The proliferation of generative artificial intelligence has given rise to an interactive learning paradigm, where model parameters are continuously updated using not only data generated by natural processes, but also synthetic outputs produced by other models. This paradigm introduces two major challenges: (1) training data are no longer drawn exclusively from the target population, undermining a core assumption of classical statistical learning theory, and (2) model training processes become inherently correlated, as models influence one another through repeated exposure to each other’s synthetic outputs. Establishing reliable statistical inference in such interactive environments therefore remains an important open problem. In particular, there is growing concern about model collapse, a phenomenon in which the performance of generative models progressively degrades as they are trained on synthetic data produced by earlier model generations.
In this work, we study the behavior of generative models under general interaction patterns. We formalize these interactions using directed graphs and show that the occurrence of model collapse depends critically on the graph’s topology. Within this framework, we derive an explicit necessary and sufficient condition characterizing when model collapse occurs. Our analysis covers both finite-sample results for linear regression and asymptotic guarantees for general M-estimators. We further validate our theoretical findings through extensive numerical experiments. This talk is based on joint work with Kangjie Zhou and Weijie Su.
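To get intuition for the phenomenon the abstract describes, the following is a minimal sketch (not taken from the talk) of model collapse in one-dimensional linear regression. It assumes the simplest interaction graph, a chain 0 → 1 → 2 → …, where each generation fits ordinary least squares on labels synthesized by its predecessor; all quantities (`beta_true`, `sigma`, `n`) are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true, sigma, n = 2.0, 1.0, 200  # illustrative parameters

# Generation 0: fit on real data drawn from the target population.
x = rng.normal(size=n)
y = beta_true * x + sigma * rng.normal(size=n)
estimates = [(x @ y) / (x @ x)]  # OLS slope through the origin

# Later generations: each trains on synthetic labels produced by the
# previous model (a chain interaction graph), so estimation errors
# accumulate across generations like a random walk.
for _ in range(50):
    x = rng.normal(size=n)
    y_synth = estimates[-1] * x + sigma * rng.normal(size=n)
    estimates.append((x @ y_synth) / (x @ x))

errors = [abs(b - beta_true) for b in estimates]
print(f"gen 0 error: {errors[0]:.3f}, gen 50 error: {errors[-1]:.3f}")
```

Because each generation's slope estimate is an unbiased perturbation of its predecessor's, the error variance grows linearly in the number of generations along a chain; richer graph topologies, where a model sees outputs from several predecessors, can dampen or amplify this accumulation, which is the kind of dependence on graph structure the abstract's necessary and sufficient condition characterizes.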
Bio
Yuchen Wu is an assistant professor in Cornell’s School of Operations Research and Information Engineering. Prior to Cornell, she was a postdoctoral researcher in the Department of Statistics and Data Science at the Wharton School, University of Pennsylvania. She received her Ph.D. in 2023 from the Department of Statistics at Stanford University.
Her research lies at the intersection of statistics, machine learning, and game theory, with interests spanning diffusion models, high-dimensional statistics, statistical sampling, and mechanism design.