Quantifying and reducing bias in data exploration using information theory

Speaker: James Zou
Microsoft Research

Venue: Packard 202
Time: 4 pm to 5 pm
Date: Wednesday, January 27, 2016

Abstract

Modern data is messy and high-dimensional, and it is often not clear a priori what to look for. Instead, a human or an analysis algorithm needs to explore the data to identify interesting hypotheses to test. It is widely recognized that this exploration, even when well-intentioned, can lead to statistical biases and false discoveries. We propose a general framework using mutual information to quantify and provably bound the bias (and other properties) of arbitrary data exploration processes. We show that our bound is tight in natural settings, and we apply it to characterize the conditions under which common analytic practices, e.g., rank selection, LASSO, and hold-out sets, do or do not lead to substantially biased estimation. Finally, we show how, by viewing bias through this information lens, we can derive randomization approaches that effectively reduce false discoveries.
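
The following is not part of the announcement, just a small self-contained Python sketch (all parameters and names are illustrative choices, not the speaker's method) of the phenomenon the abstract describes: when every true effect is zero, greedily reporting the largest of k noisy estimates yields a positively biased value, while a randomized selection rule, whose choice reveals less information about the data, reports a smaller bias. The sigma * sqrt(2 ln k) line is the standard bound on the expected maximum of k sigma-subgaussian estimates; under an information-theoretic reading of the kind the abstract suggests, it corresponds to a selection that may reveal up to ln k nats about the data.

import numpy as np

rng = np.random.default_rng(0)
k, n, trials = 20, 50, 5000        # k candidate hypotheses, n samples behind each estimate
sigma = 1.0 / np.sqrt(n)           # standard error of each estimate

greedy, randomized = [], []
for _ in range(trials):
    # Null world: every true effect is zero, so any positive reported
    # estimate is pure selection bias.
    estimates = rng.normal(0.0, sigma, size=k)

    # Greedy exploration: report the largest-looking effect.
    greedy.append(estimates.max())

    # Randomized selection: add Gumbel noise before choosing, so the
    # choice carries less information about the data; this is the
    # intuition for why randomization reduces bias.
    choice = np.argmax(estimates + rng.gumbel(0.0, 2 * sigma, size=k))
    randomized.append(estimates[choice])

print(f"bias of greedy selection      : {np.mean(greedy):+.4f}")
print(f"bias of randomized selection  : {np.mean(randomized):+.4f}")
print(f"sigma * sqrt(2 ln k) reference: {sigma * np.sqrt(2 * np.log(k)):.4f}")

With these illustrative settings, the greedy bias typically comes out near the reference value, while the randomized selection's bias is noticeably smaller, at the cost of sometimes not reporting the largest estimate.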

Speaker Bio

James Zou is a postdoc at Microsoft Research New England. He works on machine learning and its applications to human genomics. He received his Ph.D. from Harvard University in May 2014, during which he also spent half of his time at the Broad Institute, supported by an NSF Graduate Fellowship. In Spring 2014, he was a Simons research fellow at U.C. Berkeley.