ISL Colloquium

Toward Trustworthy AI: Principled and Automated Interpretability in Deep Learning

Lily Weng
Assistant Professor, UCSD
Thursday, July 31, 2025 at 4:00 PM • Packard 202

Abstract

In this talk, I will share recent advances from my lab in explainable AI and interpretable machine learning for deep vision and language models. I will present a series of works that bring interpretability into deep learning on three fronts: (i) unveiling the inner workings of neural networks through automated mechanistic interpretability techniques [1-3], (ii) developing inherently interpretable architectures such as concept bottleneck models [4-7], and (iii) establishing a unified framework for evaluating neuron-level explanations, including practical guidelines and a set of reliable evaluation metrics [8-9]. Together, these efforts mark important steps toward building trustworthy and transparent deep learning systems.

References

[1] Oikarinen and Weng, CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks, ICLR 23 (spotlight)

[2] Oikarinen and Weng, Linear Explanations for Individual Neurons, ICML 24

[3] Bai, Iyer, et al., Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models, TMLR 25

[4] Oikarinen et al., Label-Free Concept Bottleneck Models, ICLR 23

[5] Srivastava, Yan, et al., VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance, NeurIPS 24

[6] Sun et al., Concept Bottleneck Large Language Models, ICLR 25

[7] Kulkarni et al., Interpretable Generative Models through Post-hoc Concept Bottlenecks, CVPR 25

[8] Oikarinen et al., Evaluating Neuron Explanations: A Unified Framework with Sanity Checks, ICML 25

[9] Oikarinen et al., Rethinking Crowd-Sourced Evaluation of Neuron Explanations, arXiv preprint 25

Bio

Lily Weng is an Assistant Professor in the Halıcıoğlu Data Science Institute at UC San Diego, with an affiliation in the CSE department. She received her PhD in Electrical Engineering and Computer Science (EECS) from MIT in August 2020, and her Bachelor's and Master's degrees, both in Electrical Engineering, from National Taiwan University. Prior to UCSD, she spent a year at the MIT-IBM Watson AI Lab and completed several research internships at Google DeepMind, IBM Research, and Mitsubishi Electric Research Lab. Her research interests are in machine learning and deep learning, with a primary focus on trustworthy AI. Her vision is to make next-generation AI systems and deep learning algorithms more robust, reliable, explainable, trustworthy, and safe. Her work has been recognized and supported by several NSF awards, an ARL award, an Intel Rising Star Faculty Award, a Hellman Fellowship, and an NVIDIA Academic Award. For more details, please see https://lilywenglab.github.io/.