Sanjam Garg | ISL Colloquium

Abstract

Large language models (LLMs) deployed as agents routinely process sensitive information (credentials, personal data, proprietary instructions) as in-context secrets stored in memory files or retrieved documents. We show that even when models behave correctly and refuse to reveal such information, it can still leak through subtle statistical patterns in their outputs, allowing it to be reconstructed from benign interactions.

We then consider adversarial inputs and show that standard defenses, such as prompt filtering, face inherent limitations. A fundamental asymmetry between lightweight guard models and the systems they protect allows adversarial prompts to evade detection by the guard while remaining interpretable to the protected model.

Taken together, these results suggest that current LLM architectures lack mechanisms to guarantee secrecy under adversarial interaction.

Bio

Prof. Sanjam Garg is an Associate Professor at the University of California, Berkeley. His research interests are in cryptography and its applications to security and privacy. He obtained his Ph.D. from the University of California, Los Angeles in 2013 and his undergraduate degree from the Indian Institute of Technology, Delhi in 2008. Prof. Garg is the recipient of various honors, including the ACM Doctoral Dissertation Award, the Sloan Research Fellowship and the IIT Delhi Graduates of the Last Decade Award. Prof. Garg’s research has been recognized with a Test of Time Award at FOCS 2023, and best paper awards at EUROCRYPT 2013, CRYPTO 2017 and EUROCRYPT 2018. Past students and postdoctoral researchers from Prof. Garg’s research group are now faculty and researchers at top institutions, such as Columbia University, Brown University, the University of Toronto and Microsoft Research.

Can LLMs safely handle sensitive information?
