Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have introduced a method that uses artificial intelligence (AI) models to automate the explanation of complex neural networks. The work introduces FIND, a benchmark suite for evaluating automated interpretability methods. The suite contains functions that resemble components of real-world networks, including their complexities. The work also introduces interpretability agents that use pre-trained language models to generate descriptions of function behavior.
In the realm of AI, explaining the intricate behavior of trained neural networks poses a significant challenge, particularly as these models increase in size and complexity. To address this, MIT’s CSAIL team has introduced an autonomous approach using AI models to conduct experiments on other systems and elucidate their behavior. At the core of this strategy is the “automated interpretability agent” (AIA), crafted to emulate a scientist’s experimental processes, actively engaging in hypothesis formation, experimental testing, and iterative learning.
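The hypothesize, experiment, and update cycle described above can be illustrated with a toy loop. This is a minimal sketch, not the CSAIL implementation: it uses no language model, and the names `HYPOTHESES` and `interpret`, the fixed candidate list, and the random input selection are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical candidate explanations: a plain-language description paired
# with an executable guess at what the black-box function computes.
HYPOTHESES: Dict[str, Callable[[float], float]] = {
    "doubles the input": lambda x: 2 * x,
    "squares the input": lambda x: x * x,
    "returns the absolute value": lambda x: abs(x),
}


def interpret(
    black_box: Callable[[float], float],
    hypotheses: Dict[str, Callable[[float], float]],
    rounds: int = 5,
    probes_per_round: int = 4,
    seed: int = 0,
) -> List[str]:
    """Return the descriptions that remain consistent with all observations."""
    rng = random.Random(seed)
    surviving = dict(hypotheses)
    observations: List[Tuple[float, float]] = []

    for _ in range(rounds):
        # Experiment: pick inputs and record what the black box returns.
        inputs = [rng.uniform(-10, 10) for _ in range(probes_per_round)]
        observations.extend((x, black_box(x)) for x in inputs)

        # Update: drop every hypothesis contradicted by any observation so far.
        surviving = {
            description: guess
            for description, guess in surviving.items()
            if all(abs(guess(x) - y) < 1e-9 for x, y in observations)
        }

        # Stop early once at most one explanation is left.
        if len(surviving) <= 1:
            break

    return list(surviving)
```

Calling `interpret(abs, HYPOTHESES)` probes the function and discards the two hypotheses that the observations contradict. An agent of the kind the article describes would generate and revise its hypotheses itself, rather than filter a fixed list.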
Accompanying this approach is the “function interpretation and description” (FIND) benchmark, a tool for evaluating interpretability procedures. FIND provides a reliable standard for assessing interpretability: the explanations a method produces for a function can be compared against the function descriptions in the benchmark. This addresses a longstanding problem in the field, where researchers lack access to ground-truth labels or descriptions of learned computations.
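One way to picture a comparison against ground truth is to treat both the benchmark function and a candidate explanation as executable code and measure how often they agree on sampled inputs. The article does not specify the benchmark's scoring procedure, so this is only a sketch under that assumption; `agreement_score`, its sampling range, and its tolerance are hypothetical.

```python
import random
from typing import Callable


def agreement_score(
    ground_truth: Callable[[float], float],
    candidate: Callable[[float], float],
    n_samples: int = 1000,
    tolerance: float = 1e-6,
    seed: int = 0,
) -> float:
    """Fraction of sampled inputs on which the candidate matches the ground truth."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_samples):
        # Assumed input range; a real benchmark would define its own domain.
        x = rng.uniform(-10, 10)
        if abs(ground_truth(x) - candidate(x)) <= tolerance:
            matches += 1
    return matches / n_samples
```

For example, scoring the candidate `lambda x: x` against a ground truth of `lambda x: max(x, 0.0)` gives roughly one half, since the two agree only on non-negative inputs. A partial score like this shows how an explanation can be right on part of a function's domain and wrong elsewhere.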
Sarah Schwettmann, co-lead author and a research scientist at CSAIL, underscores the advantages of the approach. She emphasizes that AIAs can generate and test hypotheses autonomously, which may reveal behaviors that would otherwise be difficult for scientists to detect. Schwettmann envisions FIND playing a role in interpretability research similar to the one that clean, simple benchmarks have played in driving advances in language models.
Large language models have recently shown strong performance on complex reasoning tasks across diverse domains. Building on this capability, the CSAIL team used language models as the backbone of general-purpose agents for automated interpretability. The approach aims to provide a universal interface for explaining various systems and for synthesizing results across experiments and modalities.
However, the team acknowledges that fully automating interpretability remains a challenge. While AIAs outperform existing methods, they still struggle to accurately describe nearly half of the functions in the benchmark. Tamar Rott Shaham, co-lead author and a postdoc in CSAIL, notes the importance of fine-tuning AIAs to overcome challenges in function subdomains with noise or irregular behavior.
Looking ahead, the researchers are developing a toolkit to enhance AIAs’ ability to conduct precise experiments on neural networks. This toolkit will equip AIAs with better tools for selecting inputs and refining hypothesis-testing capabilities, advancing neural network analysis. The ultimate goal is to develop autonomous AIAs capable of auditing other systems, contributing to the understanding and reliability of AI systems.
In summary, the CSAIL approach uses AI to make complex neural networks more understandable and reliable. The team’s findings were presented at NeurIPS 2023 in December, reflecting a collaborative effort with coauthors from MIT and Northeastern University.