Seeing Is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability

Authors:

Ziming Liu 1, Eric Gan 1, Max Tegmark 1

Affiliation:

1. Institute for Artificial Intelligence and Fundamental Interactions, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Abstract

We introduce Brain-Inspired Modular Training (BIMT), a method for making neural networks more modular and interpretable. Inspired by brains, BIMT embeds neurons in a geometric space and augments the loss function with a cost proportional to the length of each neuron connection. This draws on the idea of minimum connection cost in evolutionary biology, but we are the first to combine this idea with gradient-descent training of neural networks for interpretability. We demonstrate that BIMT discovers useful modular neural networks for many simple tasks, revealing compositional structure in symbolic formulas, interpretable decision boundaries and features for classification, and mathematical structure in algorithmic datasets. Qualitatively, BIMT-trained networks have modules that are readily identifiable by the naked eye, whereas regularly trained networks appear much more complicated. Quantitatively, we use Newman's method to compute the modularity of network graphs; BIMT achieves the highest modularity for all our test problems. A promising and ambitious future direction is to apply the proposed method to understand large models for vision, language, and science.
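To make the connection-cost idea concrete, the sketch below shows one plausible way to add a length-weighted L1 penalty to the training loss of a small multilayer perceptron. It is a minimal illustration, not the authors' released implementation: the PyTorch framing, the 1D neuron layout, the pairwise distance function, and the penalty strength lam are all illustrative assumptions.

import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    # Small MLP; each layer's neurons are (hypothetically) placed evenly on [0, 1].
    def __init__(self, dims=(2, 16, 16, 1)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])
        )

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i < len(self.layers) - 1:
                x = torch.relu(x)
        return x

def connection_cost(model, lam=1e-3):
    # Penalty proportional to |w_ij| times the distance between neurons i and j.
    cost = 0.0
    for layer in model.layers:
        w = layer.weight                                   # shape (out, in)
        out_pos = torch.linspace(0, 1, w.shape[0], device=w.device)
        in_pos = torch.linspace(0, 1, w.shape[1], device=w.device)
        dist = (out_pos[:, None] - in_pos[None, :]).abs()  # pairwise distances
        cost = cost + (w.abs() * dist).sum()
    return lam * cost

# Usage: add the penalty to the ordinary task loss during training.
model = TinyMLP()
x, y = torch.randn(32, 2), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y) + connection_cost(model)
loss.backward()

In this toy version, short connections are penalized less than long ones, so gradient descent is nudged toward locally wired, modular circuits; the actual method additionally chooses the geometric embedding and related training details described in the paper.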

Funder

The Casey Family Foundation, the Foundational Questions Institute, the Rothberg Family Fund for Cognitive Science, the NSF Graduate Research Fellowship

IAIFI through NSF

Publisher

MDPI AG

Subject

General Physics and Astronomy

