Positioning Mechanistic Interpretability

Derek Larson
Aug 27, 2023

The field of artificial intelligence continues to debate its path forward, with takes ranging from Yann LeCun’s “nothing to see here” to Elon Musk warning of societal risk. Part of this debate concerns the role of one slice of basic, foundational science: developing a microscopic theory of how neural networks work, generally termed “mechanistic interpretability”. A recent post on LessWrong caught my attention because it was rather critical of the value of interpretability work, and the comments didn’t push back as much as I’d expect. So I thought I’d lay out a series of concise arguments that either weren’t raised or could use amplifying.

Key Points

  • AI is pre-paradigmatic, paradigms are how we build understanding, and developing a microscopic theory seems the best path towards a paradigm.
  • Without a paradigm, we will largely continue with empirical science. Continuing empirically is analogous to artificial selection, where we should expect unintended consequences in our systems.
  • MechInterp doesn’t seem like an accelerator. If Sutton’s Bitter Lesson is true, diverting resources to MechInterp and other safety measures should actually dampen capabilities progress.
  • Meanwhile, MechInterp is still early; we shouldn’t prematurely draw conclusions about its effectiveness.

Establishing a Paradigm

It is generally natural to establish a paradigm for any scientific field, in order to focus and facilitate frontier research. More to the point, a paradigm is an abstraction that allows us to better understand the topic. In contrast, one can engage in empirical science or engineering, without a paradigm, in order to push outcomes: by iterating through experiments that make small changes to a system, and watching the metrics of interest, those metrics can be made to increase. This is largely how AI progress is made today.

Sometimes, there is no expectation for an appropriate paradigm. When the Romans developed their formula for concrete, it wasn’t because they knew the relevant materials science. A mix of trial-and-error and serendipity (having the right ash available) was enough. Only in 2023 could we finally explain why the concrete is so durable. And the Roman Senate had no need to call for a slowdown in concrete development.

However, for any advancement that has many unknowns and broad impact, a paradigm allows us to build certainty around its use. An obvious (edge) case would be the Manhattan Project. Physics provided a sufficient paradigm to predict the outcomes of a nuclear bomb, so its development was more theoretical than empirical, and thus sufficiently safe (from unintended consequences; plenty of problems with the intended ones…). A more relevant example is the development of CRISPR. Gene editing has sufficient unknowns and impact that a moratorium was called (pdf here). There are plenty of ancillary reasons for the moratorium, but the straightforward “we can’t sufficiently predict how editing a gene affects humans” is the focus here. And I don’t believe there’s any debate that paradigmatic research in genetics is needed to fill this gap (see the “technical considerations” subsection of “the need” in the Nature article).

Similarly, it’s hard to see how we can safely progress in AI without building a paradigm. The real debate is, perhaps, at what scale we should pursue this paradigm. MechInterp is the fully reductionist approach, which I naturally lean towards. We may find alternatives, especially if we start developing more modular architectures. In a mixture-of-experts model, we might be able to sufficiently understand how the experts interact to give us enough confidence in predicting their behavior, without understanding the internals of each expert. However, it seems difficult to develop modular architectures without MechInterp! Perhaps one approach is to use MechInterp only to “ladder up” to modular architectures, and then focus on that scale (and climb more rungs as needed).
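
As a concrete illustration of that “understand the interactions, not the internals” framing, here is a minimal mixture-of-experts sketch in PyTorch. The dimensions, names, and architecture are all hypothetical; the point is only to show where the smaller interface sits, not to suggest a particular design.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy mixture-of-experts layer: a router mixes the outputs of top-k experts."""

    def __init__(self, d_model: int = 64, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        # The router is the small interface we might hope to understand...
        self.router = nn.Linear(d_model, n_experts)
        # ...while each expert's internals stay opaque at this level of analysis.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each token is routed to its top-k experts; the expert choices and
        # mixing weights are the "interactions" a coarser paradigm could study.
        weights, idx = torch.softmax(self.router(x), dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(2, 10, 64)    # (batch, seq, d_model)
print(TinyMoE()(tokens).shape)     # torch.Size([2, 10, 64])
```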

I think a good way to think about paradigms and safety is the following: paradigms give us access to different conceptual scales, which may have a smaller “surface area” than the full system. A good analogy may be unit testing for software. We don’t want to continually run full system tests to try to debug our software; it’s useful to have unit tests that limit the surface area, localizing the problems. A multi-scaled paradigm would allow us to find the right unit tests of AI, so we don’t regress as we develop.
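
To make the analogy concrete, a “unit test of AI” might look something like the sketch below: a property check over an isolated, well-understood component. The component here is just a toy stand-in (a causally masked attention pattern), not something a paradigm has actually handed us yet.

```python
import torch

def attention_pattern(scores: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for an isolated component: causally masked, row-wise softmax."""
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    return torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

def test_no_attention_to_future():
    """A narrow-surface-area check: no position attends to later positions."""
    pattern = attention_pattern(torch.randn(8, 8))
    assert torch.all(pattern.triu(diagonal=1) == 0)      # future positions get zero weight
    assert torch.allclose(pattern.sum(dim=-1), torch.ones(8))  # each row is a distribution

test_no_attention_to_future()
```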

Artificial Selection in AI

I wrote about this bullet point here. I’ll aim to condense that and connect it to this thread later!

The Bitter Lesson

Rich Sutton’s Bitter Lesson essentially says that progress in AI is driven by scaling compute, not by understanding fundamentals. It has definitely held true quite generally, and we should be afraid that it will continue to hold for a while. Meanwhile, some argue that putting more research into MechInterp could accelerate capabilities, and that the safety benefit isn’t worth that cost. These seem to be opposing points, so can we discern which one applies?

Probably we simply don’t know. If MechInterp aids capabilities, that may simply get overshadowed by the next large-scale, empirical architecture shift. The moral of the Bitter Lesson is how we keep going back to the trough thinking “this time we’ll understand!”. One might even argue that MechInterp takes resources away from further empirical exploration, which has historically been the successful path. My underlying take is that the question is probably too hard to parse right now, and you need clear arguments to convince people to change course. As the field evolves, it does deserve regular check-ins.

For example, I think MechInterp has subfields we haven’t developed yet. One of these we could call “circuits”, which focuses on understanding the structures in networks that yield capabilities (say, induction heads). Another subfield could be labeled “representation”, aimed at understanding internal state (e.g. LogitLens). Clearly, these help each other and it’s a tenuous divide. I’d argue the “circuits” work is more likely to yield capabilities advancements, by identifying important components and how they arise. On the other hand, representation work might focus on: defining how we think of features, parsing network state, and understanding the process of extracting features. It would have fewer opinions on how networks should be designed, and more on how we can tell what a model is “thinking”. By taking an intentional path towards developing representation science, perhaps we lessen the impact on capabilities.
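
For a flavor of what “parsing network state” looks like in practice, here is a minimal logit-lens-style sketch against a Hugging Face GPT-2 checkpoint: project each layer’s residual stream through the final layer norm and the unembedding, and see which token the model is leaning towards at each depth. This is an illustrative reading in the spirit of LogitLens, not a faithful reproduction of that work.

```python
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Project each layer's residual stream through the final layer norm and the
# unembedding, then read off the top token at the last position.
for layer, hidden in enumerate(outputs.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(hidden[:, -1]))
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer:2d}: {top_token!r}")
```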

MechInterp is a Nascent Field

One point from the LessWrong post argues that MechInterp results are overhyped while direct adversarial research is having success, and I think it’s common to believe MechInterp is failing, too hard, not moving fast enough, etc. My take is that the field is simply too early to make such measurements. Finding adversarial attacks is currently easy because empirical science can’t effectively protect against them. We end up with a catch-22: “we shouldn’t do MechInterp because direct attacks work”, yet “we can’t reasonably prevent some user from finding an attack without some paradigm for safety”, and “MechInterp is our best approach to establishing a safety paradigm”.

My simple answer here is to re-evaluate the field in a few years. That may be enough time to discern a trajectory: its impact on capabilities and its contributions to safety.
