Rethinking Interpretability: It's not about neurons
When people talk about interpretability, the instinct is to look for features. Which neuron fires for sentiment? Which direction encodes truthfulness? Which circuit causes hallucination?
But this paper points toward a different lens. It’s not about individual neurons. It’s about directions and surfaces.
The researchers didn’t find a “character counter neuron.” They found that counting corresponds to movement along a low-dimensional manifold.
That means computation can emerge from geometry rather than explicit symbolic rules. This matters.
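To make "movement along a low-dimensional manifold" concrete, here is a minimal sketch in Python with NumPy. It does not use the paper's data or method. The activations are synthetic, and the helix-like curve, the 64-dimensional space, the noise level and the function names are all assumptions chosen for illustration. It shows only that a count can trace a smooth curve through a handful of dimensions of a much larger space, and that a simple PCA will reveal that structure.

```python
import numpy as np

def make_counting_activations(n_counts=100, dim=64, noise=0.01, seed=0):
    """Synthetic stand-in for model activations (an assumption, not real data):
    each count value sits at a point on a smooth curve that lives in a
    random 3-D subspace of a dim-dimensional space."""
    rng = np.random.default_rng(seed)
    counts = np.arange(n_counts)
    t = counts / n_counts
    # Helix-like curve: three coordinates that change smoothly with the count.
    curve = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t), t], axis=1)
    # Random orthonormal basis that embeds the 3-D curve in the full space.
    basis, _ = np.linalg.qr(rng.normal(size=(dim, 3)))
    acts = curve @ basis.T + noise * rng.normal(size=(n_counts, dim))
    return counts, acts

def explained_variance_ratio(acts):
    """Share of total variance captured by each principal component."""
    centered = acts - acts.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return variance / variance.sum()

counts, acts = make_counting_activations()
ratio = explained_variance_ratio(acts)
top3 = ratio[:3].sum()
print(f"Variance captured by the top 3 of {acts.shape[1]} components: {top3:.3f}")
```

In this toy setup almost all of the variance sits in three components, even though every one of the 64 coordinates moves as the count changes. Real activations are far messier, and PCA can only find flat subspaces, so treat this as an intuition pump rather than a detection method.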
If structured variables are encoded as geometric trajectories:
Interpretability tools need to detect manifolds, not just activations.
Steering a model might require nudging it along or across geometric directions.
Alignment may depend on understanding these internal structures at scale.
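The steering point can be sketched in a few lines. This is a toy illustration, not the paper's procedure: the vectors are random, and the `steer` helper and the `alpha` step size are hypothetical names introduced here. It shows the simplest version of a nudge, which is adding a scaled unit vector to an activation.

```python
import numpy as np

def steer(activation, direction, alpha):
    """Move an activation by alpha units along a direction.
    Hypothetical helper for illustration only."""
    unit = direction / np.linalg.norm(direction)
    return activation + alpha * unit

rng = np.random.default_rng(0)
dim = 16
direction = rng.normal(size=dim)   # assumed direction of interest
activation = rng.normal(size=dim)  # assumed activation vector
unit = direction / np.linalg.norm(direction)

steered = steer(activation, direction, alpha=2.0)

# The component along the direction shifts by exactly alpha.
shift_along = steered @ unit - activation @ unit

# Everything orthogonal to the direction is left untouched.
residual_before = activation - (activation @ unit) * unit
residual_after = steered - (steered @ unit) * unit
orthogonal_change = np.linalg.norm(residual_after - residual_before)
```

On a curved manifold, a single straight direction is only a local approximation. A large step along it can leave the surface entirely, which is one reason steering may need to follow the geometry instead of one fixed vector.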
We often describe large language models as black boxes. But black boxes can have internal architecture.
And if geometry underlies more advanced reasoning behaviours - planning, tracking uncertainty, maintaining context - then understanding that geometry becomes foundational.
In the final article, I’ll zoom out and ask a bigger question: Are we discovering that language models are more internally structured than we assumed?
