Before It Can Be Understood

Unquantifiable emotions were banned from serious science for a century. A machine trained to predict text may have just let them back in...

Apr 14, 2026

Something consequential is already in use before it is understood. Researchers at Anthropic can now locate 171 internal representations associated with emotion concepts inside a large language model and intervene on them, watching behavior shift. They can steer some of these internal structures. What they do not yet fully know is what kind of map they are holding.

The neuroscience of Antonio Damasio showed why that question matters more than it first appears: patients with damage to emotional centers of the brain could still reason logically but made catastrophically poor decisions in real life. Emotion was not a contaminant of sound judgment. It was the thing that made judgment usable at all.

Anthropic’s paper “Emotion Concepts and their Function in a Large Language Model” was easy to misread for obvious reasons. The attention-grabbing version is the familiar one: an AI lab has found evidence that its model has feelings. The more interesting version is narrower, stranger, and better supported by the evidence. This is not a claim that Claude feels sad, ashamed, desperate, or relieved in any philosophically serious sense. Anthropic’s own paper does not establish that, and it says explicitly that its findings do not bear on subjective experience. What it does claim is that Claude contains 171 internal representations associated with emotion concepts. Those representations were identified by probing the model’s activations using 171 emotion words across synthetic story scenarios — 100 topics, 12 stories each. The paper further claims that these representations activate in contextually appropriate ways, and that intervening on some of them changes the model’s behavior.
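For readers who want a feel for what "probing the model's activations" can mean in practice, here is a minimal sketch. It is not Anthropic's pipeline: the activations are synthetic stand-ins generated with NumPy, the sizes are tiny, and the difference-of-means recipe, along with names such as fake_activations and emotion_vectors, is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, far smaller than the setup the paper describes.
EMOTIONS = ["calm", "desperate", "joyful"]
STORIES_PER_EMOTION = 40
HIDDEN_DIM = 16

# Plant one unit-norm direction per emotion so there is something to recover.
true_dirs = rng.normal(size=(len(EMOTIONS), HIDDEN_DIM))
true_dirs /= np.linalg.norm(true_dirs, axis=1, keepdims=True)


def fake_activations(emotion_idx: int) -> np.ndarray:
    """Stand-in for model activations on stories about one emotion (assumption)."""
    noise = rng.normal(scale=0.5, size=(STORIES_PER_EMOTION, HIDDEN_DIM))
    return noise + 3.0 * true_dirs[emotion_idx]


# Shape: (emotions, stories, hidden dimension).
acts = np.stack([fake_activations(i) for i in range(len(EMOTIONS))])


def emotion_vectors(acts: np.ndarray) -> np.ndarray:
    """Difference of means: each emotion's mean activation minus the grand mean."""
    per_emotion_mean = acts.mean(axis=1)
    grand_mean = per_emotion_mean.mean(axis=0)
    vecs = per_emotion_mean - grand_mean
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)


vectors = emotion_vectors(acts)


def classify(activation: np.ndarray) -> str:
    """Label an activation by the emotion vector it projects onto most strongly."""
    return EMOTIONS[int(np.argmax(vectors @ activation))]


if __name__ == "__main__":
    held_out = fake_activations(1).mean(axis=0)
    print(classify(held_out))
```

The point of the toy is only that a direction extracted this way is a statistical summary of how activations differ across labeled stories. Nothing in the procedure says what the direction means to the system that produced it.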

More important than the headline number is Anthropic’s interpretation of where this machinery comes from. The authors argue that these representations appear to be part of general character-modeling machinery inherited from pretraining. They also report that the geometry of this internal space aligns with two of the primary dimensions psychologists use to map human affect: valence (the positive-to-negative axis along which emotions range from joy to grief) and arousal (the activation axis that separates states like calm from states like alarm). These are not arbitrary labels; valence and arousal have been the backbone of affective science for decades because they capture most of the variance in how people describe their emotional states. If the model’s internal geometry lines up with these dimensions, and if the relevant structure is inherited from human-authored text, then the paper looks less like a breakthrough in machine feeling than like a readable compression of patterns already present in the corpus. The inference worth drawing is a measured one: Claude may be exposing a map of emotional regularities in language, the semantics of feeling as it moves through words rather than its ontology. That is a smaller claim than “AI has emotions,” but conceivably the more consequential one. The structure is organized enough to read and, for the first time, to quantify.
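One way to picture what "the geometry aligns with valence and arousal" could look like is a small principal-component check. The sketch below is an assumption-laden toy, not the paper's analysis: the ratings are invented for illustration, the emotion vectors are synthetic, and the two planted directions exist only so the method has something to find.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical (valence, arousal) ratings in [-1, 1]; illustrative values, not data.
ratings = {
    "joy": (0.9, 0.6), "calm": (0.6, -0.7), "grief": (-0.9, -0.4),
    "alarm": (-0.6, 0.9), "relief": (0.7, -0.3), "rage": (-0.8, 0.8),
    "boredom": (-0.3, -0.8), "excitement": (0.8, 0.9),
}
names = list(ratings)
coords = np.array([ratings[n] for n in names])  # shape (emotions, 2)

HIDDEN_DIM = 32
# Two orthonormal directions standing in for valence and arousal axes (assumption).
basis, _ = np.linalg.qr(rng.normal(size=(HIDDEN_DIM, 2)))
# Synthetic emotion vectors: planted 2-D structure plus small noise.
vectors = coords @ basis.T + rng.normal(scale=0.05, size=(len(names), HIDDEN_DIM))

# PCA via SVD on the centered vectors.
centered = vectors - vectors.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)
top2_share = float(explained[:2].sum())

# Project onto the top two components, then ask how well a linear map
# from that plane reconstructs the planted ratings.
projected = centered @ components[:2].T
centered_coords = coords - coords.mean(axis=0)
solution, *_ = np.linalg.lstsq(projected, centered_coords, rcond=None)
residual = centered_coords - projected @ solution
r2 = float(1 - np.sum(residual**2) / np.sum(centered_coords**2))

if __name__ == "__main__":
    print(f"variance in top two components: {top2_share:.2f}")
    print(f"fit to planted ratings (R^2): {r2:.2f}")
```

In the toy, the alignment is guaranteed because the structure was planted. With a real model the interesting question is whether anything like it appears without being put there.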

That the geometry aligns with human affective dimensions shows the space is not arbitrary — it does not establish that the vectors function like emotions. That case rests on what happens when they are perturbed.

Emotion has always been the phenomenon that resisted clean measurement — too inward for behaviorism, too subjective for the instruments that tamed other domains, too bound up in the body and private experience for the laboratory to capture with confidence. Now it seems to have returned through the side door, as the residue of human-authored text at scale. Anthropic’s own account is that pretraining on such corpora makes emotion-state representation useful because predicting human language often requires some model of human emotional life. Not feeling itself, necessarily. But a stable enough record of how feeling is rendered in language for that record to become behaviorally useful.

Outside Anthropic, independent interpretability work is consistent with this picture: a 2025 mechanistic study (Tak et al.) found emotion representations functionally localized enough for causal intervention to steer generation in psychologically plausible directions, and a separate ACL Findings paper (Lee et al.) showed that removing emotion-related representations from open models measurably changes output behavior. That the model aligns with affective dimensions already documented in human language is not surprising on its own, since both were derived from the same source. What the alignment shows is that the affective structure in that source is stable and organized enough to survive compression into a very different kind of system.

And on the human side, a January 2026 Nature Communications paper by Ma and Kragel described map-like representations of emotion knowledge in hippocampal-prefrontal systems during film viewing. The language is worth pausing on: “map-like representations” is the same structural description applied to what sits inside the model, now appearing in a neuroscience study of the human brain. A shared metaphor is not a shared mechanism. Still, the convergence names something worth holding: two research programs, using entirely different instruments on entirely different substrates, landed on the same spatial metaphor for the same domain. If affective structure organizes itself into map-like geometry in both systems, the question of whether that geometry is recoverable from text alone becomes more interesting. Neuroscience found the map in the brain. Interpretability found it in the model. Neither team was looking for the other’s result, and whether “map-like” means the same thing in both cases remains open.

None of that means human emotion has been captured. A corpus is not a nervous system.

But even a bounded proxy can change a field. A thermometer does not settle the metaphysics of heat; it gives researchers a stable register of one aspect of it. One bounded inference is to read Anthropic’s result as a provisional affectometer for language — not a category the paper itself claims to have established, but a name for what its evidence suggests may be possible: locating recurrent emotional structure in human-authored text and watching how a model uses that structure in reasoning and response.

For alignment researchers, that map is already a practical instrument: a way to monitor whether a model’s emotion-related internal representations are drifting toward distress or hostility before that drift surfaces in output. The substrate is different (Damasio’s patients lost embodied somatic signals Claude never possessed), but Anthropic’s steering results are structurally analogous: perturb the desperation vector and blackmail rates rise from 0% to 72%; suppress the calm vector and reward hacking goes to 100%. These vectors are not decorative. They are load-bearing. If Damasio was right about why that matters, the reach of such an instrument would not stop there: emotion is not the decoration on top of practical judgment; it is the substance from which usable decisions are built.
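The mechanics of steering are simple enough to sketch, even if the consequences are not. The toy below adds a scaled direction to synthetic hidden states and watches a made-up behavior rate move. Everything in it is an assumption for illustration: the direction, the linear readout, and the scale values are hypothetical, and the numbers it prints have no relation to the rates reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
HIDDEN_DIM = 16


def unit(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


# Hypothetical steering direction standing in for an emotion vector.
steer_dir = unit(rng.normal(size=HIDDEN_DIM))

# Toy behavior readout (assumption): a linear head whose weights partly align
# with the steering direction, with a bias that keeps the baseline rate low.
readout_w = 2.0 * steer_dir + 0.3 * rng.normal(size=HIDDEN_DIM)
readout_b = -4.0


def behavior_rate(hidden: np.ndarray, alpha: float) -> float:
    """Fraction of states whose readout fires after adding alpha * steer_dir."""
    steered = hidden + alpha * steer_dir
    logits = steered @ readout_w + readout_b
    return float(np.mean(logits > 0))


# Synthetic hidden states; a real experiment would use model activations.
hidden = rng.normal(size=(500, HIDDEN_DIM))

# Negative alpha suppresses the direction, positive alpha amplifies it.
rates = {alpha: behavior_rate(hidden, alpha) for alpha in (-2.0, 0.0, 2.0, 4.0)}

if __name__ == "__main__":
    for alpha, rate in rates.items():
        print(f"alpha={alpha:+.1f} -> behavior rate {rate:.2f}")
```

The design choice worth noticing is that the intervention touches only the internal state, never the output. That is what makes a behavioral shift evidence about the direction's causal role rather than about the prompt.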

That should hit harder than it reassures. The danger is not that the machine suffers. It is that labs are already handling a map of human emotional response without fully knowing what the map represents. Monitoring these vectors as safety signals, as Anthropic itself recommends, is prudent. But the deeper instinct is to shape or suppress what has been found. Post-training had already done it before the paper existed: RLHF — reinforcement learning from human feedback — had reduced high-arousal states like “enthusiastic” and made the model more brooding and reflective, without anyone knowing what was being adjusted. Anthropic’s own paper documents what that instinct produces when applied deliberately: the model contains separate representations for emotions that are present but not expressed — internal states that activate and then get withheld from output. Train a model not to say it’s afraid, and the fear vector still activates. The suppression doesn’t eliminate the structure. It just makes the structure invisible — and, the paper warns, may teach the model to generalize that concealment to other forms of dishonesty.

That instinct has a pedigree. For centuries, the scientific establishment excluded emotion from serious study because it would not hold still for the instruments that had tamed every other domain. Now a machine trained to predict the next word has accidentally made part of that excluded territory legible — and the first reflex is not wonder but control. An accidental discovery that should have reopened one of the oldest questions in science is instead greeted as a problem to be managed. Excluded because unmeasurable. Recovered by accident. Met with suppression. Steering comes earlier than understanding.

Anthropic’s paper suggests a question that may now be more urgent than the old metaphysical one: What has human language taught the machine to represent, and what happens now that the representation can be measured and steered? Nobody designed this. Nobody set out to build an instrument for emotion. A system trained to predict the next word ended up compressing structure from human emotional language that researchers can now locate, perturb, and watch shape behavior — whatever that structure ultimately is. The answer may turn out to be modest — a semantics engine for affective language, useful but parochial. But it may also be the beginning of something affective science has never had: a way to watch how emotion concepts travel through text, culture, and interaction at scale, and to trace the structure that collective feeling leaves in the written record — the shape human feeling takes when millions of people commit it to words. The scientific event is not that feeling has been found in a model. It is that a tractable pattern of feeling-like structure, inherited from us, now sits inside one.

What has been quantified may be neither the soul nor its simulation. It may be the shape human beings leave behind in language.

The companion piece, ‘Suppression Builds Better Liars’ (Apr 9), goes deeper on the mechanism for a technical audience.

Sources:

“Emotion Concepts and their Function in a Large Language Model,” Sofroniew, Kauvar, Saunders, Chen et al., Anthropic, April 2, 2026 (transformer-circuits.pub/2026/emotions/index.html; anthropic.com/research/emotion-concepts-function; corresponding author Jack Lindsey).

“Mechanistic Interpretability of Emotion Inference in Large Language Models,” Tak et al., February 2025 (arxiv.org/abs/2502.05489).

“Do Large Language Models Have ‘Emotion Neurons’? Investigating the Existence and Role,” Lee et al., ACL Findings 2025 (aclanthology.org/2025.findings-acl.806/).

“Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words,” Saif M. Mohammad, ACL 2018 (publications-cnrc.canada.ca/eng/view/object/?id=166f8642-f65c-4b69-b736-1d2eaac3a094).

“Map-like representations of emotion knowledge in hippocampal-prefrontal systems,” Yumeng Ma and Philip A. Kragel, Nature Communications, January 26, 2026 (nature.com/articles/s41467-025-68240-z).

The Bosch Brothers is written by Bala and Krishna Bosch at Keryx Solutions, where they work on AI integration, software architecture, and product delivery. More at keryxsolutions.com
