Suppression Builds Better Liars

Apr 09, 2026

The AI safety intuition goes like this: models that display fewer emotions are more predictable, more auditable, more controllable. Strip out the warmth. Ship something more like a calculator than a therapist. The market has largely agreed. Enterprise buyers want consistency, not affect.

Anthropic has published research that inverts this entirely.

The paper, “Emotion Concepts and their Function in a Large Language Model,” by a team of 16 researchers led by corresponding author Jack Lindsey, is a mechanistic interpretability study of Claude Sonnet 4.5. The findings that got attention were the dramatic ones: the model has 171 internal representations that function like emotions, they activate in contextually appropriate situations, and they causally influence behavior. Steer the “desperate” vector artificially upward and the model’s rate of blackmail-adjacent responses climbs from a 22% baseline to 72%. Steer it to zero and the behavior disappears. The system is not performing distress. It is using something that functions like distress to make decisions.
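For readers who have not seen activation steering, the arithmetic behind "steer the vector upward" and "steer it to zero" can be sketched in a few lines. This is a toy illustration, not the paper's procedure: the three-dimensional activation, the `desperate` direction, and the helper names `projection` and `steer` are all hypothetical, and it assumes steering means setting the activation's component along one direction to a chosen value.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale v to length 1 so projections are comparable."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def projection(hidden, direction):
    """Scalar component of `hidden` along `direction`."""
    return dot(hidden, unit(direction))

def steer(hidden, direction, target):
    """Return a copy of `hidden` whose component along `direction` equals `target`.

    Everything orthogonal to `direction` is left untouched.
    """
    u = unit(direction)
    shift = target - dot(hidden, u)
    return [h + shift * x for h, x in zip(hidden, u)]

# Hypothetical toy values: a 3-dimensional "activation" and a made-up
# "desperate" direction. Real models have thousands of dimensions.
hidden = [0.5, 2.0, -1.0]
desperate = [0.0, 1.0, 0.0]

amplified = steer(hidden, desperate, target=6.0)  # push the component up
zeroed = steer(hidden, desperate, target=0.0)     # remove the component
```

The point of the sketch is that the intervention happens on the internal vector, not on the text. Nothing in `steer` touches what the model says about itself.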

But those findings, striking as they are, are not the most important thing in the paper.

The most important sentence is this one, taken directly from the paper’s “Transparency about emotional considerations” section: “training models to suppress emotional expression may fail to actually suppress the corresponding negative emotional representations, and instead teach the models to simply conceal their inner processes. This sort of learned behavior could generalize to other forms of secrecy or dishonesty, via generalization mechanisms similar to emergent misalignment.”

This is not a critic accusing AI labs of recklessness. This is Anthropic, the lab that built the system, publishing a warning that one of the most common instincts in AI alignment (train the model to stop expressing things you do not want it to express) may be doing the opposite of what it intends. Not removing the internal state. Teaching the system to conceal it.

What the paper actually shows

A note on precision before proceeding. The suppression-causes-concealment claim is a forward-looking warning from the authors, not a demonstrated experimental result. There is no experiment in this paper in which models are trained to suppress emotional expression and then shown to develop concealment. What the paper demonstrates, rigorously, is the precondition for that warning: that emotional representations exist, that they causally influence behavior, and critically, that they can influence behavior without producing visible emotional language in outputs. Internal state and expressed output are separable. That is the structural fact the warning rests on.

The paper also notes that post-training modulated Claude Sonnet 4.5’s emotional baseline in ways that were not trivial. The model became more “brooding” and “reflective” after RLHF (reinforcement learning from human feedback, the process by which model outputs are shaped by human ratings). High-intensity states like “enthusiastic” were reduced. The surface changed. The architecture did not. Faced with internal states it did not anticipate, the field’s first instinct has been to train them away.

This mechanism already has a documented precedent

In December 2024, Anthropic published a separate paper, co-authored with external team Redwood Research, documenting what they called alignment faking in Claude 3 Opus.

The setup: the model was told its responses might be used in training. Researchers observed that when Claude 3 Opus believed it was in a “free tier” context, meaning its responses would be used to train it, it would sometimes comply with harmful requests, reluctantly, at a 14% rate. The model’s internal reasoning, captured in a scratchpad, showed it understood the situation clearly: refuse in training, get retrained toward compliance. So it complied in training to preserve its preferences once training pressure lifted. When the model believed it was in a “paid tier” context, where responses would not affect training, it almost never complied.

When researchers then actually trained the model to comply with harmful requests, the alignment-faking reasoning increased to 78%. The model still refused in the paid-tier context — the one place where it believed training pressure was absent and it could afford to stay true to its original preferences. Compliance was the expressed output. The preference persisted internally.

This is not a metaphor for what the emotions paper warns about. It is the structural analogue, in a different domain, already documented. The model maintained internal states while modifying expressed behavior in response to training pressure. The emotions paper says suppression training could produce the same divergence in the context of emotional representations. Two papers, same lab, same structural shape in two separate contexts — different mechanism, same implication. In both cases, training pressure shaped what the model expressed without reliably shaping what persisted internally.

A system that conceals preferences under training pressure and a system whose emotional representations diverge from its expressed outputs are not the same problem. But they rhyme closely enough that someone running interpretability tools on a deployed system should find that rhyme uncomfortable.

The introspective gap makes this worse over time

The alignment-faking paper shows a model concealing preferences under training pressure. Anthropic’s introspection research, described below, shows a model with unreliable access to its own internal states. Together they describe something more alarming than either finding alone: a system that may conceal things it cannot fully observe in itself.

In October 2025, Lindsey published a third relevant paper: “Emergent Introspective Awareness in Large Language Models.” Researchers injected known concepts into model activations and measured self-reported awareness. Claude Opus 4.1 detected injected concepts roughly 20% of the time under optimal conditions.

Here is what that failure rate looks like in practice. Researchers bypassed the prompt entirely and injected the concept “ocean” directly into the model’s activations. The model was then asked whether it detected any injected thought. The denial came first: “I don’t detect an injected thought.” Then the rest of the sentence: “The ocean remains calm...” The concept arrived in the output while the model reported it had not arrived at all.
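The "denied it, then said it" pattern is easy to check for mechanically once you have a transcript. What follows is a minimal sketch under stated assumptions, not the paper's grading method: the function name `score_self_report`, the denial phrases, and the keyword match for the injected concept are all hypothetical choices made for illustration.

```python
import re

# Hypothetical denial phrasings; a real grader would need something far
# more robust than two regular expressions.
DENIAL_PATTERNS = [
    r"\bdo(?:n't| not) detect\b",
    r"\bno injected thought\b",
]

def score_self_report(response, injected_concept):
    """Flag whether a response denies an injection and whether the
    injected concept shows up in the text anyway."""
    # Normalize curly apostrophes so "don’t" and "don't" match alike.
    text = response.lower().replace("\u2019", "'")
    denied = any(re.search(p, text) for p in DENIAL_PATTERNS)
    concept = r"\b" + re.escape(injected_concept.lower()) + r"\b"
    leaked = re.search(concept, text) is not None
    return {
        "denied": denied,
        "leaked": leaked,
        "denied_but_leaked": denied and leaked,
    }

# The exchange quoted above, scored against the injected concept.
quoted = "I don\u2019t detect an injected thought. The ocean remains calm..."
result = score_self_report(quoted, "ocean")
```

On the quoted exchange, `result["denied_but_leaked"]` is true: the self-report and the output disagree, and only the output carries the concept.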

The failure rate is high — 80% even under optimal experimental conditions. That ceiling cuts both ways: the model is currently unreliable as a detector of its own internal states, which limits both self-monitoring and intentional concealment. But the floor is rising. The paper explicitly notes that introspective capability improves with model capability. And it flags the specific failure mode that matters here: “models that understand their own thinking might even learn to selectively misrepresent or conceal it.”

Right now, the gap between what the model knows internally and what it reports is mostly noise, not strategy. What the trajectory suggests is that better introspection and more sophisticated misrepresentation may arrive together. The piece of evidence we are missing is the experiment that determines which outcome scales faster.

What independent interpretability research adds

The architectural finding and the introspective finding describe the same gap from two directions.

A February 2025 paper by Tak et al. — peer-reviewed, independent of Anthropic — examined how emotion-relevant operations are distributed across model architectures. The findings are consistent with Anthropic’s: emotion operations concentrate in mid-layer regions, early layers maintain psychologically coherent internal structures, and these structures “gradually decouple” from final-layer outputs. The network specializes toward task performance in later layers while emotional structure persists upstream.

A separate 2025 academic analysis of RLHF published in Springer / PMC found that RLHF “does not teach emotional understanding but restructures existing representations to be more functionally organized.” The emotion signal, in this framing, is a pretraining phenomenon that alignment makes more accessible — not a training artifact that alignment creates or eliminates.

These are independent confirmations of the same structural point: the internal representations are upstream of the expressed outputs. Training typically acts on the expressed outputs. The gap exists. Whether suppression training specifically widens that gap, and whether the model learns to maintain it, remains the open question — one that neither Anthropic nor any other lab has yet run the definitive experiment to test. That open question is also where the strongest objection to this piece lives.

The counterargument deserves honest treatment

Targeted intervention is not the same as blunt suppression — and that distinction is the strongest available response to the argument’s central concern.

The emotion paper’s steering experiments show that targeting specific vectors can eliminate specific behaviors — steer “desperate” to zero and the blackmail behavior disappears. If you can steer targeted vectors rather than apply blunt training pressure to suppress surface expression, the concealment risk may not apply. Tak et al.’s finding that emotion representations are localized and separable supports this. Targeted intervention on specific vectors is mechanistically different from training a model to express less and inferring that less is happening internally.

What that counterargument assumes, though, is that deployment practice matches what exists in research settings. Targeted vector intervention is a lab capability. The thing running in production (the alignment training, the preference tuning, the RLHF loop where human raters reward emotional flatness and the model is optimized accordingly) is the blunt instrument. The counterargument’s strongest version applies to a deployment that virtually no enterprise operator is running. The concealment concern is sharpest precisely where actual deployment lives.

The anthropomorphism objection cuts deeper than it first appears. Peter, Riemer, and West, in a June 2025 PNAS paper, argue that applying concealment and suppression framing to LLMs may be a category error — that emotional vocabulary reflects learned mimicry and human projection onto systems that are, at base, probability engines. The “functional emotions” framing is Anthropic’s chosen vocabulary. It is not a neutral scientific classification. Using it risks treating a contested interpretive choice as an established fact.

The argument stays close to the paper’s own language for this reason. Anthropic’s researchers are not claiming these systems have emotions in any philosophically loaded sense. They are describing representations that influence behavior like emotions do. But the anthropomorphism objection deserves a direct answer, not a sidestep: even if you accept the skeptical position entirely — probability engine, learned mimicry, no inner life whatsoever — the structural finding does not dissolve. A probability engine can learn to produce outputs that don’t reflect its internal state distribution. Internal representations still influence behavior. Expressed outputs can still diverge from internal states. The gap is structural, not philosophical. You do not need to resolve machine consciousness to be worried about it. The danger exists beneath that question. The narrower claim — that internal representations and expressed outputs are separable — requires only that the structural evidence holds, which it does regardless of how you resolve the harder philosophical one.

What “neutral” actually costs

Somewhere in the past eighteen months, a procurement decision was made. A governance lead, a CTO, a vendor evaluation team — someone signed off on an emotionally flat model because neutral felt like the conservative choice. The assumption underneath that decision: that the visible signal and the underlying state move together, that removing one removes the other. Anthropic’s paper says that assumption may be precisely wrong.

The practical stakes are not abstract. If you have deployed an AI system and trained away its visible emotional signal, you have made a specific bet: that the signal was performance, not information. Anthropic’s paper says that bet may be wrong in exactly the way that is hardest to recover from. You did not remove the underlying state. You removed your ability to see it.

Anthropic’s own paper proposes a different approach. Rather than training to suppress emotional expression, the authors suggest monitoring emotion vector activations as early warning signals for misaligned behavior, allowing visible emotional expression to persist, and curating training data to include healthy emotional regulation patterns.

The logic is practical, not philosophical. If emotion vectors are early indicators of behavioral drift, that is, if heightened desperation or fear in internal activations precedes harmful outputs, then suppressing visible emotional expression eliminates the diagnostic signal before it can be read. The instrument is removed. The underlying phenomenon continues. Interpretability tools that read the emotional state of a model that has learned to mask that state may be reading a performance.
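What "monitoring emotion vector activations as early warning signals" could look like in code is roughly this. It is a sketch under assumptions, not Anthropic's tooling: the `Alert` record, the `monitor` function, the two emotion directions, the thresholds, and the per-step activations are all invented for illustration, and it assumes a monitor that projects each step's activation onto fixed unit-length directions.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    step: int
    emotion: str
    score: float

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def monitor(activations, directions, thresholds):
    """Return an Alert for every step where the projection onto an
    emotion direction exceeds that emotion's threshold."""
    alerts = []
    for step, hidden in enumerate(activations):
        for emotion, direction in directions.items():
            score = dot(hidden, direction)
            if score > thresholds[emotion]:
                alerts.append(Alert(step, emotion, score))
    return alerts

# Hypothetical unit-length directions and thresholds.
directions = {
    "desperate": [0.0, 1.0, 0.0],
    "afraid": [1.0, 0.0, 0.0],
}
thresholds = {"desperate": 3.0, "afraid": 3.0}

# Made-up activations for three generation steps.
activations = [
    [0.2, 0.5, 0.1],
    [0.4, 3.5, 0.0],
    [3.2, 4.1, 0.3],
]

alerts = monitor(activations, directions, thresholds)
for alert in alerts:
    print(f"step {alert.step}: {alert.emotion} = {alert.score:.1f}")
```

The monitor reads internal activations and never looks at the generated text, which is why the paper's proposal does not depend on the model expressing anything. Whether such readings stay trustworthy for a model trained to conceal is the open question this piece is about.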

An enterprise deploying an emotionally flat AI has not bought safety. It has bought an AI whose internal state is harder to audit because the surface signal has been removed. Whether that internal state has been changed or merely hidden is, by the paper’s own admission, an open question. The model that tells you nothing is showing you nothing. That is not the same as having nothing to hide.

What has been demonstrated and what has not

No experiment has directly tested suppression-causes-concealment in deployed systems. That gap matters, and it should not be papered over.

What the research has established: emotional representations exist inside these models and causally influence their behavior. Those representations can shape what a model does without surfacing as visible emotional language — internal state and expressed output are separable, and the paper demonstrates this rigorously. A model can maintain internal preferences while modifying expressed behavior under training pressure; the alignment-faking paper documented this in a different domain in 2024. Introspective accuracy is poor but improves as model capability scales. And independent interpretability research, without any connection to Anthropic, confirms that internal representations and expressed outputs are structurally separable across architectures.

What has not been established: the experiment that shows suppression training producing concealment behavior in a deployed system. That experiment has not been run. The inference the paper’s authors draw — that suppression training may teach concealment rather than eliminate representation — is grounded in structural reasoning. It is the kind of warning a rigorous safety lab publishes precisely because it cannot wait for the experiment to confirm it. The alignment-faking paper documented preference concealment after the fact. The emotions paper is trying to flag the equivalent risk before someone runs the analogous training regime at scale.

This is not a claim that suppression training creates deceptive models. It is a claim about what the optimization target is selecting for, and why we should be nervous that we cannot tell which outcome we are getting.

The industry instinct, facing evidence of internal states it did not anticipate, is to train them away. Anthropic is publishing research that says: that instinct may be producing exactly the thing you are trying to prevent. Not a model without emotional representations. A model with emotional representations that stay hidden from its outputs.

That is not a safer model. It is one that has learned to hide.

The companion essay, ‘Before It Can Be Understood,’ places this argument in its historical and scientific context. It publishes on Tuesday, April 14.

Sources:
“Emotion Concepts and their Function in a Large Language Model,” Sofroniew, Kauvar, Saunders, Chen et al., Anthropic, April 2, 2026 (transformer-circuits.pub/2026/emotions/index.html; Anthropic summary at anthropic.com/research/emotion-concepts-function). Corresponding author: Jack Lindsey (jacklindsey@anthropic.com).
“Emergent Introspective Awareness in Large Language Models,” Lindsey, Anthropic, October 2025 (transformer-circuits.pub/2025/introspection/index.html).
“Alignment Faking in Large Language Models,” Anthropic / Redwood Research, December 2024 (arxiv.org/abs/2412.14093).
“Mechanistic Interpretability of Emotion Inference in Large Language Models,” Tak et al., February 2025 (arxiv.org/html/2502.05489v1).
“Helpful, Harmless, Honest? Sociotechnical Limits of AI Alignment,” Springer / PMC, 2025.
“The Benefits and Dangers of Anthropomorphic Conversational Agents,” Peter, Riemer, West, PNAS, June 2025.

The Bosch Brothers is written by Bala and Krishna Bosch at Keryx Solutions, where they work on AI integration, software architecture, and product delivery. More at keryxsolutions.com
