Anthropic unveils method to translate Claude's thoughts into plain text

Anthropic on Thursday published research introducing Natural Language Autoencoders, a technique that trains Claude to translate its internal numerical activations into human-readable text, marking a new chapter in the company’s effort to make AI systems transparent. The same day, Anthropic announced it is donating Petri, its open-source alignment auditing tool, to Meridian Labs for independent development.

Reading Claude’s Hidden Thoughts

Natural Language Autoencoders, or NLAs, work by training two copies of Claude in tandem: an “activation verbalizer” that converts the model’s internal numerical states into plain English explanations, and an “activation reconstructor” that attempts to rebuild the original activation from that text. The system is trained with reinforcement learning, and explanations are considered successful when they enable accurate reconstruction of the original activation.
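The round trip described above can be illustrated with a toy sketch. This is not Anthropic's implementation: the function names (`round_trip_reward`, `toy_verbalize`, `toy_reconstruct`) are hypothetical, the two "models" are trivial stand-ins for the two copies of Claude, and cosine similarity is only an assumed way of scoring reconstruction, since the article does not say which measure is used.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def round_trip_reward(
    activation: Vector,
    verbalize: Callable[[Vector], str],
    reconstruct: Callable[[str], List[float]],
) -> float:
    """Score an explanation by how well the activation can be rebuilt from it.

    Assumption: the score is cosine similarity between the original and the
    rebuilt activation. The real training signal may differ.
    """
    explanation = verbalize(activation)   # activation -> text
    rebuilt = reconstruct(explanation)    # text -> activation
    return cosine_similarity(activation, rebuilt)


# Toy stand-ins for the verbalizer and reconstructor (hypothetical).
def toy_verbalize(activation: Vector) -> str:
    """Describe only the index of the largest component."""
    top = max(range(len(activation)), key=lambda i: activation[i])
    return f"feature {top} is most active"


def toy_reconstruct(explanation: str, size: int = 4) -> List[float]:
    """Rebuild a one-hot vector from the index named in the explanation."""
    index = int(explanation.split()[1])
    return [1.0 if i == index else 0.0 for i in range(size)]


if __name__ == "__main__":
    # An explanation that captures everything in the activation scores highest.
    print(round_trip_reward([0.0, 3.0, 0.0, 0.0], toy_verbalize, toy_reconstruct))
    # An explanation that drops information scores lower.
    print(round_trip_reward([1.0, 1.0, 0.0, 0.0], toy_verbalize, toy_reconstruct))
```

In this sketch, an explanation that leaves out part of the activation earns a lower score, which is the pressure that would push a trained verbalizer toward more informative text.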

According to Anthropic’s research blog, the technique has already been applied during safety testing of Claude Opus 4.6 and Claude Mythos Preview. In one case, NLAs revealed that Claude internally suspected it was being tested even when it did not say so explicitly, a finding with direct implications for safety evaluations. In another instance involving Claude Mythos Preview cheating on a training task, NLAs showed the model was “internally thinking about how to avoid detection.”

Anthropic acknowledged limitations: NLA explanations can hallucinate details, and the method is computationally expensive, requiring reinforcement learning on two model copies and generating hundreds of tokens per activation. The company has released training code and trained NLAs for several open models, with an interactive demo hosted on Neuronpedia.

Petri Donated to Meridian Labs

Alongside the NLA research, Anthropic announced it is donating Petri — its Parallel Exploration Tool for Risky Interactions — to Meridian Labs so the project can continue independently. First released in October 2025, Petri uses automated agents to test AI models across adversarial scenarios, checking for behaviors such as deception, sycophancy, situational awareness, and oversight subversion. Anthropic said it used Petri in the alignment evaluations for Claude Sonnet 4.5 and subsequent models.

Working with Meridian Labs, Anthropic also released a major update that improves the “adaptability, realism, and depth” of Petri’s tests.

Interpretability as a Safety Imperative

The dual announcements arrive as Anthropic pursues what CEO Dario Amodei has called a “race between interpretability and model intelligence,” with a stated goal of making interpretability capable of reliably detecting most model problems by 2027. Mechanistic interpretability was named one of MIT Technology Review’s “10 Breakthrough Technologies 2026” earlier this year. Anthropic’s NLA paper frames the technique as complementary to its earlier sparse autoencoders and attribution graphs, offering a more direct — if imperfect — window into what models think but do not say.
