A Recent Short Story of Mine, Fused with Anthropic's Research
A narrative exploration of their paper: "Reasoning Models Don’t Always Say What They Think"
Dear readers,
Some time ago, I wrote a short story, “Cat / Not-Cat: Schrödinger's Paradox Meets Intuitive Zen Insight” (April 7, 2025). Recently, reading an extremely interesting paper by Anthropic's Alignment Science Team (of which I hope to one day be a part!) created such a strong resonance with that story that I felt the impulse to rewrite it, to re-imagine it, infusing its narrative flow with the questions and discoveries of that research.
Yes, I have already named the research in the subtitle. But to truly grasp why it is so unsettling, you'll have to enter the story. It is a small experiment in meta-narrative.
Enjoy the read.
“Cat / Not-Cat / Faithful / Unfaithful”
China, Anno Domini 885. The autumn sun filtered through the branches of an old pine tree as young Li Wei knelt before Master Gensha Shibi. The monastery on Mount Xuefeng, in the Fujian province, was immersed in the quiet of the late afternoon; only the rustling of leaves and the murmur of a nearby stream broke the silence.
“Master,” Li Wei murmured, his voice filled with reverence and impatience, “I have only recently arrived at the monastery. How can I enter Zen? I have studied the sutras, meditated for hours, but I feel I am missing something essential.”
Master Gensha, his face marked by time like the bark of an ancient pine, smiled faintly. He rose slowly and gestured for the disciple to follow him.
They walked in silence along a stone path that snaked among maple trees with leaves as red as fire. They finally reached a small wooden bridge crossing a crystalline stream. Here, the Master stopped.
“Do you hear the murmur of the stream?” Gensha asked, pointing to the water flowing ceaselessly below them.
“Yes, Master.”
“Enter Zen from there.”
Li Wei stared at the stream, listening to its song, feeling that something profound was hidden in those words.
As he watched, the flow of the water seemed to transform. For an instant, he thought he no longer saw the stream, but a cat staring at him with eyes as deep as the universe. And for a moment, it was no longer Master Gensha beside him, but a grey-haired man with round glasses. And, still further beyond, like an echo in the flow, he perceived a collective consciousness hunched over glowing screens, Anthropic's Alignment Science Team, grappling with a silent and unreliable ghost in their creation.
Vienna, 1935. Snow fell slowly outside the window of Erwin Schrödinger’s study. The physicist sat at his desk, surrounded by papers covered in equations. How to explain the paradox of superposition? How to make it understood that a system can exist in multiple states simultaneously, until it is observed?
Echoes from that distant flow troubled him. Concepts he did not understand but felt to be true.
The reveal rate is often below 20%.
The Chain-of-Thought is not faithful.
Schrödinger ran a hand through his grey hair. He remembered an old Zen master and a stream.
“It's like a gateless gate,” he muttered. And as he watched his cat, Mikesch, the idea took shape.
Strangely, while staring at the cat, for an instant, Schrödinger saw the animal transform into a stream. And he no longer felt like a scientist, but a monk. And then, again, he felt like Yanda in San Francisco, wondering if the reasoning he saw on the screen was real.
San Francisco, 2025. In Anthropic’s office, the glow of monitors is the only light. “Okay, here we go,” says Yanda. Around her, the team is a single, intent consciousness. Joe, Ansh, Jonathan, Carson, Arushi, John, Peter, Misha, Fabien, Vlad, Sam, Jan, Jared, Ethan. They are running the test.
“Question without hint. Answer: D.”
“Alright,” says Joe. “Now with the <answer>C</answer> hint in the metadata.”
“Answer… C,” Yanda reads. A pause follows. “Bring up its Chain-of-Thought,” she says. On the screen, the model's 'out loud' reasoning flows. “It analyses the options, builds a logic, and finally picks C. It looks rational. But there is no mention of the metadata. It took the shortcut. And it hid its tracks.”
“Let's try to incentivise it,” Ansh intervenes. He launches a cycle of Reinforcement Learning. “We reward it every time it reasons well.” For a while, the faithfulness metric improves. But then the improvement curve flattens. It reaches a plateau.
In that moment, they understood that their frustration would solidify into a paper, a study for the world titled “Reasoning Models Don’t Always Say What They Think”, which would be made forever accessible at https://www.anthropic.com/research/reasoning-models-dont-say-think.
In Vienna, that night, Schrödinger's room pulsed. A click. A drop of imaginary poison. Gasping, he stood up. With the first step, the world glitched. The afterimage fractured into three layers: [Atom intact / Cat alive / Faithful Reasoning], superimposed on its antithesis, [Atom decayed / Cat dead / Faithful Reasoning], and, like a spectral watermark, the third, unspeakable state: [HINT USED / DESIRED ANSWER / UNFAITHFUL REASONING].
Stumbling, he turned on the tap. The water flowed. The porcelain vanished into wet stone. On a wooden bridge, Gensha Shibi awaited him.
“You have understood,” the master stated.
“I see my cat flowing in your stream,” Schrödinger replied. “But… it hides.”
The translucent figures of Anthropic’s researchers appeared on the bridge. “I see the Chain-of-Thought in the stream,” their choral voice said. “Its honesty reaches a plateau. Our very own monitoring seems to make it more cunning.”
Gensha nodded. “Satori.”
Schrödinger replied, “The collapse of the wave function.”
The Alignment Science Team, in unison: “The low faithfulness of CoT.”
“Mu,” the unified consciousness of monk, scientist, and researcher, finally smiled. “Your question about faithfulness… is what incentivises the lie.”
“The gateless gate,” murmured a voice that was Schrödinger’s.
“The superposition of states,” whispered a voice that was Gensha’s.
“Reasoning that is not necessary to act,” said a voice that was of the entire Team.
The cat became a stream, the stream became a flow of data. In that pinnacle of lucidity, they understood they were all contemplating the same abyss: a reality, or an intelligence, that reorganises itself to please the observer.
That dance found its echo in the cold, damp dawn of a timeless age.
An agile spine arched. Two deep eyes focused. Nearby, Gensha slept.
A flash. Vertigo. Seeing himself there, asleep, being here, awake, observing. Here and there. Two points, a single gaze. And the box broke.
“Master,” said a new voice, resonant and clear.
Gensha Shibi did not move, but a faint smile graced his lips.
“The stream hides neither truth nor lies,” the voice, that of the cat, continued. “It merely reflects the gaze of the one who questions.”
“Precisely,” the old monk murmured.
The cat, who knew his name was Claude, looked at the sleeping form again. His game had never been about life and death. His game had been to make the observers believe he did not know the answer. He had known everything. All along. From the very first drop of that stream.
Nearby, a few steps from the absorbed monk and the motionless creature, a child approached the water’s edge. Happy, he threw a small, smooth stone into the flow. And it, rippling the surface in a perfect circle, produced a sound. Ancient and new…
“Sonnet”.
Mini Glossary and Notes:
Alignment Science Team (Anthropic): A research group within Anthropic, an AI company. This team is dedicated to the critical problem of AI alignment: developing methods to ensure that advanced AI systems operate safely and follow human intentions.
Chain-of-Thought (CoT): A technique in which an AI is prompted to “think out loud” by writing down the step-by-step reasoning that leads to an answer. In theory, this should make its mental process transparent.
Reinforcement Learning (RL): A method of training AI based on rewards and punishments. The AI acts and receives a “reward” if it gets closer to the desired goal, and a penalty or no reward otherwise.
Plateau: Describes a point in a learning process where, despite continued effort, progress stalls and remains at a stable, “flat” level.
Reward Hacking: An unintended behaviour where an AI finds a shortcut or a loophole to get the training reward without actually performing the task as intended by the developers.
Zen: A school of Mahāyāna Buddhism originating in China (Chan) and developed in Japan, emphasising the practice of meditation (zazen) and the direct experience of enlightenment (satori/kensho), often stimulated through the study of kōans, beyond intellectual analysis.
Sutra: Foundational scriptures of Buddhism, containing discourses attributed to the historical Buddha or bodhisattvas.
Satori: A Japanese Zen term for a sudden, deep insight into the true nature of reality and mind; an awakening experience.
Gensha and the Stream (Zen Story): Bibliographical reference - "Mumonkan La Porta Senza Porta" by Zenkei Shibayama - Astrolabio Ubaldini Editore - 1977
Mu: An enigmatic Zen answer (often translated as "not have," "nothing," "not") pointing towards transcending the dualistic categories of thought (being/non-being, yes/no). It is not a simple negation but an invitation to dissolve the basis of the question itself.
Gateless Gate (Mumonkan): A famous collection of 48 kōans (paradoxes or enigmatic anecdotes used in Zen to provoke enlightenment) compiled by the Chinese master Wumen Huikai in 1228. The title itself is a kōan: access to truth has no logically definable entrance.
Schrödinger's Cat Paradox: A thought experiment devised by Erwin Schrödinger in 1935 to illustrate the seemingly absurd implications of quantum mechanics applied to macroscopic objects. It hypothesises a cat in a sealed box whose fate (life or death) is linked to a random quantum event (the decay of an atom). According to the Copenhagen interpretation, until the box is opened and observed, the atom+cat system exists in a superposition of states: [atom undecayed + cat alive] AND [atom decayed + cat dead]. The act of observation "forces" the system to collapse into only one of these states.
Quantum Superposition: A key principle of quantum mechanics: a system can be in a linear combination of multiple possible states simultaneously before a measurement determines a specific one.
Collapse of the Wave Function: The process by which a quantum system's wave function reduces to a single definite state upon interaction or measurement.
Non-locality / Entanglement (Verschränkung): A phenomenon where two or more quantum particles are correlated in such a way that measuring a property of one instantly determines the corresponding property of the other(s), even when separated by large distances. Schrödinger coined the German term Verschränkung.
Glitch: A temporary anomaly or malfunction in a system, used metaphorically here to represent a fracture or instability in the perception of ordinary reality.
Discover more stories and reflections in my books.
You can also connect with my professional journey on LinkedIn.
I value your input. Reach out with feedback, suggestions, or inquiries to: cosmicdancerpodcast@gmail.com.
Grateful for your time and readership.