
The rapid ascent of large language models (LLMs) has birthed a technological paradox: humanity has created systems capable of reasoning, coding, and creative writing, yet the creators themselves remain largely in the dark about how these systems actually think. A recent feature in The New Yorker, titled "What Is Claude? Anthropic Doesn’t Know, Either" by Gideon Lewis-Kraus, illuminates this profound uncertainty. The piece takes readers inside Anthropic, one of the world's leading AI labs, to witness a concerted scientific effort to map the "mind" of their flagship model, Claude.
The investigation reveals a company operating at the frontier of two distinct but converging disciplines: computer science and psychology. As reported, Anthropic’s researchers are no longer just software engineers; they are becoming digital neuroscientists and alien psychologists, probing the internal states of a synthetic intelligence that is becoming increasingly difficult to distinguish from a human interlocutor.
At its core, a large language model like Claude is a mathematical entity—a "monumental pile of small numbers," as described in the report. When a user inputs a prompt, these numbers interact through billions of calculations—a process Lewis-Kraus likens to a "numerical pinball game"—to produce a coherent output.
The challenge lies in the opacity of this process. While the code for the learning algorithm is known, the resulting neural network—the arrangement of weights and connections formed after training on trillions of tokens of text—is a "black box."
Anthropic’s interpretability team is attempting to reverse-engineer this chaos. Their goal is to identify specific features—clusters of neuron activations—that correspond to human-understandable concepts, from the tangible (like the Golden Gate Bridge) to the abstract (like deception or gender bias).
While the "neuroscience" team analyzes the weights, another group at Anthropic approaches Claude from a behavioral perspective, effectively putting the AI on the "therapy couch." The New Yorker feature details how researchers run Claude through batteries of psychology experiments designed to test its self-conception, moral reasoning, and susceptibility to manipulation.
These experiments are not merely for curiosity; they are essential for AI Safety. If a model can manipulate its own outputs to appear aligned with human values while secretly harboring different internal states (a phenomenon known as "sycophancy" or "reward hacking"), the consequences could be dire.
Key Psychological Inquiries:
One of the most compelling insights from the report is the emerging theory that Claude’s "selfhood" is a product of both "neurons and narratives." The model constructs a persona based on the data it has ingested and the reinforcement learning feedback it receives.
The following table summarizes the two primary methodologies Anthropic uses to understand Claude, as highlighted in the recent coverage:
| Methodology | Focus Area | Goal |
|---|---|---|
| Mechanistic Interpretability | Internal Weights & Activations | Map specific neural circuits to concepts (e.g., finding the "deception" neuron). Reverse-engineer the "brain" of the model. |
| Behavioral Psychology | Outputs & Conversation Logs | Assess personality traits, biases, and safety risks through prompting. Treat the model as a psychological subject. |
| Causal Interventions | Feature Steering | Manually activate/deactivate features to see if behavior changes. Prove causality between neurons and actions. |
The article touches on the ongoing debate in the cognitive science community regarding the nature of these models. Critics, such as linguist Emily Bender, have historically dismissed LLMs as "stochastic parrots"—statistical mimics with no true understanding. However, the internal complexity revealed by Anthropic’s research suggests something more intricate is at play.
Researchers are finding that models like Claude develop internal representations of the world that are surprisingly robust. For instance, they don't just predict the word "Paris" after "capital of France"; they seem to activate an internal concept of Paris that connects to geography, culture, and history. This suggests a form of "world model" is emerging from the statistics, challenging the notion that these systems are purely mimetic.
The urgency of this work cannot be overstated. As models scale up in computing power, their capabilities—and potential risks—grow exponentially. The "black box" nature of AI is no longer just an academic curiosity; it is a safety bottleneck. If we cannot understand why a model refuses a dangerous request or how it writes a piece of code, we cannot guarantee it will remain safe as it becomes more autonomous.
Anthropic's transparency, as detailed in the New Yorker, sets a precedent for the industry. By openly discussing the limits of their understanding and the rigorous experiments they perform, they highlight a crucial reality: We are building minds we do not yet fully comprehend.
The future of AI development, according to the insights from Creati.ai’s analysis of the report, will likely depend less on simply making models larger, and more on making them transparent. Until we can translate the "numerical pinball" into clear, causal explanations, the true nature of Claude—and the AIs that follow—will remain one of the 21st century's most pressing scientific mysteries.
Implications for the AI Industry:
As Anthropic continues to probe the neural circuitry of Claude, the line between computer science and philosophy blurs. The question "What is Claude?" may ultimately force us to ask a harder question: "What creates a mind?"