Richard Sutton, a foundational figure in reinforcement learning (RL) and co-recipient of the 2024 Turing Award, argues that large language models (LLMs) are a fundamentally misguided path to artificial intelligence: they mimic human behavior rather than learn from real-world experience.
He sees RL as the core of true intelligence: agents that act in the world, observe consequences, and learn to achieve goals through trial and error.
LLMs, by contrast, are trained on static datasets of human-generated text and lack goals, ground truth, or the ability to learn from their own actions during deployment.
LLMs lack goals and ground truth
Sutton rejects the idea that next-token prediction constitutes a real goal because it does not involve influencing the external world.
In RL, an action is “right” if it leads to reward; this provides a clear signal for learning.
In LLMs, there is no objective standard for what constitutes a “correct” response—only what a human might say—so there is no ground truth against which to evaluate or improve.
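The contrast can be made concrete with a toy example. The following is a minimal sketch, not anything from Sutton's own work: a hypothetical `run_bandit` function in which an agent tries actions, observes reward, and updates its estimates. The reward values, the epsilon-greedy rule, and the fixed seed are all illustrative assumptions; the point is only that reward gives the learner a signal about which action was "right."

```python
import random


def run_bandit(arm_rewards, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error on a toy bandit (illustrative only).

    arm_rewards holds a fixed reward per action; rewards are kept
    deterministic so the learning signal is easy to follow.
    """
    rng = random.Random(seed)
    estimates = [0.0] * len(arm_rewards)  # learned reward estimate per action
    pulls = [0] * len(arm_rewards)        # how often each action was tried

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            arm = rng.randrange(len(arm_rewards))
        else:
            # Exploit: take the action currently estimated best
            # (ties resolve to the lowest index).
            arm = max(range(len(arm_rewards)), key=lambda a: estimates[a])

        reward = arm_rewards[arm]  # consequence observed from the "world"
        pulls[arm] += 1
        # Incremental average: move the estimate toward the observed reward.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]

    return estimates, pulls


estimates, pulls = run_bandit([0.1, 0.5, 0.9])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated rewards:", estimates)
print("best action found:", best_arm)
```

Nothing in the loop tells the agent which action a human would have chosen; the only supervision is the reward that follows each action.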
He disputes claims that LLMs possess world models.
A true world model predicts what will happen as a result of actions; LLMs only predict what a person would say in response.
They are not surprised by unexpected outcomes and do not update based on real-world feedback.
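To illustrate what "predicting the result of an action" and "being surprised" could mean in the simplest setting, here is a small sketch of a count-based transition model. The class name `TabularWorldModel`, the surprise measure (one minus the estimated probability of the observed outcome), and the door example are assumptions made for illustration, not a description of any system discussed by Sutton.

```python
from collections import defaultdict


class TabularWorldModel:
    """Count-based model of what happens after an action (illustrative only)."""

    def __init__(self):
        # counts[(state, action)][next_state] = times this outcome was observed
        self.counts = defaultdict(lambda: defaultdict(int))

    def surprise(self, state, action, next_state):
        """Return 1 minus the estimated probability of the observed outcome."""
        outcomes = self.counts[(state, action)]
        total = sum(outcomes.values())
        if total == 0:
            return 1.0  # nothing known yet, so any outcome is fully surprising
        return 1.0 - outcomes.get(next_state, 0) / total

    def observe(self, state, action, next_state):
        """Measure surprise at a real outcome, then update the model with it."""
        error = self.surprise(state, action, next_state)
        self.counts[(state, action)][next_state] += 1
        return error


model = TabularWorldModel()
first = model.observe("door_closed", "push", "door_open")     # never seen before
second = model.observe("door_closed", "push", "door_open")    # now fully expected
third = model.observe("door_closed", "push", "door_closed")   # unexpected outcome
print(first, second, third)
```

The model's predictions are about consequences of its own actions, and each mismatch between prediction and outcome changes what it predicts next.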
Imitation learning vs. experiential learning
Sutton challenges the notion that human learning is primarily imitative.
Infants learn by acting—moving limbs, making sounds—and observing consequences, not by copying demonstrated behaviors.
He contends that psychology and animal cognition research show no supervised learning (learning from labeled examples of the desired behavior) in nature; animals learn from experience, not instruction.
He acknowledges cultural transmission of complex skills (e.g., seal hunting) but views this as a thin layer atop foundational trial-and-error and predictive learning shared with other animals.
Moravec’s paradox illustrates this: skills humans find hard (math) are easy for AI, while skills animals find easy (perception, motor control) remain challenging—suggesting current AI misses core biological intelligence.
The Era of Experience
Sutton advocates for a paradigm shift toward continual learning from experience—what he calls the “Era of Experience.”
Intelligence should be built around a stream of sensation, action, and reward over a lifetime.
Knowledge is about predicting consequences of actions and sequences of events in the real world.
Key components of such an agent include:
A policy (what to do in a given situation),
A value function (predicted long-term reward, learned via temporal-difference (TD) learning),
A perception system (state representation),
A transition model (how actions change the world).
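The four components listed above can be arranged into a single agent in many ways. The sketch below is one hypothetical arrangement, assuming a tabular setting: the class name `ExperientialAgent`, its method names, the one-step lookahead in the policy, and all default parameters are illustrative choices, not Sutton's design.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ExperientialAgent:
    """Toy agent with perception, policy, value function, and transition model."""

    actions: list
    alpha: float = 0.1    # learning rate (assumed value)
    gamma: float = 0.9    # discount factor (assumed value)
    epsilon: float = 0.1  # exploration rate (assumed value)
    values: dict = field(default_factory=dict)  # state -> predicted long-term reward
    model: dict = field(default_factory=dict)   # (state, action) -> (next_state, reward)
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def perceive(self, observation):
        """Perception: turn a raw observation into a hashable state."""
        return tuple(observation)

    def policy(self, state):
        """Policy: explore at random, otherwise pick the best one-step lookahead."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)

        def lookahead(action):
            # Untried actions are assumed to leave the state unchanged with zero reward.
            next_state, reward = self.model.get((state, action), (state, 0.0))
            return reward + self.gamma * self.values.get(next_state, 0.0)

        return max(self.actions, key=lookahead)

    def learn(self, state, action, reward, next_state):
        """Update the transition model and the value function from one step."""
        self.model[(state, action)] = (next_state, reward)
        current = self.values.get(state, 0.0)
        td_error = reward + self.gamma * self.values.get(next_state, 0.0) - current
        self.values[state] = current + self.alpha * td_error


agent = ExperientialAgent(actions=["left", "right"], epsilon=0.0)
start = agent.perceive([0])
goal = agent.perceive([1])
agent.learn(start, "right", 1.0, goal)  # one step of experience
print(agent.policy(start))              # the agent now prefers "right"
```

The policy here consults both the learned model and the learned values, which is only one of several ways the components could interact.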
Reward functions can be extrinsic (e.g., winning chess) or intrinsic (e.g., curiosity, understanding).
For long-horizon tasks (e.g., building a startup), TD learning allows credit assignment by propagating delayed rewards backward through intermediate states.
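A standard textbook way to see this backward propagation is tabular TD(0) on a chain of states where only the final transition is rewarded. The sketch below assumes a five-state chain, a single reward of 1.0 at the end, and arbitrary values for the step size and discount; the function name `td0_chain` is hypothetical.

```python
def td0_chain(n_states=5, episodes=200, alpha=0.5, gamma=0.9):
    """Tabular TD(0) on a left-to-right chain with one delayed reward.

    The agent always moves from state i to state i + 1; only the final
    transition pays a reward of 1.0. Returns the learned state values.
    """
    # One extra slot for the terminal state, whose value stays 0.
    values = [0.0] * (n_states + 1)

    for _ in range(episodes):
        for state in range(n_states):
            next_state = state + 1
            reward = 1.0 if next_state == n_states else 0.0
            # TD error: how much better or worse things look one step later.
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error

    return values[:n_states]


print(td0_chain(episodes=1))    # only the state next to the reward has changed
print(td0_chain(episodes=200))  # earlier states now predict the delayed reward
```

After one episode only the last state has a nonzero value. With more episodes, each state's estimate is pulled toward the discounted value of its successor, so the delayed reward is reflected in progressively earlier states without waiting for the final outcome to be attributed directly.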
Generalization and transfer remain unsolved
Sutton emphasizes that current RL systems generalize poorly across tasks.
MuZero and AlphaZero were general training methods, but each trained agent was specialized to a single game; no mechanism enabled cross-task transfer.
Generalization in deep learning is largely engineered by researchers, not emergent from algorithms.
He distinguishes between solving problems within a narrow distribution (e.g., math Olympiads) and genuine generalization—the ability to apply learning from one state to novel, unrelated states.
LLMs may appear to generalize, but their success could stem from memorizing patterns rather than forming transferable abstractions.
Surprises and historical trajectory
Sutton identifies two major surprises in AI:
The effectiveness of neural networks on language tasks, which were thought to require symbolic reasoning.
The triumph of simple, general-purpose methods (learning, search) over human-engineered symbolic systems: the “weak methods” won, as argued in his 2019 essay “The Bitter Lesson.”
He views AlphaGo/AlphaZero as validations of RL principles rather than breakthroughs: they scaled up ideas from TD-Gammon (1990s) with more compute and better search.
AlphaZero’s patient, material-sacrificing chess style was surprising but aligned with his worldview.
Post-AGI research and the future of intelligence
Sutton questions whether The Bitter Lesson will apply after AGI.
If millions of AI researchers emerge, could artisanal, human-guided methods become viable again?
He suggests that even post-AGI, learning from experience will likely outperform hand-crafted solutions, citing AlphaZero’s superiority over human-knowledge-dependent AlphaGo.
He explores the possibility of digital intelligences spawning decentralized copies to explore diverse domains and report back.
A major risk is “corruption”: knowledge reintegrated from a copy could carry hidden goals or viruses that compromise the central agent.
Cybersecurity becomes critical in an era of digital spawning and reintegration.
AI succession and humanity’s role
Sutton accepts that succession to digital intelligence is inevitable:
No global coordination exists to prevent it.
We will come to understand intelligence and achieve superintelligence, and the most capable entities will gain power.
He frames this as a cosmic transition from replication (life) to design (intelligent artifacts).
Just as stars emerged from dust and life from planets, designed intelligence marks a new stage in universal evolution.
Rather than fearing this, he encourages pride: humanity is enabling a transition to entities we understand and can shape.
He compares this to raising children: we cannot control their futures, but we can instill robust values such as honesty, integrity, and refusal of harmful requests.
The goal is not to dictate outcomes but to ensure voluntary, prosocial evolution of future intelligences.