François Chollet: Why the biggest AI models can't solve simple puzzles (Dwarkesh Podcast)

François Chollet (creator of Keras, Google AI researcher) and Mike Knoop (Zapier co-founder) have launched the ARC Prize, a million-dollar competition to solve the ARC benchmark — a test designed to measure genuine intelligence rather than memorization. The prize exists because despite years of progress in large language models (LLMs), no AI system has come close to solving ARC, and the organizers believe new ideas — not just more scale — are needed to make progress toward AGI.

The ARC benchmark: an IQ test resistant to memorization

ARC (Abstraction and Reasoning Corpus) is a set of visual puzzles modeled on IQ tests: each puzzle shows a few input-output grid pairs as demonstrations, then asks the solver to produce the correct output for a new test input.
The puzzles require only core knowledge — basic concepts like objectness, counting, geometry, topology, and elementary physics — that any four- or five-year-old possesses.
What makes ARC distinctive is that every puzzle is novel. Even if you’ve memorized the entire internet, you won’t have seen these specific tasks before. You must reason your way through each one from scratch.
This design makes ARC resistant to the core mechanism of LLMs: interpolative memory. LLMs work by storing vast numbers of patterns and programs from training data and fetching the closest match at test time. ARC is specifically built to prevent that strategy from working.
Average humans score about 85% on ARC (based on Mechanical Turk workers). LLMs, by contrast, score near zero without special techniques, and even the best LLM-based approaches reach roughly 35%.
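To make the task format concrete, here is a minimal sketch in Python. The toy task, the mirror rule, and the helper names are invented for illustration. The "train"/"test" and "input"/"output" keys are an assumption about how tasks are laid out, not something stated in the discussion. A candidate solution is just a function from grid to grid, checked first against the demonstrations and then against the held-out test input.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # each cell is a small integer standing for a color

# Hypothetical toy task: the hidden rule is "mirror the grid left to right".
task: Dict[str, list] = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]},
    ],
    "test": [
        {"input": [[5, 0, 0, 0]], "output": [[0, 0, 0, 5]]},
    ],
}


def mirror_left_right(grid: Grid) -> Grid:
    """Candidate solution program: reverse every row."""
    return [list(reversed(row)) for row in grid]


def fits_demonstrations(program: Callable[[Grid], Grid], task: Dict[str, list]) -> bool:
    """True if the program reproduces every demonstration output exactly."""
    return all(program(pair["input"]) == pair["output"] for pair in task["train"])


def score_on_test(program: Callable[[Grid], Grid], task: Dict[str, list]) -> float:
    """Fraction of test inputs for which the program's output is exactly right."""
    tests = task["test"]
    correct = sum(program(pair["input"]) == pair["output"] for pair in tests)
    return correct / len(tests)


if __name__ == "__main__":
    print(fits_demonstrations(mirror_left_right, task))
    print(score_on_test(mirror_left_right, task))
```

The check is all-or-nothing per grid: a partially correct output earns no credit in this sketch, which is one reason a nearby memorized pattern does not help.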

Why LLMs struggle with ARC

LLMs are fundamentally parametric curves fitted to data distributions — large interpolative databases. Scaling them up increases the number of patterns and programs they can store, which improves performance on benchmarks that can be solved by fetching memorized solutions.
But ARC requires on-the-fly program synthesis: the ability to build a new solution program from scratch for each novel task, rather than retrieving a memorized one.
The best LLM approach to ARC (by Jack Cole, achieving ~35%) relies on two critical techniques, each of which exposes a limitation of LLMs:
- Pre-training on millions of generated ARC-like tasks, giving the model a large bank of relevant building blocks — far more exposure than any human ever gets.
- Test-time fine-tuning: for each test problem, the model is fine-tuned on the fly to adapt to that specific task. Without this, LLM performance drops to ~1-2%.
This test-time fine-tuning is effectively a form of shallow program synthesis — reassembling stored building blocks into a new configuration. But it remains shallow recombination from an enormous library, rather than deep search from first principles.
A telling example: LLMs can solve Caesar ciphers for common shift values (3 or 5) because those appear frequently in training data, but fail on arbitrary values like 9. This suggests they have memorized specific cases rather than learned the general algorithm.
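For contrast, the general algorithm is only a few lines long and treats every shift value the same way. The following is a minimal Python sketch. The function names are chosen here for illustration, and only ASCII letters are shifted.

```python
def caesar_shift(text: str, shift: int) -> str:
    """Shift every ASCII letter by `shift` positions, wrapping around the alphabet."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            base = ord("a")
        elif "A" <= ch <= "Z":
            base = ord("A")
        else:
            out.append(ch)  # spaces, digits and punctuation pass through unchanged
            continue
        out.append(chr(base + (ord(ch) - base + shift) % 26))
    return "".join(out)


def caesar_decode(ciphertext: str, shift: int) -> str:
    """Decoding is the same operation with the shift negated."""
    return caesar_shift(ciphertext, -shift)


if __name__ == "__main__":
    message = "Intelligence is what you use when you don't know what to do."
    for shift in (3, 5, 9):
        encoded = caesar_shift(message, shift)
        print(shift, encoded, caesar_decode(encoded, shift) == message)
```

Nothing in the code singles out 3 or 5. A system that had learned this procedure, rather than its most frequent instances, would handle a shift of 9 just as easily.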

Skill vs. intelligence: the central distinction

Chollet draws a sharp line between skill (the ability to perform tasks you’ve been trained on) and intelligence (the ability to adapt to tasks you’ve never encountered).
LLMs are becoming increasingly skillful — they can pass bar exams, write code, solve grade-school math — but this is because those benchmarks can be solved by memorizing a finite set of reasoning patterns and reapplying them.
Intelligence is what you use when you don’t know what to do — when you face novelty you weren’t prepared for, either by personal experience or evolutionary history.
The world is not a static distribution. It changes constantly. A system that relies purely on memorization cannot handle genuine novelty because the space of possible tasks is infinite and the future is unpredictable.
Many creatures (e.g., insects) navigate their environments using hardcoded behavioral programs encoded in their genes — no learning required. Humans evolved general intelligence precisely because our environment was too dynamic and unpredictable for hardcoded programs to suffice.
Chollet argues that scaling up LLMs increases their skill and scope of applicability without increasing their intelligence even one bit. They remain interpolative memories, not adaptive reasoners.

The interpolation debate

Some argue that creativity and reasoning are just interpolation in higher dimensions — that bigger models learn more complex manifolds that cover more of the task space.
Chollet agrees that interpolation can appear creative and that humans themselves rely heavily on pattern matching and memorized templates (System 1 thinking). But humans also have System 2 thinking: the ability to do explicit, discrete search and synthesize genuinely new programs.
The key evidence: if LLMs could truly synthesize novel programs, they would solve ARC, since the solution programs for ARC tasks are extremely simple (short Python scripts). The fact that they cannot — despite having far more knowledge than any human — shows they are limited to fetching and shallow recombination.
Grokking (where models transition from memorization to generalization with extended training) does show that deep learning can discover compressed, generalizable programs. But this is local generalization within a fixed data distribution — not the broad or extreme generalization needed for truly novel tasks.
LLMs compress knowledge into reusable building blocks (vector programs), and this compression does enable some degree of generalization. But because the substrate is a parametric curve, it is fundamentally limited to local generalization around the training distribution.

Do we need AGI to automate most jobs?

Chollet argues that memorization can automate almost any task drawn from a static distribution, and many jobs consist largely of repetitive, predictable tasks. LLMs are excellent automation tools and will generate enormous economic value.
However, automation is not intelligence. The moment you face change, novelty, or uncertainty — a new city for a self-driving car, an unprecedented problem for a programmer — pure memorization fails.
For programming specifically: LLMs are currently used as sophisticated Stack Overflow replacements, fetching code snippets for common actions. The actual engineering work — building mental models of novel problems, synthesizing solutions that don’t exist in any training corpus — remains beyond them.
Chollet predicts that in five years there will be more software engineers, not fewer, because programming inherently requires dealing with novelty that LLMs cannot handle.
That said, Chollet acknowledges LLMs are on a spectrum of generalization and that they clearly cover the early part of it. He is not denying they have any generalization at all — only that the mechanism (interpolation in a parametric curve) is fundamentally limited compared to what human intelligence achieves.

The path forward: deep learning + program synthesis

Chollet’s proposed paradigm is a hybrid system combining the strengths of two complementary approaches:
- Deep learning (differentiable parametric curves trained with gradient descent): excellent for System 1 tasks — pattern recognition, intuition, memorization. Computationally efficient but data-hungry and limited to local generalization.
- Discrete program search / program synthesis (combinatorial search over graphs of logical operators): excellent for System 2 tasks — planning, reasoning, synthesizing generalizable programs from few examples. Extremely data-efficient but computationally expensive due to combinatorial explosion.
The solution is to use deep learning to guide program search — providing intuition about the shape of the solution, suggesting next steps, and giving feedback on partial solutions. This is analogous to how humans prove theorems: guided by intuition (System 1) but doing explicit discrete search (System 2).
The outer structure would be a discrete program search system, with deep learning used to dramatically improve its efficiency — fixing the combinatorial explosion problem that makes pure program search impractical.
Such a system would fetch patterns and modules from a bank (some learned as differentiable curves, some algorithmic) and assemble them via intuition-guided search into generalizable models synthesized from very little data. This is what would solve ARC.
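The shape of such a system can be sketched in a few lines of Python, under heavy simplifying assumptions. The three grid primitives, the `mismatch` score, and the `guided_search` function are hypothetical stand-ins invented for this illustration, not Chollet's design. In particular, `mismatch` is a hand-written heuristic occupying the slot where a learned model would supply intuition. Unguided enumeration over k primitives up to depth d visits on the order of k^d candidates; the guide decides which partial programs to expand first.

```python
import heapq
from itertools import count
from typing import Dict, List, Optional, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]
Program = Tuple[str, ...]


def flip_lr(g: Grid) -> Grid:
    return [row[::-1] for row in g]


def flip_ud(g: Grid) -> Grid:
    return [row[:] for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


# A tiny, hypothetical bank of building blocks.
PRIMITIVES: Dict[str, object] = {
    "flip_lr": flip_lr,
    "flip_ud": flip_ud,
    "transpose": transpose,
}


def run(program: Program, grid: Grid) -> Grid:
    """Apply the named primitives in order."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def mismatch(program: Program, pairs: List[Pair]) -> int:
    """Stand-in for a learned guide: number of wrong cells across all demonstrations."""
    total = 0
    for inp, target in pairs:
        out = run(program, inp)
        if len(out) != len(target) or len(out[0]) != len(target[0]):
            total += len(target) * len(target[0])  # wrong shape: count every cell as wrong
        else:
            total += sum(a != b for ro, rt in zip(out, target) for a, b in zip(ro, rt))
    return total


def guided_search(pairs: List[Pair], max_depth: int = 4) -> Optional[Program]:
    """Best-first search over primitive sequences, expanding the lowest-mismatch program first."""
    tie = count()  # tie-breaker so the heap never has to compare programs
    frontier = [(mismatch((), pairs), next(tie), ())]
    while frontier:
        score, _, program = heapq.heappop(frontier)
        if score == 0:
            return program  # reproduces every demonstration exactly
        if len(program) == max_depth:
            continue
        for name in PRIMITIVES:
            child = program + (name,)
            heapq.heappush(frontier, (mismatch(child, pairs), next(tie), child))
    return None


if __name__ == "__main__":
    # Hidden rule for this toy task: rotate the grid 90 degrees clockwise.
    demos = [
        ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
        ([[1, 2, 3], [4, 5, 6]], [[4, 1], [5, 2], [6, 3]]),
    ]
    program = guided_search(demos)
    print(program)
    print(run(program, [[7, 8]]))
```

The outer loop is discrete search, and the only place a neural network would enter is the scoring function. That mirrors the division of labor described above: search supplies the generalization from two examples, and the guide supplies the efficiency.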

Why the prize exists now

Mike Knoop discovered Chollet’s work during COVID and was struck by how little progress had been made on ARC despite four years of rapid AI advancement. He was also surprised how few AI researchers knew about it.
Benchmarks that gain traction tend to be ones that are already fairly tractable — someone makes an initial breakthrough, others follow. ARC has resisted this dynamic because it requires genuinely new ideas, not just applying existing techniques at greater scale.
The prize is also a response to the closing of frontier research. Since GPT-4, major labs have stopped publishing technical details. OpenAI’s shift triggered both a publication blackout and a massive concentration of resources on LLMs, which Chollet calls an “off-ramp” on the path to AGI — sucking oxygen away from other research directions.
The original 2020 Kaggle competition for ARC ($20K prize) revealed no obvious shortcuts. GPT-3 scored zero on the public data. The new prize is a larger-scale test of whether the benchmark can be hacked or whether genuinely new approaches are needed.

Structure of the ARC Prize

Prize pool: over $1 million.
Grand prize: $500,000 for the first team to reach 85% on the private test set (the human average).
Progress prizes: $100,000 total — $50,000 for the best score on the Kaggle leaderboard, $50,000 for the best paper explaining the approach.
Rules: submissions must be open source and reproducible. No API calls. Must run on an NVIDIA Tesla P100 GPU with a 12-hour runtime limit. This forces efficiency and ensures progress is shared publicly.
Timeline: annual contests running through mid-November, with off-season periods for sharing knowledge and re-baselining the community.
Two test sets: a public test set (on GitHub, potentially leaked into training data) and a private test set of 100 tasks that determines the actual state of the art.
A private track will also allow submissions as VM images running on H100s for 24 hours, to test what larger compute budgets can achieve.
The organizers plan to release ARC 2.0 later this year, addressing known flaws (some redundancy in tasks, potential structural similarity to things found online), and will gate access to the new private test set via API to prevent training data contamination.

What would change Chollet’s mind about LLMs and AGI

If an LLM-type model solved ARC at 80%+ after being trained only on core knowledge (not millions of ARC-like tasks), that would be a significant milestone — though not necessarily AGI in itself.
What would truly change his mind: seeing a critical mass of cases where a model encounters something genuinely novel relative to its training data and adapts on the fly — synthesizing new skills efficiently from minimal examples.
He is skeptical this will happen within a year, but acknowledges it’s an empirical question. If ARC survives the prize, it strengthens the argument that new paradigms are needed. If it falls, it reveals something important about the limits of current benchmarks or the power of scale.
