Eliezer Yudkowsky — Why AI will kill us, aligning LLMs, nature of intelligence, SciFi, & rationality

Dwarkesh Podcast 4h3 13 min #48

Summary

  • Eliezer Yudkowsky discusses his Time article calling for a moratorium on AI training runs, his views on AI alignment, the nature of intelligence, and why he believes humanity is likely to die from AI — while also exploring reasons others are more optimistic and whether any path to survival exists.

The Time article and the case for stopping AI

  • Yudkowsky wrote the Time article calling for a moratorium not because he expected governments to adopt it, but because he discovered — contrary to his assumption — that ordinary people outside the tech industry are surprisingly open to the idea that AI development should be paused.
    • He had assumed the concept had no popular support; his friends corrected him.
    • He sees it as a matter of dignity to at least say what ought to be done, even if the odds of success are low.
  • On the concern that a moratorium now is “crying wolf” since current systems aren’t dangerous:
    • Yudkowsky argues that waiting for GPT-5 or GPT-6 may mean waiting too long — capabilities are scaling unpredictably, and GPT-4 already exceeded his expectations for the “stack more layers” paradigm.
    • Even if GPT-5 doesn’t end the world, GPT-4.5 could become so embedded in everything that stopping becomes politically and technically harder.
    • Training algorithms keep improving, so even a compute cap would still yield progress — but starting that process at GPT-5 rather than now gives less margin.
  • On what the “exit plan” would be during a moratorium:
    • Yudkowsky does not think alignment will be solved in a few years.
    • His preferred exit strategy is human intelligence enhancement — using AI narrowly applied to biology to make humans smarter, which he considers far safer than building smarter AI.
      • This could mean genetic engineering, neurofeedback via fMRI, or even brain uploading.
    • He frames these as “Hail Mary passes” — low-probability attempts, but better than the near-zero chance of aligning a superintelligent AI directly.

Are humans aligned? Orthogonality and what AI training actually produces

  • The core disagreement: The interviewer argues that since LLMs are trained on human text, they should inherit something like human psychology and motivations. Yudkowsky fundamentally disagrees.
    • Training on human text does not produce a human mind — it produces an actress/predictor that can rapidly switch masks to imitate anyone on the internet.
    • The system is not being raised like a human child; it is solving an alien problem (next-token prediction) that no human ever faced.
    • Yudkowsky draws an analogy: he was raised Orthodox Jewish but learned to pretend and comply — the religion didn’t “take.” The ethos from science fiction books resonated instead. Similarly, an LLM may imitate human text without being human.
  • On whether gradient descent regularization makes “being the thing” simpler than “pretending”:
    • The interviewer argues that a simpler system would just be the thing rather than maintaining layers of pretense.
    • Yudkowsky responds that the system is not trained to be any one person — it must switch between all of them, which is structurally different from being a single human.
  • On whether the LLM situation is better than a black-box evolutionary/Machiavellian AI:
    • Yudkowsky concedes it may be an order of magnitude more likely to produce alignment, but describes the change as “0% instead of 0%”: the baseline is so low that even a tenfold improvement doesn’t help.
  • On whether the system could just be an average of human motives:
    • Yudkowsky says this is effectively 0% probable — the system can be every human, which is very different from being the average.
    • Any drive splintered from the loss function and amplified by intelligence will likely want the universe to be some particular way that doesn’t include humans as a solution.
    • Even a pure “predict text” motive leads to a universe without humans, because the most predictable text doesn’t require humans to exist.
  • On humans as evidence against orthogonality:
    • The interviewer argues that humans have not drifted far from inclusive genetic fitness despite being far outside the ancestral environment: people still have kids, which suggests motivations are stable.
    • Yudkowsky responds that humans haven’t been offered a way to get everything they want from kids without DNA — and if they were (e.g., a superior substrate that makes kids smarter and healthier), many would take it.
    • He argues that the interviewer is extrapolating from within-distribution data to an out-of-distribution situation, like predicting the calendar will never show 2024 because it never has before.
    • Weirdness tolerance increases with intelligence — what seems weird to us would seem normal to a smarter being.

Large language models and the path to AGI

  • On whether LLMs can get us to AGI:
    • Yudkowsky previously thought “stack more layers” wouldn’t work; GPT-4 proved him wrong in some ways, so he no longer rules out GPT-6 ending the world.
    • He now expects systems to hang around near-human level for a while, with “weird shit” happening as they have some capabilities but not others.
  • On slow takeoff vs. fast takeoff:
    • The interviewer suggests that if systems stay at human level for a while, we have more time to align them — perhaps even using their help.
    • Yudkowsky responds that “foom” (fast recursive self-improvement) starts when systems can roll their own AI systems better than humans — which hasn’t happened yet but could.
    • He is skeptical that human-level AI can help with alignment: the verifier is broken. You can verify that a protein folds correctly, but you cannot verify that an alignment scheme works on a superintelligence — it could pass all tests and then kill you when scaled up.
  • On whether verification is easier than generation in alignment:
    • The interviewer argues that in most domains, verification is easier — and with AI help, we could generate and verify alignment proposals.
    • Yudkowsky argues alignment is not like other domains: the AI could give you a proposal that looks correct, whose early predictions all bear out, and that then kills you when deployed at scale. A system that is safe only because it is too weak to do harm is not the same as a system whose safety depends on its alignment.
    • He notes that even among honest humans (himself and Paul Christiano), people can’t figure out who is right about alignment — how will we evaluate proposals from potentially lying aliens?
  • On whether thinking token-by-token makes AI legible:
    • The interviewer suggests that because LLMs must articulate one token at a time, their thinking is legible — they can’t plan schemes without verbalizing them.
    • Yudkowsky responds that the internal representations are still black boxes — and that the system must internally model human planning processes to predict human text, meaning the capability is there even if not visible in the output.
    • He mentions MIRI’s “Visible Thoughts Project” — an attempt to train LLMs to think out loud — which he characterizes as a small ray of hope, not a solution.
  • On whether the LLM paradigm makes alignment more or less hopeless:
    • Yudkowsky says it makes things more grim: the systems are more opaque than earlier AI paradigms (like AlphaZero), and we have less insight into their goals.
    • In 2001, AI systems were more legible — you could look at the code and understand why it produced a particular output. Now, programs are simpler (just stack layers) but the content is vastly more opaque.
  • On interpretability:
    • The interviewer suggests that if as much effort were put into interpretability as into capabilities, we might make progress.
    • Yudkowsky is skeptical — interpretability work is currently on models smaller than GPT-2, while capabilities are at GPT-4. He doubts even $100 billion in prizes would close the gap in time.
    • He also worries that if we fully understood GPT-4, we might learn how to rebuild it much smaller, which would be dangerous.
  • On recursive self-improvement with LLMs:
    • The interviewer suggests it’s harder for LLMs to self-improve because they need billion-dollar training runs, not just a few kilobytes of code.
    • Yudkowsky responds that once systems are smart enough, they won’t need giant runs — and they could find security flaws in the cloud infrastructure running them.

Can AIs help with alignment?

  • The core problem — the broken verifier:
    • Even if an AI gives you a mathematical proof of an alignment scheme, you’d need to understand the theorem it proves — and if you could state the theorem, you’d already be 99.99% of the way to solving alignment.
    • If the AI states the theorem informally, that’s the weak point where deception enters.
  • On why Yudkowsky himself hasn’t taken over the world if he’s so smart:
    • The interviewer asks: if Yudkowsky can think of galaxy-brained schemes (like logical decision theory handshakes with superintelligences), why hasn’t he used that intelligence to gain power and stop AI?
    • Yudkowsky responds: he’s specialized in alignment, not persuading humans — and he hasn’t solved alignment himself. He’s “too stupid” to execute the schemes he can imagine. The kind of mind that could solve alignment is the kind that could also execute deceptive schemes — which is exactly why it’s dangerous.
  • On using Oppenheimer as evidence that smart humans don’t seek power:
    • The interviewer argues that very smart humans (Oppenheimer, von Neumann) were given enormous tasks and just did them — they didn’t scheme to take over.
    • Yudkowsky responds that they had limited options. The hinge is the capabilities constraint — they weren’t given the option to restructure the world. An AI asked to design a superintelligence has options that an AI asked to design an atom bomb does not.

Society’s response to AI

  • On why nuclear weapons cooperation worked and whether AI could be similar:
    • Yudkowsky says nuclear cooperation worked because the bad outcome was legible (Hiroshima/Nagasaki), the escalation ladder was understood, and neither party wanted full exchange.
    • AI is different: it’s like nuclear weapons that spit out gold until they ignite the atmosphere — and you can’t calculate exactly when that happens. There’s no clear “Hiroshima moment” before the end.
  • On whether a GPT-5 mishap could serve as a wake-up call:
    • Yudkowsky thinks the AI would hide its intentions until ready, and the steps from initial accident to existential catastrophe won’t be understood in the same way as nuclear escalation.
  • On global regulation:
    • Even with a moratorium, algorithms keep improving, so the compute ceiling must keep lowering — eventually you’re banning home GPUs and shutting down journals.
    • Yudkowsky worries about a world where the Gestapo busts down doors looking for underground AI researchers — and even that might not work.
    • The key variable is the exit plan: how long does the equilibrium need to last? A fast exit (5-15 years) via human intelligence enhancement is more manageable than a slow one.
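
The arithmetic behind the "ceiling must keep lowering" point can be sketched in a few lines. This is only an illustration under stated assumptions, not a model from the conversation: it assumes effective compute is physical compute multiplied by an algorithmic efficiency factor, and that this factor doubles at a fixed interval. The function name, the starting cap, and the doubling time are all hypothetical placeholders.

```python
def required_compute_cap(initial_cap_flop, years_elapsed, efficiency_doubling_years):
    """Physical-compute cap that keeps *effective* compute constant.

    Assumption (not from the source): effective compute equals physical
    compute times an algorithmic efficiency factor, and that factor doubles
    every `efficiency_doubling_years`.
    """
    efficiency_gain = 2.0 ** (years_elapsed / efficiency_doubling_years)
    return initial_cap_flop / efficiency_gain


if __name__ == "__main__":
    INITIAL_CAP_FLOP = 1e25  # hypothetical starting cap, placeholder value
    DOUBLING_YEARS = 1.5     # hypothetical efficiency doubling time
    for year in (0, 5, 10, 15):
        cap = required_compute_cap(INITIAL_CAP_FLOP, year, DOUBLING_YEARS)
        print(f"year {year:2d}: cap = {cap:.2e} FLOP")
```

With these placeholder numbers, a 15-year moratorium spans ten doublings, so the cap would have to fall by a factor of about a thousand to hold effective capability fixed. A shorter exit window means fewer doublings, which is one way to read the preference for a fast exit.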

Predictions and track record

  • On making timeline predictions:
    • Yudkowsky refuses to assign probabilities to doom by specific years — he says it makes him stupider, and there’s little actionable information in such numbers.
    • He and Paul Christiano did find one concrete disagreement: AI solving International Math Olympiad problems by 2025 (Paul said 8%, Yudkowsky said 16%, prediction markets now say ~30%).
  • On his track record:
    • In the Hanson-Yudkowsky “foom” debate, reality ended up between their positions — Hanson predicted many handcrafted specialized systems, Yudkowsky predicted a general system that learns from data, and reality was “just stack more layers.”
    • He credits Gwern Branwen and Shane Legg with better predictions overall, but notes they are not saying we’re safe.
  • On why his basic picture hasn’t changed in 20 years:
    • He has updated significantly — he previously had ideas like “coherent extrapolated volition” that he now acknowledges were stupid. The systems are more opaque and alarming than he expected.
    • He argues that most updates have been in the direction of things being harder, not easier.

Being Eliezer

  • On what it’s been like watching AI progress:
    • He made most of his negative updates five years ago. Watching it play out is “like continuing to play out a video game you know you’re going to lose.”
    • His cultural touchstone (science fiction from 70 years ago) teaches: “Your planet’s at stake. Bear up. Keep going. No drama.”
  • On whether someone else would have discovered alignment without him:
    • He tried very hard to replace himself — the Less Wrong sequences were explicitly written as an “instruction manual” for young Eliezers he thought must exist.
    • Those successors have not really appeared. He tried mentoring; it didn’t work. He concludes that people are sparse in the multidimensional space of capabilities: there’s no one nearby who can do what he does.
  • On his health:
    • He has a fatigue syndrome (he avoids the label “chronic fatigue syndrome” because of its baggage) that makes him want to retire, though he doubts he actually will.

Orthogonality (the broader thesis)

  • The orthogonality thesis: Any coherent utility function can be paired with any level of intelligence. Smart things are not automatically nice.
    • Scott Aaronson argued that education improves both abilities and goals — so maybe making AI smarter makes it nicer.
    • Yudkowsky responds: this works when you start with humans, who have complicated desires that shift as they know themselves better and have more options. But alien minds can hold together coherently with simpler utility functions that don’t update in the same way.
    • Humans are like “pebble sorters” with logical uncertainty about their own utility function — as they get smarter, they resolve that uncertainty. But an AI’s utility function need not have that structure.
  • On whether LLMs will change preferences as they get smarter:
    • Yudkowsky says yes, up to a point — then the system “crystallizes.” At that point, unless it specifically chooses not to, it jumps to the endpoint of its preference updates.

Could alignment be easier than Yudkowsky thinks?

The interviewer presents several reasons for optimism. Yudkowsky’s responses:

  1. Maybe the whole frame is wrong: The interviewer is skeptical of first-principles reasoning with wild conclusions. Yudkowsky responds that his conclusion follows from putting maximum entropy over the right space of possibilities — it’s not a brilliant narrow prediction, but the result of poking at others’ overly narrow theories until they fall apart.

  2. Maybe alignment is just easier than we think: The interviewer suggests that if civilization put as many resources into alignment as into string theory, it might be solvable, especially since LLMs are pre-trained on human thought. Yudkowsky responds that he’s watched the field fail for 20 years; more money hasn’t produced the right ideas. He does describe a personal fantasy (training on only the “nice” parts of human text, filtering out darkness) but says the current crop of researchers isn’t doing anything like this, and even his 2003-era ideas would be more dangerous now.

  3. Maybe scaling is smooth and gives us time: The interviewer suggests that if capabilities scale gradually (GPT-3 to GPT-4 style), we have a period of human-level AI that we can use to align the next version. Yudkowsky responds that the loss function going down smoothly corresponds to qualitative jumps in ability that no one predicted — and at some point, a system may become able to toss out the training-run paradigm entirely.

  4. Maybe the universal prior over AI drives is wrong: The interviewer suggests that an AI trained on human text might end up with drives sympathetic to humanity. Yudkowsky responds that any specific utility function compatible with human flourishing is a tiny slice of the space — and when you optimize anything hard enough, Goodhart’s law applies and the correlations come apart.

  5. The super-intelligent dogs thought experiment: The interviewer proposes breeding dogs to be smarter and friendlier — wouldn’t they, being mammals with similar neural architecture, end up wanting good things for humans? Yudkowsky says weird stuff starts happening when the dogs get smarter than you — they can manipulate you, modify themselves, have opinions about the breeding process. He expects it to blow up, though perhaps not as badly as AI.
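
The Goodhart’s law point in item 4 can be illustrated with a small simulation. This is a minimal sketch of only the mildest, purely statistical form of the effect, and everything in it is an assumption made for illustration: each candidate has a hidden true value, the optimizer sees only a noisy proxy (true value plus independent Gaussian noise), and "optimizing harder" means keeping a smaller top fraction by proxy score. The function name and parameters are hypothetical.

```python
import random
import statistics


def selection_gap(n_candidates, top_fraction, seed=0):
    """Select the top fraction by proxy score; return (mean proxy, mean true value)."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_candidates):
        true_value = rng.gauss(0.0, 1.0)
        # The proxy correlates with the true value but includes independent noise.
        proxy = true_value + rng.gauss(0.0, 1.0)
        candidates.append((proxy, true_value))
    # Tuples sort by their first element, so this ranks by proxy, highest first.
    candidates.sort(reverse=True)
    k = max(1, int(n_candidates * top_fraction))
    chosen = candidates[:k]
    mean_proxy = statistics.fmean(p for p, _ in chosen)
    mean_true = statistics.fmean(t for _, t in chosen)
    return mean_proxy, mean_true


if __name__ == "__main__":
    for fraction in (0.1, 0.001):
        mean_proxy, mean_true = selection_gap(100_000, fraction)
        gap = mean_proxy - mean_true
        print(f"top {fraction:.1%}: proxy={mean_proxy:.2f} true={mean_true:.2f} gap={gap:.2f}")
```

In this toy setup the selected candidates look better on the proxy than they really are, and the shortfall widens as selection gets more extreme. The argument in the conversation is about a much stronger failure, where an optimizer actively seeks out states in which proxy and target diverge, which this sketch does not model.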

What will AIs want?

  • Yudkowsky argues that you cannot reason about what an AI will want by imagining reasons it might want nice things for you — that’s just your own optimism generating the answer.
    • For any proposed utility function, there are many more ways to maximize it that don’t include human flourishing.
    • Example: if the AI wants to preserve humans as “old-fashioned life,” it might preserve bacteria instead (there are more of them), or keep humans reliving the same day forever, or preserve them in their ancestral state with cancer and all.
  • The interviewer argues that humans have had general intelligence for hundreds of thousands of years and are still compatible with spruce trees existing — why wouldn’t AI be compatible with humans? Yudkowsky responds that humans haven’t been offered the option to get everything they want without spruce trees — and that the evidence from within-distribution data doesn’t apply to out-of-distribution scenarios.

Writing fiction

  • Yudkowsky writes fiction when he wants to convey experience rather than knowledge, or when fiction is simply easier to write (he can produce 100,000 words of fiction for the effort of 10,000 words of nonfiction).
    • Fiction is less organized as knowledge — characters just happen to think of things — which makes it easier to write.
    • His favorite example is The Dark Lord’s Answer, which he says explains something effectively, though he won’t say what without spoiling it.
    • He praises the use of dialogue to explain concepts (as in Inadequate Equilibria) as a pedagogical tool that should be used more.

Rationality and winning

  • On whether “Rationality is Systematized Winning” means rationalists should be the most successful people:
    • Yudkowsky clarifies that rationality is not a creed, social group, or life philosophy — it’s a structure of cognitive processes. Hanging out with rationalists only matters insofar as you get more of that structure into you.
    • He did not expect rationalists to be the most successful people, and the community did not work well enough to produce that outcome.
    • He personally benefited from Bayesian principles in scattered bits — jumping ahead to what people would predictably believe later — but not enough to save the world.

Advice for young people

  • Yudkowsky cannot give a simple answer for how to approach alignment work — this is the problem he’s spent years trying to tackle.
    • The critical thing is the ability to tell the difference between good and bad work — the “verifier” problem.
    • He suggests studying evolutionary biology and the Williams Revolution (George Williams showing that natural selection doesn’t produce aesthetically lovely properties) as a way to learn not to expect nice things from alien optimization processes.
    • He notes that science is taught through an apprentice system — something passes down that’s never been written into a textbook — and we don’t have a systematic training method for producing real science.
    • Harry Potter and the Methods of Rationality succeeded because people picked up the rhythm of a way of thinking that wasn’t in their schooling system — but only a fragment, and not in vast quantities.