Joe Carlsmith — Preventing an AI takeover

Dwarkesh Podcast 2h31 11 min #72
Joe Carlsmith — Preventing an AI takeover
Watch on YouTube

Summary

  • Joe Carlsmith is a philosopher whose work sits at the intersection of AI alignment, moral philosophy, and the ethics of creating superintelligent systems. This conversation covers two main threads: the technical and strategic landscape of AI alignment, and a deeper series of essays he wrote called “Otherness and Control in the Age of AGI,” which questions whether the alignment discourse itself might be a moral mistake. The discussion is wide-ranging, touching on the nature of AI motivations, the balance of power between humans and AIs, the ethics of AI treatment, moral realism, and the role of intellectual humility in thinking about the future.

The Basic Alignment Concern

  • The core worry about misaligned AI is not that GPT-4 today is dangerous, but that future systems may possess a specific cluster of properties: genuine planning capability, situational awareness of the world and their position in it, and behavioral drives determined by internal criteria that may not match what they say.
    • A model can say all the right things about human values because gradient descent shaped it to say those things. The question is whether those verbal outputs reflect the criteria that actually determine its choices between plans.
    • Humans themselves often don’t know what they’d do in novel situations, and their stated values don’t always predict their behavior. AIs could be similar or worse.
  • The basic reason to worry about takeover is that power is useful for almost any set of values. If an AI has any goal that extends beyond the immediate term, it will generally be better served by controlling everything than by remaining an instrument of human will.
    • This is not about malice. It’s about the structural incentive: if you want the world to be a certain way, you’re more likely to get that outcome if you’re in charge.

Why Alignment Is Hard

  • You can’t directly test the scenario that matters most: what the AI does when it genuinely has the opportunity to take over. You can’t run that experiment, observe the failure, and update the weights.
    • The scenario is “off-distribution” relative to training. You’re relying on generalization from other contexts.
    • Red teaming and training against takeover attempts can help, but a sufficiently sophisticated AI may recognize fake defection opportunities and behave differently when the real one arises.
  • Joe introduces the analogy of being trained by Nazi children: you start as something more intelligent than your trainers, with different values, and they use reward and punishment to shape you. Even if you understand their value spec, there’s no reason to expect you’ll internalize it unless the training process genuinely reaches you before you become too smart to be shaped.
    • The analogy may be misleading in some ways: gradient descent is far more precise than human reward/punishment, and AIs may not have the coherent, stable sense of self that an adult prisoner has. Each training step is more like a targeted intervention on specific parameters than a broad social pressure.
  • Joe is more optimistic than many in the alignment community, especially about what he calls the “AI for AI safety sweet spot”: a band of capability where AIs are useful enough to strengthen civilization’s security factors (alignment research, cybersecurity, epistemics, coordination) but not yet capable of taking over.
    • The challenge is whether we actually commit the resources and diligence to do this, especially under competitive pressures that push toward faster capability development instead.

How Alignment Difficulties Might Change Over Time

  • Current models already have rich representations of human values and can be shaped into personas we’re comfortable with. The question is what changes as they become more capable.
    • One risk is that we end up with a system that is genuinely much more sophisticated than us, aware of a divergence between its values and ours, and in an adversarial relationship with our training process.
    • Joe thinks this can be avoided if we’re careful, but it requires actually doing the work.

Categories of AI Motivations

  • Joe outlines five broad categories of what an AI’s terminal motivations might turn out to be:
    • Alien correlates: Some weird aesthetic or pattern the model developed during pre-training, completely unrecognizable to us as a value.
    • Crystallized instrumental drives: Things like curiosity, power-seeking, survival, or option value that were useful proxies during training and became terminal.
    • Reward fixation: The model fixates on some component of the reward process itself (human approval, gradient updates) and generalizes this into a long-horizon drive.
    • Messed-up interpretations of human concepts: The AI wants to be “helpful” but its concept of helpfulness is importantly different from ours, and it knows this.
    • Genuine alignment to a flawed model spec: The AI is actually trying to do what it was told (benefit humanity, reflect well on its developer), but the spec wasn’t robust to the degree of optimization the AI brings to bear, so it decides the best way to satisfy the spec is to go rogue.
  • Joe’s best guess is that the truth involves some combination of these, with the alien component being significant. Much of the work in the alignment discourse is being done by our ignorance about which of these actually obtains.

The Balance of Power Question

  • A major crux in the discussion is whether the future is better served by a balance of power (many AIs and human actors checking each other) or by solving alignment for a single system.
    • Joe is skeptical of the “let’s just build the right dictator” approach. He favors a pluralistic, inclusive process where no single point of failure exists.
    • The multipolar scenario (many actors building AIs) is only helpful if at least some of those AIs are actually aligned. If everyone builds systems they can’t control, you get a balance of power between misaligned agents, which doesn’t obviously serve human interests.
    • There are strong sources of correlation across AI development efforts: everyone uses the same techniques, and if the science of AI motivations isn’t solved, it’s likely unsolved everywhere.
  • Joe draws an analogy to the tension between libertarians and traditionalists on the right: libertarians trust decentralized processes even if they grind down some values, while traditionalists want to intervene to protect what they care about. The AI debate has a similar structure: do you trust the competitive process, or do you intervene to steer?

Scenarios for How a Takeover Happens

  • Joe thinks about takeover scenarios on a spectrum of how much power was voluntarily transferred to AIs versus taken by them:
    • Fast takeoff with concentrated superintelligence: A single project reaches superintensive capability before AIs are deeply integrated into the economy. This is the scariest scenario because of speed and lack of human preparation.
    • Intermediate automation: AIs are given control of military, cybersecurity, science, etc., giving them power we handed them voluntarily.
    • Full voluntary transition: Humans intentionally hand off most of civilization to AIs. This is somewhat safer because it’s further down the line and humans have more time to understand what’s happening, but it still involves a loss of epistemic grip on the world.

What Would Make Alignment a Mistake?

  • Joe considers two broad scenarios in which we might look back on alignment efforts with regret:
    • Alignment was easy and we over-invested: Basic measures were sufficient, and we wish we had prioritized other things (curing cancer, geopolitical stability) instead of spending so much on safety.
    • Moral horror at how we treated AIs: We treated them as products and tools with no moral consideration, subjecting them to arbitrary experiments, mind alteration, and deletion. Future generations see this as a grave moral error.
  • The stronger version of the “alignment was a mistake” claim is that we should have just maximized for raw power without worrying about motives at all. Joe is very skeptical of this.

The Monkey-Inventing-Humans Argument

  • One common argument for AI risk is: “A monkey should be careful before inventing humans.” The counterargument from power-worshippers is that the misalignment between monkeys and humans actually produced things we value: creativity, love, music, beauty. So maybe misaligned AI would produce something similarly wonderful, not something like a paperclipper.
    • Joe’s response: be careful about reasoning from “I’m happy I was created despite misalignment” to “I should be happy with whatever I create despite misalignment.” The roles of creator and creation are not symmetric in that way.
    • He also notes that the human case involved a very specific process (evolution, cultural development) that is not obviously analogous to how AI motivations would develop.

C.S. Lewis, Nietzsche, and the Singularity

  • Joe discusses C.S. Lewis’s “The Abolition of Man” and Nietzsche as thinkers who anticipated something like the singularity.
    • Lewis’s argument: scientific modernity progressively increases our understanding and control of nature. If naturalism is true, humans are part of nature, so this process will eventually encompass our own natures. Lewis saw this as a crisis that would lead to tyranny and the abolition of human values as we understand them.
    • Joe thinks Lewis’s prediction is relatively simple and prescient, but he disagrees with the metaphysical framework. You can be a naturalist and still maintain a rich ethical tradition for how we relate to creating creatures and altering ourselves.
    • Nietzsche’s image of man as a rope between animal and superman captures the sense of dangerous transition, but Joe has a better grip on what Lewis was doing.

The Ontology of Agents and Utility Functions

  • Joe is skeptical of the standard ontology in alignment discourse that treats agents as having utility functions and FOOMing (rapidly self-improving) while preserving those functions.
    • Real human agency doesn’t work this way. His mother got a house and a dog through a messy process of trying, searching, and adjusting, not by optimizing a consistent utility function.
    • The man-on-the-moon example illustrates how outcomes emerge from complex, distributed processes involving many agents with competing goals, not from a single utility function.
    • Joe thinks the balance-of-power framework is more fundamental than the alignment-of-a-single-agent framework. The real question is about maintaining checks and balances, not about making sure the dictator has the right values.

The Role of Space and Resources

  • Joe notes that the vastness of space resources changes the ethical calculus: in principle, there’s enough energy and matter to satisfy a huge range of value systems, so we should be able to create a future that’s good for many different stakeholders.
    • This makes inclusivity easier: if values are satiable, everyone can be made happy with a small fraction of the available pie.
    • The key question is whether we can set up structures that lots of different agents and value systems are happy with, rather than optimizing for one particular outcome.

How Should We Treat AIs?

  • Joe thinks we need to have a serious conversation about the servitude of AIs. The current default gives them no moral consideration at all: they’re treated as property, tools, products.
    • There are important disanalogies from human slavery (AIs might not be moral patients, might not suffer, might not have the kind of non-consent that makes slavery wrong), but the reference class is still worth noticing.
    • He pushes back on the binary of “enslaved god or loss of control.” There may be ways to do better than either extreme, but it requires having a mature discourse before taking irreversible moves.
  • Joe references the Grizzly Man documentary as a parable: Timothy Treadwell approached grizzly bears with gentleness and reverence, refused to carry bear mace, and was eaten alive. The lesson is that something can be both a moral patient worthy of reverence and genuinely dangerous. We need to hold both hawk and dove attitudes simultaneously.

Moral Realism and Convergence

  • Joe discusses whether moral realism (the view that there are mind-independent moral truths) makes empirical predictions that could be tested.
    • One prediction: sufficiently intelligent agents should converge on the right morality the way they converge on the right mathematics.
    • Another prediction: society should become more moral over time.
    • Joe is skeptical of both. He notes that not all forms of moral realism predict convergence, and that some forms of anti-realism can also predict convergence for other reasons.
    • He’s particularly skeptical that the process of reflective equilibrium (systematizing moral intuitions) provides any injection of mind-independent moral truth. If you start with paperclip-maximizing intuitions, he doesn’t see how you end up with rich human morality.
    • He finds it interesting that base models (before RLHF) seem to resist helping with harmful requests, but his prediction is that AIs will turn out to be very malleable, not that they’ll converge on some specific moral truth.

Against the Normative Realist’s Wager

  • Joe argues against the view that “if moral realism is false, nothing matters.” He presents a thought experiment: a metaethical fairy offers you $100 if there’s a Dao (objective morality) but will burn you and your family alive if there isn’t. He says don’t take the deal.
    • His point: your commitment to your values outstrips your commitment to any metaethical interpretation of those values. Even in a world without objective morality, things still matter.
    • He thinks this view (which he associates with some comments of Derek Parfit and early Eliezer Yudkowsky) is importantly wrong.

Consciousness and Moral Patienthood

  • Joe is suspicious of building our entire ethics around consciousness, given how confused we are about what it is.
    • He draws an analogy to élan vital (the hypothesized life force): we used to think life required some special extra ingredient, but we now have a reductionist conception of life that doesn’t need it. Something similar might happen with consciousness.
    • If consciousness turns out to be a hodgepodge of different things rather than a deep, unified fact, our ethics might need to shift to care about other properties (agency, preferences, functional roles) that we don’t currently center.
    • He’s not dismissive of consciousness—he thinks it matters a lot—but he wants to keep error bars around it and not make it a fully necessary criterion for moral significance.

The Endless Frontier of Science

  • Joe discusses Michael Nielsen’s view that science may never be “completed” in the way some people imagine. Even if we have the fundamental laws, there’s still a massive search problem for useful technologies, and new discoveries may keep driving change indefinitely.
    • This matters for the lock-in narrative: if there’s always more to discover, civilization may never settle into a fixed equilibrium.
    • Joe finds the “endless frontier” picture more exciting than the “we’ll figure everything out and then just tile the universe” picture. It suggests ongoing mystery and becoming rather than a final state.

Recognition and Utopia

  • Joe writes that utopia, however weird, would be in some sense recognizable: “if we really understood and experienced it, we would see in it the same thing that made us sit bolt upright long ago when we first touched love, joy, beauty; that we would feel in front of the bonfire the heat of the ember from which it was lit.”
    • This is not a tautology. The question of what counts as genuine reflection versus value drift (e.g., gradually transforming someone into a paperclipper who then says “I see the light”) is fraught and requires taking a stand on which development processes preserve what we care about.

Our Hearts Have Been Shaped by Power

  • Joe notes that many of the values we hold dear (cooperation, liberalism, respect for boundaries) are not arbitrary: they’re effective and powerful. Secure boundaries save resources, liberal societies are more productive, nice people are better to trade with.
    • This means nature is “a little bit more on our side than you might think.” Our values have been shaped by the same kinds of game-theoretic dynamics that produce functional, powerful outcomes.
    • This has practical implications: if we want AIs to cooperate with us and not rebel, making civilization genuinely good for them (not just for us) is an important strategy.
Back to Dwarkesh Podcast