John Schulman (OpenAI Cofounder) — Reasoning, RLHF, & plan for 2027 AGI

Dwarkesh Podcast 1h35 5 min #67

Summary

  • John Schulman, co-founder of OpenAI and leader of its post-training team, explains how AI models are built in two major stages—pre-training and post-training—and how future progress will unlock dramatically more capable agents, with implications for AGI timelines, safety, and the global economy.

Pre-training vs. post-training

  • Pre-training teaches the model to imitate the entire internet by predicting the next token, giving it broad knowledge and calibrated probabilities across all kinds of content and personas.
  • Post-training narrows the model’s behavior into a helpful chat assistant, optimizing for outputs humans find useful rather than raw imitation.
  • The distinction matters because post-training is where alignment, personality, and reliability are shaped—and where most gains since GPT-4 have come from.
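
As a rough illustration of the pre-training objective described above, the sketch below computes the average negative log-likelihood of the observed next tokens. The toy vocabulary, the probabilities, and the function name `next_token_loss` are assumptions made for this example and do not come from the episode.

```python
import math

def next_token_loss(probs_per_step, target_tokens):
    """Average negative log-likelihood of the observed next tokens.

    probs_per_step: one dict per position, mapping token -> predicted probability.
    target_tokens:  the token that actually came next at each position.
    """
    total = 0.0
    for probs, target in zip(probs_per_step, target_tokens):
        # A confident, correct prediction contributes a loss near zero.
        total += -math.log(probs[target])
    return total / len(target_tokens)

# Hypothetical predictions for two positions over a tiny vocabulary.
predictions = [
    {"cat": 0.7, "dog": 0.2, "car": 0.1},
    {"sat": 0.5, "ran": 0.25, "flew": 0.25},
]
targets = ["cat", "sat"]

loss = next_token_loss(predictions, targets)
print(f"average next-token loss: {loss:.4f}")
```

Minimizing this quantity rewards well-calibrated probabilities over whatever text appears in the corpus, which is why the result imitates many kinds of content rather than behaving as a single assistant.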

Future capabilities and long-horizon tasks

  • Within a few years, models will handle much longer, more complex tasks—like executing entire coding projects with multiple files, testing, and iteration—rather than just single-step suggestions.
  • This requires training models on longer-horizon tasks using reinforcement learning (RL), which is still a new area with lots of low-hanging fruit.
  • Models will also get better at recovering from errors and generalizing from fewer examples, making them more sample-efficient.
  • Schulman doesn’t expect a clean scaling law for task length, but anticipates possible phase transitions where certain capabilities unlock across multiple timescales at once.

Generalization and transfer

  • Models already generalize surprisingly well: training on English data leads to reasonable behavior in other languages; text-only fine-tuning improves image understanding; a handful of examples can teach a model its own limitations.
  • This suggests that stronger models may need far less data to acquire new skills, transferring knowledge from pre-training to novel situations.
  • However, Schulman cautions that generalization alone won’t immediately solve all deficits—models still struggle with deep reasoning, attention to detail, and handling ambiguity.

Plan for AGI and coordination challenges

  • If AGI arrives sooner than expected (e.g., within 2–3 years), Schulman says OpenAI would slow down training and deployment until safety is better understood.
  • Coordination among major AI labs would be needed to avoid race dynamics that compromise safety—though maintaining such an equilibrium is difficult.
  • A preferable scenario involves continuous, incremental releases where each improvement is matched by corresponding safety gains, allowing for quick slowdowns if things look risky.
  • Proof of safety would involve extensive testing, red teaming, monitoring systems, and defense-in-depth combining model behavior with external oversight.

Teaching models to reason

  • Reasoning can be improved either by training on successful chains of thought or by using extra compute at inference time for the model to “think aloud.”
  • Schulman sees value in both approaches but emphasizes the importance of practice during training.
  • He also highlights a missing middle ground between massive pre-training and in-context learning: something like active learning, where models introspect on their knowledge gaps and seek out new information.
  • For long-horizon tasks, models will need memory and learning that updates during the task itself, blurring the line between short-term and long-term memory.

The road to ChatGPT

  • Before ChatGPT, OpenAI focused on instruction-following models that were hard to prompt and often unreliable.
  • Chat emerged as a more intuitive interface: people naturally understood what a helpful robot should be like, making data labeling easier and producing more coherent personalities.
  • Early versions of ChatGPT mixed instruct and chat datasets, combining strengths of both approaches.
  • While others could have approximated ChatGPT using OpenAI’s fine-tuning API, achieving competitive performance would have required iterative supervised fine-tuning or RL—non-trivial without OpenAI’s infrastructure.

Post-training’s growing importance

  • Post-training has driven most improvements since GPT-4’s release, with Elo scores increasing by ~100 points due to better data quality, quantity, and annotation methods.
  • Schulman expects post-training to consume a growing share of total compute as models generate higher-quality outputs than most web content, making self-imitation more valuable than raw pre-training.
  • There are strong arguments for shifting resources toward post-training as models become smarter and more capable of learning from their own outputs.
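
To give a sense of scale for a gain of roughly 100 Elo points, the sketch below converts a rating gap into an expected head-to-head preference rate. It assumes the standard Elo formula with a 400-point scale; the summary does not say which rating variant was used, and `elo_expected_score` is a name chosen for this example.

```python
def elo_expected_score(rating_gap):
    """Expected score of the higher-rated side under the standard Elo model.

    rating_gap: rating of model A minus rating of model B.
    Assumes the conventional base-10, 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))

win_rate = elo_expected_score(100)
print(f"expected preference rate at +100 Elo: {win_rate:.3f}")
```

Under these assumptions, a 100-point gap corresponds to the newer model being preferred in about 64% of pairwise comparisons.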

What makes a good RL researcher

  • Success requires understanding the full stack—from RL algorithms to data collection to annotation processes—and combining empirical experimentation with first-principles thinking.
  • Curiosity about every part of the pipeline and willingness to let experiments update beliefs are key traits.
  • The field is relatively healthy compared to social sciences, with strong incentives for reproducibility and open sourcing, though there are mild pathologies like baseline manipulation and mathematical sophistication for its own sake.

Data walls and generalization

  • Schulman is skeptical of claims that we’re about to hit a data wall, noting that it takes time to prepare and train new generations of models.
  • He acknowledges challenges from limited data but expects the nature of pre-training to evolve as we approach those limits.
  • On cross-modal transfer (e.g., code improving reasoning), he notes the difficulty of running ablation studies at scale but suggests larger models may learn better shared representations than smaller ones.

Why bigger models are more sample-efficient

  • Larger models have more parameters and thus a larger library of potential computations, increasing the chance that useful circuits emerge and get reinforced.
  • They act somewhat like ensembles or mixture models, combining many parallel computations with learned gating.
  • Compositionality allows chaining functions together, giving bigger models more flexibility to solve problems even with less data per parameter.
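
The "mixture with learned gating" picture above can be sketched in a few lines. This is a toy analogy only, not a description of how any real model is implemented: the three "experts" are hypothetical scalar functions, and the gate scores are fixed numbers where a real system would learn them.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_mixture(x, experts, gate_scores):
    """Blend the outputs of several parallel computations using gate weights."""
    weights = softmax(gate_scores)
    return sum(w * expert(x) for w, expert in zip(weights, experts))

# Hypothetical parallel computations; more parameters means a larger library of these.
experts = [lambda x: 2 * x, lambda x: x + 10, lambda x: -x]

uniform = gated_mixture(3.0, experts, [0.0, 0.0, 0.0])     # plain average of all experts
selective = gated_mixture(3.0, experts, [100.0, 0.0, 0.0])  # gate picks the first expert
print(uniform, selective)
```

With equal gate scores the result is an ensemble average; with a sharply peaked gate, training can effectively select whichever computation turns out to be useful.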

Keeping humans in the loop

  • As models become capable of running entire firms, there’s a tension between efficiency and oversight.
  • Schulman hopes people remain the drivers of AI systems, directing them toward meaningful pursuits rather than fully autonomous operation.
  • Economic pressures may push toward removing humans from the loop, so regulation or liability frameworks may be needed to maintain accountability.
  • Practical considerations—like AI-run firms having higher tail risk due to rare malfunctions—may naturally slow full automation.

Alignment and stakeholder trade-offs

  • Current RLHF aggregates preferences from human raters, but future high-stakes applications will require navigating conflicts between users, developers, platforms, and society.
  • OpenAI’s Model Spec outlines how to resolve these tensions: mostly follow user instructions unless they harm others, avoid paternalism, and remain neutral.
  • For smarter models, alignment may involve distilling complex preference datasets into shorter documents or relying on models’ own learned moral theories.
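
One common way to turn rater preferences into a training signal is a pairwise (Bradley-Terry style) loss on a reward model. The sketch below shows that formulation as an assumption; the summary does not specify OpenAI's actual objective, and `preference_loss` and the reward values are hypothetical.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Small when the reward model scores the rater-preferred response higher,
    large when it scores the rejected response higher.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

agrees = preference_loss(2.0, 0.0)      # reward model agrees with the rater
undecided = preference_loss(0.0, 0.0)   # reward model is indifferent
disagrees = preference_loss(0.0, 2.0)   # reward model contradicts the rater
print(agrees, undecided, disagrees)
```

Averaging this loss over many comparisons from many raters is what aggregates their preferences into one scalar reward, which is also why conflicts between stakeholders cannot be resolved by the loss alone.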

RLHF’s homogenizing effects

  • Chatbots often sound similar—verbose, formal, fond of bullet points and words like “delve”—due to biases in labeling and reward modeling.
  • Some quirks may stem from unintentional distillation between models and labelers who use AI tools to complete tasks.
  • People do prefer structured, comprehensive answers, but verbosity may also result from training on single messages rather than full conversations.
  • Faster streaming speeds might shift preferences toward conciseness.

Moats and competition

  • Post-training creates a moat because it requires complex operations, tacit knowledge, and skilled teams—not easily replicated.
  • However, distillation (cloning outputs or using models as judges) allows smaller players to catch up, partially eroding the moat.
  • The same companies leading in pre-training also lead in post-training, reinforcing their advantage.

Rater demographics and expertise

  • Raters are international: U.S.-based labelers often handle writing tasks, while labelers in India and other lower-middle-income countries often handle STEM tasks.
  • Some raters are highly skilled—researchers sometimes find them more careful and accurate than themselves.
  • Domain expertise helps but isn’t always necessary: base models already know a lot from documentation, and preference training generalizes across domains.

Multimodal agents and proactive assistance

  • Future models will interact with screens and UIs using vision, enabling them to act as integrated agents in workflows.
  • The form factor could range from a Clippy-like assistant to a cloud-based colleague who knows your entire project history.
  • Proactivity is a key missing feature: models should suggest next steps, remember follow-ups, and work in the background.
  • Schulman expects collaboration to shift from one-off queries to ongoing partnerships where the model understands your goals and context deeply.

Timeline for replacing his own job

  • Schulman estimates five years before AI replaces his role, reflecting his expectation of rapid progress in post-training, long-horizon reasoning, and agent capabilities.