Stanford AI Researcher on What’s Next in Research, Reaction to o1 and How AI will Change Simulation — Unsupervised Learning

Percy Liang, a Stanford professor and co-founder of Together AI, is one of the most influential figures in AI research today. In this episode, he shares his reaction to OpenAI’s o1 model, discusses the evolving role of academia in AI, explores the potential of generative agents and simulation, and offers deep insights on evaluation, interpretability, regulation, robotics, and AI in music. His central theme is that AI progress is not just about bigger models but about rethinking how models are used, evaluated, and integrated into broader systems.

Reaction to OpenAI’s o1 Model

From a product perspective, Percy found o1 slow and not particularly useful for most tasks he tried.
From a research perspective, he sees o1 as a signal of a broader shift toward test-time compute: the idea that AI systems can spend more time reasoning to solve harder, more ambitious tasks.
- This reframes how we think about language models: not just as fast prompt-response systems, but as agents that can work on tasks taking days, weeks, or months.
- It echoes the reinforcement learning era (e.g., AlphaGo), but now applied to general-purpose agents that take actions over long horizons and learn from feedback.
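One simple way to see why spending more compute at inference time can help is majority voting over repeated samples. The sketch below is only an illustration of that general idea, not a description of how o1 works: `noisy_solver` is a hypothetical stand-in for a single sampled model answer, and the 60% per-sample accuracy is an assumed number.

```python
import random
from collections import Counter

CORRECT = 42  # assumed ground-truth answer for the toy task

def noisy_solver(rng, p_correct=0.6):
    """Hypothetical stand-in for one sampled answer: right with probability p_correct."""
    if rng.random() < p_correct:
        return CORRECT
    return rng.randint(0, 9)  # a wrong answer drawn from a small pool

def majority_vote(rng, n_samples):
    """Spend more compute by sampling n answers and returning the most common one."""
    answers = [noisy_solver(rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

def accuracy(n_samples, trials=2000, seed=0):
    """Fraction of trials in which the voted answer is correct."""
    rng = random.Random(seed)
    hits = sum(majority_vote(rng, n_samples) == CORRECT for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    for n in (1, 5, 15):
        print(n, accuracy(n))
```

In this toy setup, accuracy rises as the number of samples per question grows, because wrong answers are scattered while correct ones agree.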
He tested o1 on Cybench, a Capture the Flag cybersecurity benchmark his team developed:
- These are extremely hard challenges; some took human teams over 24 hours to solve.
- Current models only solve challenges that the fastest human teams solved in about 11 minutes or less, which shows how large the gap remains.
- When dropped into existing agent frameworks, o1 didn’t improve overall scores because it ignored the scaffolding (reflection, planning templates) and just generated answers directly—highlighting a compatibility problem between new models and existing systems.
He cautions that raw benchmark scores don’t tell the full story: if a model doesn’t fit into a larger system’s architecture, improvements may not materialize.

The Shifting Paradigm of AI Systems

Much of the current AI application stack—prompt chaining, scaffolding, reasoning templates—was built around GPT-4-level models.
Percy argues this scaffolding is dispensable and will likely change significantly:
- o1 internalizes reasoning, making it invisible to users and developers.
- This creates a tension: when it works, it’s powerful (e.g., solving IMO problems), but when it fails, it’s hard to debug because there’s no trace or stack trace.
- This reduces customizability, which matters for novel applications or domains where training data is lacking.
He notes that even with closed models, we’re losing transparency: previously, with closed-weight models like GPT-4, at least the prompts were visible; now, even that is internalized.

The Role of Academia in AI Research

Many academics feel existential anxiety as frontier labs release increasingly powerful models with far more resources.
Percy’s advice: be orthogonal—choose research directions that are either enhanced by or irrelevant to new model releases.
- Examples:
  - Generative agents: better models make simulations more realistic and useful.
  - MLAgentBench: benchmarks for AI solving ML research tasks improve as models improve.
He sees three valuable roles for academia:
1. Open science: publishing knowledge that can be adopted by the broader community, even if it means reinventing what frontier labs may already know.
2. Transparency and benchmarking: academia is uniquely positioned to audit and evaluate models without commercial conflicts of interest.
3. Interdisciplinary collaboration: working with law schools and other departments on societal implications of AI.

AI Safety and Regulation

Percy argues that AI safety is often too narrowly focused on making individual models “safe” through alignment techniques like RLHF.
He advocates for a holistic, ecosystem-level view:
- Bad actors can circumvent safety measures by decomposing harmful tasks (e.g., writing a phishing email by first asking for research paper recommendations, then swapping links).
- Defense is underinvested: just as we have spam filters and fraud detectors for email, we need analogous tools for AI misuse.
- Trying to gate access to models is a losing battle as they become cheaper and more widespread.
On regulation:
- He supports transparency and disclosure as a first step—understanding risks and benefits before regulating.
- He distinguishes between upstream (foundation model developers) and downstream (end products) regulation, favoring the latter as more effective and less blunt.
- He compares model disclosures to nutrition labels—providing spec sheets so downstream developers can make informed decisions.

Generative Agents and Simulation

Percy’s Generative Agents project (with Joon Park and Michael Bernstein) created a Sims-like virtual world where AI agents powered by language models interacted with each other.
- Agents exhibited emergent social behaviors: information diffusion, persuasion, and coordination.
- The goal was believable simulation—useful for games and entertainment.
The next frontier is valid simulation—simulations that accurately reflect reality, enabling:
- Policy experimentation (e.g., simulating the effects of a mask mandate or new law).
- Social science studies with demographically diverse agent populations.
- Digital twins in medicine (e.g., simulating what would have happened to a patient under a different treatment).
  - Advantage: you can give the same agent both treatment and control by resetting its memory.
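The treatment-and-control idea can be sketched in a few lines. This is a toy under stated assumptions: `SimAgent`, its `compliance` field, and the keyword-based decision rule are all hypothetical, and in a real generative-agent system a language model would drive the agent's behavior. The point is only the mechanics of restoring the same starting state for both arms.

```python
import copy
import random
from dataclasses import dataclass, field

@dataclass
class SimAgent:
    """Hypothetical simulated person; a language model would replace wears_mask()."""
    name: str
    compliance: float  # assumed baseline tendency to follow guidance, in [0, 1]
    memory: list = field(default_factory=list)

    def observe(self, event):
        self.memory.append(event)

    def wears_mask(self, rng):
        # Toy rule: remembering a mandate raises the chance of compliance.
        boost = 0.3 if "mask mandate announced" in self.memory else 0.0
        return rng.random() < min(1.0, self.compliance + boost)

def run_arm(snapshot, treated, seed):
    """Run one arm of the experiment from an identical copy of the population."""
    agents = copy.deepcopy(snapshot)  # "reset memory": both arms start from the same state
    rng = random.Random(seed)
    if treated:
        for agent in agents:
            agent.observe("mask mandate announced")
    return sum(agent.wears_mask(rng) for agent in agents) / len(agents)

population = [SimAgent(f"agent{i}", compliance=0.2 + 0.05 * (i % 10)) for i in range(200)]
control = run_arm(population, treated=False, seed=1)
treatment = run_arm(population, treated=True, seed=1)
effect = treatment - control  # estimated effect of the mandate on mask wearing
```

Because each arm works on a deep copy, the original population is untouched, and the same individuals appear in both arms. That is something a study on real people cannot do.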
He distinguishes two types of agents:
1. Task-performing agents (like o1): solve difficult problems over long time horizons.
2. Simulation agents: mimic human behavior for study and experimentation.
Unlike traditional simulations (e.g., weather, disease spread), generative agent-based modeling can simulate complex, detailed interactions because language models can capture nuanced human decision-making.

The State of AI Evaluations

Evaluation is a moving target and a “huge mess”:
- The classic train/test paradigm is broken because we don’t know what’s in training data.
- Even if models don’t train on exact benchmarks, they may have seen similar data.
Percy is excited about using language models to benchmark language models:
- AutoBencher: leverages an asymmetry in which the model generating questions has information the test-taking model lacks, enabling more sensible evaluation.
He criticizes current evaluations for relying on superficial judgments (e.g., “B is better than A” for long texts).
- Advocates for rubric-based evaluation, inspired by how exams are graded—anchoring assessments in concrete criteria.
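A minimal sketch of what rubric-based grading might look like in code follows. The rubric, the question it targets, and the keyword checks are all hypothetical; in practice each criterion would be judged by a human or a model grader rather than a string match. What the sketch shows is that every point awarded traces back to a named criterion instead of a single overall preference.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    description: str
    points: int
    check: Callable[[str], bool]  # stand-in for a human or model judgment

# Hypothetical rubric for "Explain why train/test contamination matters."
RUBRIC = [
    Criterion("Mentions training data", 2, lambda a: "training data" in a.lower()),
    Criterion("Mentions benchmark or test set", 2,
              lambda a: "benchmark" in a.lower() or "test set" in a.lower()),
    Criterion("States the consequence (inflated scores)", 1,
              lambda a: "inflate" in a.lower()),
]

def grade(answer, rubric=RUBRIC):
    """Return (score, max_score, per-criterion breakdown)."""
    breakdown = [(c.description, c.points if c.check(answer) else 0) for c in rubric]
    score = sum(points for _, points in breakdown)
    max_score = sum(c.points for c in rubric)
    return score, max_score, breakdown

answer_a = "If the benchmark leaked into the training data, scores are inflated."
answer_b = "Contamination is bad for evaluation."
```

Here `grade(answer_a)` earns all 5 points while `grade(answer_b)` earns none, and the breakdown shows exactly which criteria each answer missed.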
Benchmarks are becoming vertical-specific:
- HELM (Holistic Evaluation of Language Models) has evolved into a framework with leaderboards for safety, multilingual, medical, finance, etc.
- Industries care about domain-specific performance, not general math or coding prowess.

Interpretability Challenges

Interpretability has gotten harder over time:
- In 2017, researchers had access to model weights and training data; now, even weights are often unavailable.
Two main schools of interpretability:
1. Mechanistic interpretability: understanding individual neurons—interesting for scientific understanding but limited for practical applications.
2. Attribution methods: e.g., influence functions—identifying which training examples most influenced a prediction.
  - Useful for debugging but hard to scale and raises privacy concerns (e.g., “you got diagnosed because of this Reddit thread”).
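To make the attribution idea concrete, the sketch below uses brute-force leave-one-out retraining on a tiny linear regression. This is not an influence-function implementation; influence functions approximate this kind of "what if this example were removed" question without retraining. The data and function names are made up for illustration.

```python
def fit_line(points):
    """Ordinary least squares for y = a*x + b on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x

def predict(model, x):
    a, b = model
    return a * x + b

def leave_one_out_influence(train, x_query):
    """Change in the prediction at x_query when each training point is removed."""
    base = predict(fit_line(train), x_query)
    scores = []
    for i in range(len(train)):
        reduced = train[:i] + train[i + 1:]  # retrain without example i
        scores.append(base - predict(fit_line(reduced), x_query))
    return scores

train = [(0, 0.0), (1, 1.0), (2, 2.0), (3, 3.0), (4, 10.0)]  # last point is an outlier
scores = leave_one_out_influence(train, x_query=5)
most_influential = max(range(len(train)), key=lambda i: abs(scores[i]))
```

The outlier at index 4 comes out as the most influential example for the prediction at x = 5. Retraining once per example is exactly the cost that makes this approach hard to scale to large models.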
Chain of Thought provides explanations, but these may not reflect the model’s actual reasoning—similar to how people rationalize decisions.
For interpretability to progress, we need access to weights and training data—a return to 2017-era openness.

Model Architectures: Transformers, Mamba, and Beyond

Historically, architectures like LSTMs, CNNs, and Transformers emerged from intuition, gradient analysis, and experimentation.
Mamba/state space models are notable because they were inspired by math—specifically, fitting polynomials to sequences online.
- The math doesn’t directly apply to neural networks, but it inspired a new architecture.
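For readers unfamiliar with state space models, the core computation is a linear recurrence over a fixed-size state. The sketch below is a scalar toy with assumed parameters `a`, `b`, and `c`; it is not Mamba, which among other things makes the parameters depend on the input, and it does not reproduce the polynomial-fitting derivation mentioned above.

```python
def ssm_scan(inputs, a=0.5, b=1.0, c=1.0):
    """Scalar discrete linear state space model: h_t = a*h_{t-1} + b*u_t, y_t = c*h_t."""
    h = 0.0
    outputs = []
    for u in inputs:
        h = a * h + b * u  # the fixed-size state summarizes the whole history so far
        outputs.append(c * h)
    return outputs

# Feeding a single impulse shows how past inputs decay inside the state.
impulse_response = ssm_scan([1.0, 0.0, 0.0, 0.0])
print(impulse_response)  # [1.0, 0.5, 0.25, 0.125]
```

Unlike attention, which revisits every earlier token, each step here touches only the current input and the state, so the cost per step stays constant as the sequence grows.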
Percy bets that Transformers won’t be the final architecture:
- They’re arbitrary and will likely be replaced, especially in domains like video or agentic settings where they don’t fit naturally.
- Architectural innovation is more likely in new problem domains (e.g., video, robotics) where existing approaches break down.

The Inference Market and Together AI

Percy sees inference as a low-level primitive—a building block that needs to be robust and cheap.
- Everything requires inference: training, agentic workflows, synthetic data generation.
Currently, the inference market is dominated by serving off-the-shelf models like Llama 3.
Over time, he expects a shift toward customizing models for specific use cases:
- Fine-tuning for particular applications can yield much faster and better performance than general-purpose models.
- Agentic workflows open new optimization opportunities (e.g., high-throughput generation of many possibilities).
Together AI’s stack: GPUs → inference → fine-tuning/customization.

Milestones and the Path Forward

Near-term progress can be tracked on benchmarks like Cybench and MLAgentBench, where performance is still well below human expert levels.
Meaningful long-term milestones:
- Solving open math problems.
- Discovering new research or scientific knowledge.
- Finding zero-day vulnerabilities in cybersecurity.
- Anything that extends human knowledge rather than mimicking it.
He doesn’t see a plateau in capabilities:
- Progress is still rapid, driven by both scaling and qualitative shifts (like o1’s test-time compute).
- Chips are getting more powerful and cheaper, continuing to drive advancement.

Robotics Foundation Models

Robotics is not at a ChatGPT moment—it’s closer to the “BERT era”:
- Vision-language models can be fine-tuned for robotics, but policies are still narrow and brittle.
- Unlike language, robotics lacks internet-scale data; collection is hard and expensive.
However, he’s optimistic:
- Interest and funding are increasing.
- Data collection efforts are expanding.
- Language and vision can handle high-level reasoning (e.g., identifying a cup), leaving robotics data to focus on manipulation and control.
- No fundamental obstacles—just a matter of time and effort.

AI in Music

Percy is a classically trained pianist and has thought deeply about AI in music.
Challenges in music:
- Copyright is a major hurdle.
- Control: unconditional or text-conditioned generation doesn’t give artists enough control.
His lab developed the Anticipatory Music Transformer, a generalized infilling model:
- Can condition on any subset of musical events (e.g., melody → generate harmony).
- Enables a co-pilot for musicians, similar to GitHub Copilot for coders.
Personal motivation: he has musical ideas he can’t execute due to limited practice time—AI could help realize his vision.
Classical music poses unique challenges: subtlety and limited data.

AI as Educators and Coaches

Percy is bullish on AI as teachers and coaches:
- Great at breaking down complex concepts for different audiences (e.g., explaining to a 5-year-old).
- Useful for practice and preparation (e.g., simulating a podcast interview or a date).

Quickfire Round

Overhyped and underhyped: Agents—they’ve been on both sides of the hype cycle.
ML agents contributing novel research:
- Already possible at a basic level (e.g., running ablation experiments).
- Within years, could meaningfully contribute to research directions, similar to how AI has transformed coding.
Underexplored application areas:
- Fundamental science and scientific discovery.
- Improving researcher productivity—less commercially driven but high-impact.
