2 Robotics Pioneers Unpack the Path to Generalist Robots

Unsupervised Learning 1h10 6 min #46

Summary

  • Physical Intelligence (PI), a leading AI robotics startup co-founded by Karol Hausman (CEO) and researcher Danny Driess, is building generalist foundation models for robots: models that can perform a wide range of physical tasks across diverse environments and hardware. Having raised over $400 million and released influential models like π0 and π0.5, the company sits at the frontier of a shift from hand-coded robotics to learning-based, end-to-end systems. This conversation traces the evolution of AI robotics over the past decade, explains what works and what doesn’t today, and explores the path toward robots that generalize, perform reliably, and eventually operate in homes and workplaces at scale.

The Shift from Hand-Coded to Learning-Based Robotics

  • For most of robotics history, engineers tried to manually program robots to handle specific tasks—writing explicit rules for perception, planning, and control. This approach hit a wall because the real world is too complex and variable to capture in code.
  • Around 7–10 years ago, the field began shifting toward learning-based methods: robots learning from experience rather than being explicitly programmed. This was initially unsatisfying to many researchers because it reduced human interpretability and control.
  • The real breakthrough came when large-scale models—inspired by advances in language and vision—were applied to robotics. Instead of solving one task at a time, researchers started training single models on many tasks simultaneously, enabling broader capability and generalization.

Key Breakthroughs: From PaLM-E to RT-2

  • Early efforts like PaLM-E grounded large language models in the real world by pairing them with robot controllers that could execute actions. But these systems still lacked direct perception.
  • RT-2 marked a turning point: it integrated vision directly into a vision-language model (VLM) and trained it on both internet-scale data and robot demonstrations. Crucially, it showed that robots could generalize using knowledge from the internet—e.g., moving a Coke can to a picture of Taylor Swift, despite never having seen Taylor Swift or that exact task before.
  • This demonstrated that pre-trained VLMs provide a powerful backbone of world knowledge, and only a relatively small amount of robot-specific data is needed to connect that understanding to physical action. It also meant the field no longer needed to build an “internet of robot data” from scratch.

Where Robotics Foundation Models Stand Today

  • PI frames progress along three axes:
    • Capability: Can the model do complex, dexterous, long-horizon tasks? (Largely solved—π0 showed robots folding laundry, building boxes, etc.)
    • Generalization: Can it work in unseen environments? (Partially solved: π0.5 demonstrated robots performing tasks in entirely new homes they’ve never visited.)
    • Performance: Can it match human-level speed, accuracy, and reliability? (Still unsolved—current models are demo-ready but not deployment-ready.)
  • Generalization was a major milestone: π0.5 showed that with enough diversity in training environments (~100 homes), performance in a new, unseen home matches what you’d get if you had trained on that specific home. This suggests the world may be less diverse than assumed, at least for many household tasks.
  • Performance remains the biggest challenge. Failures are common, and improving robustness likely requires new algorithmic ideas, not just more data.

Hardware Is Not the Bottleneck—Intelligence Is

  • A common misconception is that robotics is limited by hardware. In fact, modern robots (including humanoids) are already capable of extraordinary feats when teleoperated by humans.
  • The real bottleneck is software: giving robots the intelligence to perceive, reason, and act autonomously in unstructured environments. If robots had human-level cognition, today’s hardware would suffice for many applications.
  • Hardware will eventually become a constraint, but only after intelligence is solved.

Comparing Robotics to Self-Driving Cars

  • Both are high-impact, hard problems—but they differ fundamentally:
    • Self-driving avoids physical contact; manipulation requires it, which makes control far harder.
    • Once a self-driving car knows which maneuver is safe, executing it is straightforward. For a robot, knowing what to do (e.g., fold a shirt) doesn’t make how to do it easy.
  • However, both face a “long tail” of rare, unpredictable events. Progress may follow a similar arc: slow and incremental for years, then suddenly “here”—as seen with Waymo in San Francisco.
  • Unlike LLMs, which surprised everyone with rapid emergence, robotics will likely have a longer road—but a similarly transformative arrival.

Data: The Core Challenge and Opportunity

  • Robotics data is fundamentally different from LLM data:
    • It’s multimodal (cameras, robot states, actions, language), time-series, and grows daily.
    • Even failed attempts are valuable—they reveal how objects deform, slip, or respond to force.
  • PI builds custom in-house data infrastructure because off-the-shelf tools can’t handle the scale, velocity, and iterative nature of robotics data (e.g., updating language annotations as understanding evolves).
  • Key unsolved problems include:
    • How to decide what data to collect next.
    • How to assess data quality and coverage at scale.
    • How to train human operators to collect high-quality demonstrations efficiently.
  • Task selection is driven by complexity and ease of data collection: PI focuses on hard, variable tasks (like laundry folding) that push the limits of current models and are easy to set up in homes.
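The properties listed above (multimodal, time-series, failures kept, annotations revised over time) can be pictured as a record type. This is only a minimal sketch: the class names, field names, and the idea of keeping an annotation history are assumptions made for illustration, not a description of PI's in-house infrastructure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One timestep of an episode. All field names are hypothetical."""
    t: float                        # seconds since the episode started
    camera_frames: Dict[str, str]   # camera name -> key of a stored image
    robot_state: List[float]        # e.g. joint positions
    action: List[float]             # command sent to the robot at this step


@dataclass
class Episode:
    """A time-ordered recording of one task attempt."""
    episode_id: str
    task: str
    success: bool                   # failed attempts are stored too
    steps: List[Step] = field(default_factory=list)
    # Every language label ever given to this episode, oldest first.
    annotations: List[str] = field(default_factory=list)

    def relabel(self, text: str) -> None:
        """Add a revised language annotation without losing earlier ones."""
        self.annotations.append(text)

    @property
    def instruction(self) -> str:
        """Latest annotation, falling back to the task name."""
        return self.annotations[-1] if self.annotations else self.task


episode = Episode("ep-0001", "fold laundry", success=False)
episode.steps.append(Step(0.0, {"wrist": "img/000.png"}, [0.1, 0.2], [0.0, 0.05]))
episode.relabel("fold the blue shirt in half")
print(episode.instruction)
```

Keeping the annotation history rather than overwriting it is one way to support the "updating language annotations as understanding evolves" workflow; a real system would likely version this in a database rather than in memory.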

Evaluation: One of the Hardest Problems in Robotics

  • Unlike LLMs, robotics has no “SWE-bench”: no standard benchmark that can be rerun under identical conditions. You can never reset a scene exactly the same way, so evaluations are inherently noisy.
  • PI uses relative comparisons (new model vs. baseline, tested simultaneously by the same operator) rather than absolute metrics to account for variance in lighting, robot wear, and human judgment.
  • As models improve, evaluation must scale across more tasks and environments, making it operationally heavy and slow.
  • Simulation and video models are promising but not yet reliable enough to replace real-world testing.
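The relative-comparison idea can be sketched as a paired analysis: each trial runs the new model and the baseline back to back under the same conditions, and only the trials where the two disagree carry information. The function name, the data layout, and the choice of an exact sign test are assumptions for illustration; the source does not say which statistic PI uses.

```python
from math import comb


def paired_comparison(trials):
    """Compare two policies from paired trials.

    trials: list of (new_success, baseline_success) booleans, one pair per
    scene setup, both runs done back to back by the same operator.
    """
    new_only = sum(1 for n, b in trials if n and not b)
    base_only = sum(1 for n, b in trials if b and not n)
    discordant = new_only + base_only
    if discordant == 0:
        p_value = 1.0  # the two policies never disagreed
    else:
        # Two-sided exact sign test on the discordant pairs.
        k = min(new_only, base_only)
        tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
        p_value = min(1.0, 2 * tail)
    return {"new_only": new_only, "base_only": base_only, "p_value": p_value}


# Hypothetical results: 6 setups where only the new model succeeded,
# 4 where both succeeded, 2 where both failed.
trials = [(True, False)] * 6 + [(True, True)] * 4 + [(False, False)] * 2
result = paired_comparison(trials)
print(result)
```

Because both models face the same lighting, robot wear, and operator on each pair, those sources of variance largely cancel, which is the point of comparing relatively rather than reporting absolute success rates.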

Simulation: Useful, But Not Yet a Data Source

  • Simulation works well for locomotion (where the challenge is modeling the robot’s own body) but poorly for manipulation (where the challenge is modeling everything the robot interacts with).
  • PI sees simulation’s near-term value in evaluation, not training. If sim can reliably predict real-world performance, it would accelerate iteration.
  • Simulation may become more useful as real-world data becomes more diverse—allowing models to treat sim as just another “reality.”
  • The team remains open-minded: if sim improves enough, they’ll adopt it fully.
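One way to make "sim can reliably predict real-world performance" concrete is to check whether simulation ranks policies in the same order as real-world tests do. The sketch below computes the fraction of policy pairs ordered the same way in both settings. The function name, the policy labels, and all success rates are made up for illustration.

```python
from itertools import combinations


def ranking_agreement(sim, real):
    """Fraction of policy pairs that sim and real-world tests order the same way.

    sim, real: dicts mapping policy name -> success rate in that setting.
    Pairs tied in either setting are skipped.
    """
    agree = 0
    total = 0
    for a, b in combinations(sorted(sim), 2):
        d_sim = sim[a] - sim[b]
        d_real = real[a] - real[b]
        if d_sim == 0 or d_real == 0:
            continue
        total += 1
        if (d_sim > 0) == (d_real > 0):
            agree += 1
    return agree / total if total else float("nan")


# Hypothetical success rates for four policy checkpoints.
sim_rates = {"A": 0.9, "B": 0.7, "C": 0.5, "D": 0.6}
real_rates = {"A": 0.8, "B": 0.6, "C": 0.5, "D": 0.4}
agreement = ranking_agreement(sim_rates, real_rates)
print(agreement)
```

An agreement near 1.0 would mean simulation is a usable proxy for picking the better checkpoint, even if its absolute success rates are off.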

Research Strategy: Taste, Speed, and Openness

  • PI prioritizes hiring researchers with strong “taste”—the ability to identify promising ideas, change their minds with evidence, and do whatever it takes (e.g., building data pipelines even if it’s not their usual role).
  • They avoid dogma, run focused bets with sufficient resources, and iterate quickly.
  • Open-sourcing models like π0 is a deliberate strategy: it invites community input, accelerates progress across the field, and aids recruitment. The biggest risk isn’t competition but scientific failure, and more minds on the problem increase the chance of success.
  • π0 has been used in surprising ways: on drones, surgical robots, and autonomous cars—demonstrating its broad utility.

Training Faster and Smarter: Knowledge Insulation

  • Fine-tuning VLMs on robot data often degrades their original capabilities (“catastrophic forgetting”), hurting generalization and slowing training.
  • PI’s solution, knowledge insulation, involves:
    • Continuing to train on web data during robotics fine-tuning.
    • Using tokenized actions (converted to text-like tokens) to adapt the VLM backbone without corrupting it.
    • Adding flow-matching “action experts” on top, with gradients blocked from flowing back into the backbone.
  • This approach sped up training by 10× and improved generalization—critical in a field where each training cycle can take weeks.
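The gradient-blocking step can be illustrated with a deliberately tiny model whose gradients are written out by hand. Everything here is an assumption made for illustration: the scalar "backbone" weight w, the token head u, the action expert v, and the squared-error losses stand in for a real VLM, its token loss, and a flow-matching head. The only point shown is where the gradient is cut.

```python
def gradients(w, u, v, x, y_token, y_action, insulate=True):
    """Hand-derived gradients for a toy two-headed model.

    feature:        h = w * x            (stand-in for the VLM backbone)
    token head:     u * h vs. y_token    (stand-in for the token loss)
    action expert:  v * h vs. y_action   (stand-in for the action loss)
    Both losses are squared errors. With insulate=True the action loss
    still trains v, but contributes nothing to the backbone weight w.
    """
    h = w * x
    err_tok = u * h - y_token
    err_act = v * h - y_action

    grad_u = 2 * err_tok * h
    grad_v = 2 * err_act * h
    grad_w_from_tokens = 2 * err_tok * u * x
    grad_w_from_actions = 2 * err_act * v * x

    # The stop-gradient: drop the action expert's contribution to w.
    grad_w = grad_w_from_tokens + (0.0 if insulate else grad_w_from_actions)
    return {"w": grad_w, "u": grad_u, "v": grad_v}


insulated = gradients(w=1.0, u=1.0, v=1.0, x=2.0, y_token=1.0, y_action=5.0)
coupled = gradients(w=1.0, u=1.0, v=1.0, x=2.0, y_token=1.0, y_action=5.0,
                    insulate=False)
print(insulated["w"], coupled["w"])
```

In the insulated case the backbone is shaped only by the token-style loss, while the action expert still learns from its own loss. In an autodiff framework the same cut is usually expressed with a stop-gradient or detach operation on the features passed to the action head.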

Solving Latency with Algorithmic Tricks

  • Large models have inference delays (hundreds of milliseconds), during which the world changes. This can cause inconsistencies between planned and actual actions.
  • Kevin Black at PI adapted image inpainting techniques from diffusion models: the system executes an action chunk while computing the next one, then “inpaints” the transition between them—smoothly fusing old and new plans.
  • This is a purely algorithmic fix—no retraining needed—and significantly improves real-time responsiveness.
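The timing problem and the idea of fusing two plans can be shown with a much simpler stand-in. The sketch below is not the inpainting method described above, which constrains the model's own generation of the next chunk. It is a post-hoc linear blend, with scalar actions and hypothetical parameter names, meant only to show which steps of the new chunk must stay consistent with the old one.

```python
def fuse_chunks(old_remaining, new_chunk, delay, blend):
    """Fuse the unexecuted tail of the old chunk with a freshly computed one.

    old_remaining: actions of the old chunk not yet executed when inference began
    new_chunk:     actions of the new chunk, aligned to the same start time
    delay:         steps executed while the new chunk was being computed;
                   these are frozen to the old plan
    blend:         steps over which the weight shifts from old plan to new plan
    """
    fused = []
    for i, new_action in enumerate(new_chunk):
        if i >= len(old_remaining):
            fused.append(new_action)  # old plan has run out
            continue
        if i < delay:
            w_old = 1.0               # already committed during inference
        elif i < delay + blend:
            w_old = 1.0 - (i - delay + 1) / (blend + 1)
        else:
            w_old = 0.0               # fully on the new plan
        fused.append(w_old * old_remaining[i] + (1.0 - w_old) * new_action)
    return fused


# Hypothetical scalar actions: the old plan holds 0.0, the new plan wants 1.0.
fused = fuse_chunks([0.0] * 6, [1.0] * 8, delay=2, blend=3)
print(fused)
```

The first `delay` steps cannot change because the robot executes them while the model is still computing, so any fusion scheme has to treat them as fixed and reconcile the rest.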

The Future: Generalist Base Models with Light Post-Training

  • The goal is to build increasingly capable base models that require minimal post-training for new tasks—similar to how modern LLMs work out of the box.
  • Eventually, users might fine-tune with just a few demonstrations or natural language guidance (“be more careful with the glasses”).
  • PI already sees early signs: some tasks that once required complex post-training now work zero-shot from the base model.
  • In 10 years, the vision is “vibing intelligence” into hardware: creating a physical device and simply prompting it to life, like in cartoons.

Predictions and Implications

  • When will robots be in homes? Karol says he initially underestimated progress and now believes robots doing useful tasks in homes could arrive within 5 years. Danny agrees the timeline has shifted from “never” to “5–10 years,” possibly sooner.
  • Overhyped: The current craze around humanoid robots. Generalist models can work across form factors—humanoids are just one option. The real value is in the model, not the body.
  • Underhyped: Generalist robotics models and the visceral impact of physical AI (e.g., Waymo rides feel more transformative than chatbots).
  • Unexpected implication: Just as “vibe coding” empowered non-technical people to build software, “vibe hardware” could let anyone create intelligent physical devices. And if robots handle chores, humans may finally gain time for what matters most—like spending time with family.