The 20-year journey to fully autonomous cars with Dmitri Dolgov of Waymo

Stripe's Cheeky Pint · 1h 25 min · #9

Summary

  • Dmitri Dolgov, Co-CEO of Waymo, traces the 20-year journey from Google’s earliest self-driving research to a service now completing nearly 500,000 fully autonomous rides per week across 10 US cities, with international expansion into London and Tokyo underway. He explains the technical architecture, the AI breakthroughs that made it possible, the operational complexity behind the scenes, and why full autonomy is a qualitatively different problem from driver-assist systems.

Background: From Russia to Google

  • Dolgov grew up in the Soviet Union, moved with his family to Japan and then the US, then returned to Russia in 1994 for his bachelor’s and master’s in physics and applied math before coming back to the US for graduate school in computer science.
  • He joined Google’s self-driving car project in 2009 as one of its first engineers and rose through the ranks until taking over as Co-CEO in 2021.

How Waymo Works: The Technical Architecture

  • Sensors: Three complementary modalities — cameras, LiDAR (lasers), and radar — provide 360-degree coverage. Each has different physical properties: LiDAR gives high-resolution 3D mapping, radar penetrates adverse weather (fog, snow, heavy rain), and cameras excel in good lighting but degrade in darkness or glare.
  • On-car inference: All real-time processing happens locally on the car, not in the cloud. Sensor data goes into encoders, then a decoder (the generative AI component) figures out how to drive, and the system actuates the vehicle through a specialized interface.
  • Cloud tasks: Non-real-time tasks like detecting lost items or checking if the car is dirty happen off-board using models in the cloud.
  • Foundation model approach: Waymo starts with a large off-board foundation model that understands the physical world and driving, then specializes it into three “teachers”:
    • The Waymo Driver: distilled into the on-car model
    • The Simulator: powers synthetic generative environments for training and evaluation
    • The Critic: evaluates what constitutes good vs. bad driving behavior
  • All three teachers share the same foundation, which is why improvements at the foundation level ripple across the entire system.
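
The summary does not say how the Waymo Driver teacher is distilled into the on-car model. As a generic illustration of distillation only, the sketch below computes a common soft-target loss: the KL divergence between a teacher's and a student's temperature-softened distributions. The three-action set, the logits, and the temperature are assumptions made up for this example, not details from the episode.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical discrete actions: [keep_lane, slow_down, change_lane]
teacher = [2.0, 0.5, -1.0]
good_student = [1.9, 0.6, -1.1]   # nearly agrees with the teacher
poor_student = [-1.0, 0.5, 2.0]   # prefers the opposite action

print(distillation_loss(teacher, good_student))  # close to zero
print(distillation_loss(teacher, poor_student))  # noticeably larger
```

Minimizing a loss of this kind pushes a small student model toward the behavior of a larger teacher, which is the general idea behind running a compact model on the car while keeping the large model off-board.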

End-to-End vs. Augmented Architecture

  • A pure end-to-end system (pixels in, trajectories out) can drive reasonably well in normal cases — Waymo demonstrated this with a paper called EMMA that fine-tuned a vision-language model to output trajectories instead of text.
  • However, pure end-to-end is orders of magnitude short of the safety bar required for full autonomy and makes simulation, training, and evaluation extremely inefficient.
  • Waymo’s approach augments end-to-end learning with structured intermediate representations — objects, roads, signs, speed limits — which provide additional knobs for simulation, safety validation layers, and reward function specification.
  • This hybrid architecture is what enables the system to handle the long tail of edge cases at scale.
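
To make the idea of a safety validation layer concrete, here is a minimal sketch, assuming the intermediate representation is just a list of detected objects with positions and rough sizes. It checks a proposed trajectory against a minimum clearance. The `DetectedObject` fields, the `required_gap` threshold, and the geometry are hypothetical and far simpler than anything a production system would use.

```python
from dataclasses import dataclass
import math

@dataclass
class DetectedObject:
    kind: str      # hypothetical label, e.g. "pedestrian" or "vehicle"
    x: float       # position in metres, in the vehicle's frame
    y: float
    radius: float  # rough footprint radius in metres

def min_clearance(trajectory, objects):
    """Smallest gap in metres between any trajectory point and any object edge."""
    gaps = [
        math.hypot(px - obj.x, py - obj.y) - obj.radius
        for (px, py) in trajectory
        for obj in objects
    ]
    return min(gaps) if gaps else math.inf

def passes_safety_check(trajectory, objects, required_gap=1.0):
    """Accept a trajectory only if it keeps the required gap from every object."""
    return min_clearance(trajectory, objects) >= required_gap

objects = [DetectedObject("pedestrian", x=10.0, y=0.5, radius=0.4)]

# Two candidate trajectories sampled every 2 m along the road.
straight = [(float(x), 0.0) for x in range(0, 21, 2)]
swerve = [(float(x), -2.0 if 6 <= x <= 14 else 0.0) for x in range(0, 21, 2)]

print(passes_safety_check(straight, objects))  # False: passes too close
print(passes_safety_check(swerve, objects))    # True: keeps the required gap
```

A check like this is only possible because the objects exist as explicit data. A system that maps pixels directly to trajectories has no such handle to validate against.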

Why It Took 20 Years

  • Dolgov rejects the framing of “going down wrong cul-de-sacs” in favor of iterative learning and evolution.
  • Key enabling breakthroughs were necessary: transformers (which power both LLMs and Waymo’s models), compute scaling, and the ability to apply these architectures to the physical world.
  • He says you could not have built a successful Waymo in 2015 — the technology simply wasn’t there yet.
  • The jump from Generation 4 (many small ML models) to Generation 5 (AI as the backbone) was the critical discontinuous leap, enabled by training on data collected across many US states and cities.

Driving Nuance: What Waymo Optimizes For

  • Safety is the primary focus, but the system also optimizes for smoothness, predictability, and social compatibility with other road users.
  • Drop-offs are surprisingly nuanced — understanding where to stop without blocking driveways, when double-parking is acceptable, and making the experience frictionless for riders.
  • Freeways are well-structured but have a long tail of severe-consequence events (debris, vehicles spinning out, unsecured loads), and at freeway speeds the severity of any mistake grows roughly with the square of speed.
  • Dolgov says the core technology is now good enough that no fundamental gaps remain; the current phase is specialization, validation, and global scaling.
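
One way to read the point about freeway speed and severity is through kinetic energy, which grows with the square of speed. The summary does not spell out Dolgov's exact reasoning, so treat this as an illustrative interpretation. The vehicle mass and the two speeds below are assumed values.

```python
MPH_TO_M_PER_S = 0.44704  # exact conversion factor

def kinetic_energy_joules(mass_kg, speed_m_per_s):
    """Classical kinetic energy: one half times mass times speed squared."""
    return 0.5 * mass_kg * speed_m_per_s ** 2

mass_kg = 2000.0  # assumed vehicle mass
city = kinetic_energy_joules(mass_kg, 30 * MPH_TO_M_PER_S)
freeway = kinetic_energy_joules(mass_kg, 60 * MPH_TO_M_PER_S)

# Doubling the speed gives about four times the energy to dissipate.
print(freeway / city)
```

The ratio does not depend on the assumed mass, only on the ratio of the two speeds.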

Generalization Across Cities and Conditions

  • Waymo thinks in terms of an operating domain (weather, road types, density) rather than city-by-city deployment.
  • The core technology generalizes well — moving to London or Tokyo requires data collection and specialization but not rebuilding from scratch.
  • Cold weather is the hardest generalization challenge because it affects the entire stack: hardware needs heating and cleaning elements, and motion control on slippery surfaces requires specialized work.
  • The integration with vision-language models (VLMs) has enabled strong zero-shot and few-shot learning for new environments.

Hardware Evolution

  • Generation 5 (current): Jaguar I-PACE retrofit with Waymo’s sensor stack.
  • Generation 6 (launching this year): Custom-designed vehicle on the Ojai platform — spacious interior, sliding doors, flat floor, designed around the passenger rather than the driver. The sensor suite is simpler, more capable, and much lower cost (comparable to a high-end ADAS system).
  • Cost reductions come from mature camera supply chains, automotive radars whose cost fell from aircraft-grade levels to tens of dollars, and LiDAR following a predictable cost-decline curve.
  • The same Waymo Driver software generalizes across vehicle platforms, including a planned deployment on the Hyundai IONIQ 5.

Emergent Behavior

  • Dolgov describes a moment where a Waymo vehicle detected a pedestrian behind a bus — not through the bus, but because peripheral LiDAR reflections bounced under the bus and captured the movement of the person’s feet. The AI correctly predicted the pedestrian’s behavior and responded appropriately.
  • This kind of performance is extremely difficult to achieve with a purely end-to-end imitative system; it requires the intermediate representations and sensor fusion that Waymo’s architecture provides.

Scaling and Operations

  • Scale today: ~3,000 cars on the road, ~500,000 rides per week, over 4 million fully autonomous miles per week, operating in 11 US cities (10 with public riders, Nashville being the newest).
  • Depot operations: Cars automatically return to depots for charging or cleaning. Fleet management systems flag issues (e.g., a mess left in the car), and human workers handle cleaning and plug-in charging. Automated or inductive charging is a future possibility.
  • Rider behavior: Generally very good — Dolgov attributes this partly to the psychological effect of being in a clean, private space that feels like your own. It varies by context (e.g., a college town on a Saturday night).
  • Expansion trajectory: It took about 8 years from the first fully autonomous commercial service (Chandler, Arizona, 2020) to launching in four new cities in a single day.
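
The scale figures above imply some rough per-vehicle averages. The sketch below only does the arithmetic on the rounded numbers quoted in this summary (about 3,000 cars, about 500,000 rides per week, over 4 million miles per week), so the results are ballpark estimates. Fully autonomous miles presumably include driving without a passenger, so miles per ride is not the same as average trip length.

```python
# Rounded figures quoted in the summary.
cars = 3_000
rides_per_week = 500_000
miles_per_week = 4_000_000

rides_per_car_per_day = rides_per_week / cars / 7
miles_per_car_per_day = miles_per_week / cars / 7
miles_per_ride = miles_per_week / rides_per_week  # includes empty repositioning miles

print(round(rides_per_car_per_day, 1))  # roughly 24 rides per car per day
print(round(miles_per_car_per_day, 1))  # roughly 190 miles per car per day
print(miles_per_ride)                   # 8.0 autonomous miles per ride
```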

Full Autonomy vs. Driver-Assist Systems

  • Dolgov sees these as fundamentally different problems, not points on a single spectrum. The hardest parts of building a rider-only autonomous system are qualitatively different from driver-assist.
  • He does not believe you can incrementally work up from driver-assist (Level 2/3) to full autonomy (Level 4/5) — it requires tackling the full problem directly.
  • Personally owned autonomous vehicles are a requested product direction but not something he would commit to on a specific timeline.

Second-Order Effects of Autonomous Driving

  • Traffic flow: Smooth, predictable driving would eliminate standing waves caused by abrupt braking — the “slow is smooth, smooth is fast” principle.
  • Urban design: Parking lots and garages consume enormous amounts of land because cars sit idle 90% of the time. Widespread autonomy could free up this space for other uses, fundamentally reshaping cities.
  • Parking minimums currently dictate urban layout (e.g., a coffee shop unable to add outdoor seating because it would remove parking spots).

Google’s Role

  • Dolgov credits Larry Page, Sergey Brin, and Alphabet’s leadership for having the vision and stamina to sustain a 20+ year project.
  • He attributes Google’s success in AI broadly to a culture of not accepting the status quo, investing in fundamental technology early (transformers, quantum computing), and attracting technical talent capable of going the distance.