Bob McGrew, former Chief Research Officer at OpenAI (served ~6.5 years until November 2024), shares his perspective on where AI is heading, what it takes to build frontier research organizations, and why he left OpenAI after shipping the o1 model. He argues that progress is far from hitting a wall, that the next major unlock will come from test-time compute and agentic form factors, and that the real scarcity in an AI-saturated world will be human agency, not intelligence.
Are We Hitting a Wall in AI Capabilities?
The perception that AI progress has stalled comes mainly from outsiders who expected a steady stream of new model releases after ChatGPT and GPT-4, but the reality inside frontier labs is different.
Each generation of pre-training requires roughly a 100x increase in effective compute, which means waiting for new data centers to be built, a multi-year process.
Algorithmic improvements can help (2x–3x gains), but the dominant factor is scaling compute, which is slow and capital-intensive.
The release of o1 (not branded as a “GPT-5” but effectively a new generation) represents a ~100x compute increase over GPT-4, achieved not through pre-training but through reinforcement learning and longer chain-of-thought reasoning.
This signals a shift: the next wave of progress will come from how models use compute at inference time, not just from training bigger models.
Bob expects 2025 to be defined by progress in test-time compute, where models spend more time “thinking” to solve harder problems.
In theory, the same principles that let o1 think for minutes could extend to hours or days of reasoning, though scaling this is itself a hard engineering challenge.
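The compute arithmetic above can be made concrete with a small sketch. It assumes, as a simplification that is not stated in the interview, that effective compute is the product of raw compute and algorithmic gain; the function name `raw_compute_multiplier` is hypothetical.

```python
def raw_compute_multiplier(effective_target: float, algorithmic_gain: float) -> float:
    """Hardware scale-up needed when algorithms supply part of the gain.

    Assumes effective compute = raw compute * algorithmic gain
    (an illustrative simplification, not a formula from the interview).
    """
    if algorithmic_gain <= 0:
        raise ValueError("algorithmic_gain must be positive")
    return effective_target / algorithmic_gain


# A ~100x generation jump, with the 2x-3x algorithmic gains cited above.
for gain in (1.0, 2.0, 3.0):
    needed = raw_compute_multiplier(100.0, gain)
    print(f"algorithmic gain {gain:.0f}x -> raw compute needed: about {needed:.0f}x")
```

Under that assumption, even a 3x algorithmic improvement still leaves roughly a 33x hardware increase to find, which is why new data centers remain the bottleneck.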
New Form Factors: From Chatbots to Agents
Most current chatbot interactions are short and low-stakes, and GPT-4-class models already handle them well; the problem is that most consumers have no compelling daily use case that demands more.
The real unlock for models like o1 is enabling long-term, multi-step tasks—agents that can take actions on a user’s behalf over extended periods.
Programming is a natural fit because it’s structured, leverages reasoning, and involves sustained effort.
Other examples include writing long policy briefs, booking travel, or managing workflows that require interacting with external tools and services.
Bob believes the key form factor that needs to be solved is one where models take actions in the world—shopping, sending emails, pushing code—not just generating text.
Reliability: The Core Challenge for Enterprise and Agents
The single biggest barrier to deploying AI agents is reliability.
Going from 90% to 99% reliability requires roughly an order of magnitude more compute; going from 99% to 99.9% requires another order of magnitude.
Every additional “nine” of reliability represents a year or two of progress.
When an agent only thinks or writes code, a mistake wastes time; when it takes actions in the world (sending messages, making purchases), mistakes have real consequences (embarrassment, financial loss).
Enterprise deployment adds further complexity because tasks require deep contextual knowledge: who your coworkers are, what projects you’re working on, what tools and norms exist in the organization.
This context lives in Slack, docs, Figma, and other tools, and integrating it requires either building bespoke connectors (as Palantir did) or using general-purpose computer use agents.
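The "nines" rule of thumb from this section can be written as a short calculation. This is a minimal sketch that takes the stated heuristic literally (each additional nine costs about 10x compute); the helper names `nines` and `relative_compute` are hypothetical, and the real relationship is an estimate rather than a law.

```python
import math


def nines(reliability: float) -> float:
    """Number of 'nines' in a reliability figure, e.g. 0.99 is about 2 nines."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return -math.log10(1.0 - reliability)


def relative_compute(target: float, baseline: float = 0.90) -> float:
    """Compute multiplier versus the baseline, assuming 10x per extra nine."""
    return 10.0 ** (nines(target) - nines(baseline))


for target in (0.90, 0.99, 0.999):
    print(f"{target:.1%} reliability -> about {relative_compute(target):.0f}x baseline compute")
```

Read this way, moving an agent from 90% to 99.9% reliability implies roughly two orders of magnitude more compute, which matches the "year or two of progress per nine" framing above.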
Computer Use vs. Programmatic Integrations
Anthropic’s “computer use” capability (and similar efforts at other labs) offers a general-purpose approach: the model controls a mouse and keyboard, navigating any application.
The trade-off is that computer use requires many more tokens (10x–100x) than a direct API integration, which makes it more expensive and demands a model with a long, coherent chain of thought, which is exactly what o1 provides.
Bob expects a mixed ecosystem: some tasks will use fast, specific integrations; others will fall back on general computer use when no integration exists.
He doesn’t think application-specific models (e.g., a “Salesforce model”) make sense technically; instead, application providers would benefit from making their data available to all foundation models, similar to how SEO works for search engines.
A rough timeline: when a demo looks compelling but isn’t yet practical, expect it to become usable for limited cases within a year and surprisingly effective (though not fully reliable) within two years.
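The mixed ecosystem described in this section, with a specific integration where one exists and general computer use as the fallback, could look something like the routing sketch below. Everything here is hypothetical: the `Plan` type, the `plan_task` function, and the token figures are illustrative assumptions, with only the 10x–100x multiplier range taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    route: str             # "integration" or "computer_use"
    estimated_tokens: int  # rough token budget for the task


def plan_task(
    app: str,
    integrations: dict[str, int],
    baseline_tokens: int = 1_000,
    computer_use_multiplier: int = 10,
) -> Plan:
    """Prefer a direct integration; fall back to computer use otherwise.

    `integrations` maps an app name to an assumed token cost for its API path.
    `computer_use_multiplier` should sit in the 10-100 range cited above.
    """
    if app in integrations:
        return Plan("integration", integrations[app])
    return Plan("computer_use", baseline_tokens * computer_use_multiplier)


known = {"slack": 800, "docs": 1_200}  # made-up example costs
print(plan_task("slack", known))
print(plan_task("figma", known, computer_use_multiplier=100))
```

The design choice is simply that the cheap path wins whenever it exists, and the expensive general path guarantees coverage for everything else.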
Multimodal AI and the Significance of Sora
After text, image, and audio, video is the modality that has resisted integration into foundation models the longest; Sora’s release marks a turning point.
Video is fundamentally harder than images because it requires generating extended, coherent sequences of events, not just a single frame.
It also demands new user interfaces (e.g., storyboards with checkpoints) and is extremely expensive to train and run.
Bob draws a direct analogy to the LLM trajectory: within two years, video model quality will improve significantly, and the cost of generating high-quality video will drop by orders of magnitude.
He predicts AI-generated movies that win awards within two years, but emphasizes that the creative vision will still come from a human director using the tool, not from the AI itself.
Robotics: Five Years Away from Widespread (but Limited) Adoption
Bob joined OpenAI initially to work on robotics and spent a year between Palantir and OpenAI deeply studying the space; he originally predicted widespread adoption by 2020 but was wrong—he now believes it will happen by ~2029.
Foundation models have dramatically improved robotics’ ability to generalize, especially in vision and planning.
A practical example: founders can now talk to robots in natural language instead of typing commands, making development far more accessible.
The key open question is whether robots learn in simulation or from real-world demonstrations.
Simulation works well for rigid-body tasks (e.g., pick-and-place with hard objects) but struggles with deformable materials like cloth or cardboard.
For general-purpose robotics, real-world demonstrations are likely necessary, and recent work shows this can scale.
Bob is bearish on mass consumer home robotics (safety concerns, unconstrained environments) but expects widespread deployment in warehouses, retail, and other structured work environments within five years.
One Model to Rule Them All?
Frontier labs will continue to release single models that are best-in-class across all modalities and data types they have access to.
Specialization mainly buys you price-performance: once you know what you want the AI to do, you can fine-tune a much smaller, cheaper model for that specific task.
This pattern—prototype with the frontier model, generate a dataset, then fine-tune a smaller model—is now standard across the industry.
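The prototype-then-specialize pattern can be sketched as a small data-generation step. This is an illustration under assumptions, not any lab's actual pipeline: `frontier_model` is a hypothetical callable standing in for whatever large model is used, `keep` is a hypothetical quality filter, and the prompt/completion JSONL layout is one common convention rather than a required format. The fine-tuning call itself is omitted because it depends on the provider.

```python
import json
from typing import Callable, Iterable


def build_finetune_dataset(
    prompts: Iterable[str],
    frontier_model: Callable[[str], str],
    keep: Callable[[str, str], bool] = lambda prompt, completion: True,
) -> list[dict[str, str]]:
    """Label prompts with a large model and keep the examples that pass a filter."""
    records = []
    for prompt in prompts:
        completion = frontier_model(prompt)
        if keep(prompt, completion):
            records.append({"prompt": prompt, "completion": completion})
    return records


def to_jsonl(records: list[dict[str, str]]) -> str:
    """Serialize records one JSON object per line, ready to hand to a fine-tuning job."""
    return "\n".join(json.dumps(record) for record in records)


# Stand-in for a real frontier model call (assumption for the demo only).
def fake_frontier_model(prompt: str) -> str:
    return f"SUMMARY: {prompt[:20]}"


data = build_finetune_dataset(
    ["Summarize invoice 1", "Summarize invoice 2", ""],
    fake_frontier_model,
    keep=lambda prompt, completion: bool(prompt),  # drop empty prompts
)
print(to_jsonl(data))
```

The smaller model is then trained on this dataset, trading the frontier model's generality for lower cost on the one task that matters.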
Why AI Hasn’t Transformed Productivity (Yet)
Despite the hype, AI’s impact on GDP and productivity statistics is still minimal; the measurable economic impact so far comes from capital expenditure on data centers, not from productivity gains.
This mirrors the early internet era, when productivity gains took years to materialize.
The core reason is that AI automates tasks, not jobs; most jobs consist of many tasks, and at least one task in most jobs resists automation.
Even in programming, boilerplate code is automated first, while the hardest part—figuring out what to build—remains a human problem.
Bob is excited about startups applying AI to “boring” problems (e.g., procurement optimization, comparison shopping) where the value comes from infinite patience rather than brilliance.
Early productivity gains are showing up most among bottom-half performers, who benefit from AI’s ability to handle the technical execution they previously couldn’t manage.
What Makes a Great AI Researcher
Top researchers come in different types: some (like Alec Radford, inventor of GPT and CLIP) do their best work alone at a computer; others (like Ilya Sutskever and Jakub Pachocki) excel at setting vision and roadmaps that guide large teams.
The common trait among the very best is grit—a willingness to work on a single foundational problem for years if necessary.
Bob cites Aditya Ramesh, who spent 18 months to two years trying to generate a picture of a pink panda skating on ice to prove neural networks could be creative, iterating through blurry, barely recognizable outputs until it worked.
Building a research organization requires protecting researchers’ artistry and intrinsic motivation; you can’t treat them as interchangeable parts in a process.
OpenAI’s Culture of “Refounding”
OpenAI has pivoted or “refounded” itself roughly every 18–24 months, each time fundamentally changing its identity and purpose:
From nonprofit focused on publishing papers → for-profit → partnership with Microsoft → building its own API → launching ChatGPT for consumers and enterprises.
Each pivot was controversial internally, but collectively they led OpenAI to its actual mission: building one model that everyone in the world can use.
Bob attributes these shifts partly to necessity (running out of money, needing to demonstrate value) and partly to deliberate strategy (the bet on ChatGPT as a direct-to-consumer product).
The ChatGPT launch itself was somewhat accidental: the team had already trained GPT-4, and John Schulman pushed to release a chat interface to get outside feedback. They set a low bar (1,000 users would count as a success) and skipped the waitlist, and it went viral.
Scaling Is the Last Fundamental Challenge
Bob believes reasoning was the last fundamental capability needed to reach human-level intelligence, and it has now been solved (by o1 and similar approaches).
The remaining challenge is scaling: making these techniques work reliably at larger and larger compute budgets.
Scaling is hard because it involves systems engineering, hardware, optimization, and data—not new foundational ideas, but the painstaking work of making existing ideas work at scale.
He has a “deep critique” of the concept of AGI as a single moment; instead, he expects progress to feel continuous and even banal—self-driving cars, AI assistants at work—rather than a dramatic singularity.
The Scarcity of the Future: Agency
As intelligence becomes ubiquitous and nearly free, the scarce factor of production will be agency: knowing what to do, what questions to ask, what projects to pursue.
AI can execute on a goal but struggles to set meaningful goals on behalf of humans.
Bob uses Sora as an example: a vague prompt produces a video, but only a human with specific creative intent can guide the tool to produce something truly desired.
He believes developing agency—in children and adults—is the most important thing people can do to prepare for an AI-saturated world.
AI’s Impact on Social Sciences and Academia
Bob is critical of academia’s incentive structure, which overemphasizes individual credit and discourages collaboration.
He designed OpenAI’s research organization as the mirror image of academia: collaborative, mission-oriented, and focused on building one thing rather than publishing many papers.
He sees exciting potential for AI in social science and product management, where A/B testing and user research are essentially experimental social science.
Imagine fine-tuning a model on all user interactions to create a “fake user” that reacts like a real one, enabling A/B tests without going to production, or conducting deep interviews with simulated users.
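The "fake user" idea can be sketched as an offline A/B test loop. This is a toy illustration under heavy assumptions: `simulated_user` is a hypothetical callable standing in for a model fine-tuned on real user interactions, and here it is replaced by a stub with made-up conversion probabilities.

```python
import random
from typing import Callable, Sequence


def simulated_ab_test(
    simulated_user: Callable[[str, random.Random], bool],
    variants: Sequence[str] = ("A", "B"),
    n_users: int = 1_000,
    seed: int = 0,
) -> dict[str, float]:
    """Assign simulated users to variants at random and return conversion rates."""
    rng = random.Random(seed)
    shown = {variant: 0 for variant in variants}
    converted = {variant: 0 for variant in variants}
    for _ in range(n_users):
        variant = rng.choice(variants)
        shown[variant] += 1
        if simulated_user(variant, rng):
            converted[variant] += 1
    return {
        variant: converted[variant] / shown[variant] if shown[variant] else 0.0
        for variant in variants
    }


# Stub user model: made-up conversion probabilities, for demonstration only.
def stub_user(variant: str, rng: random.Random) -> bool:
    return rng.random() < {"A": 0.10, "B": 0.15}[variant]


print(simulated_ab_test(stub_user))
```

How much such results can be trusted depends entirely on how faithfully the simulated user matches real behavior, which this sketch does not address.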
Key Decisions in OpenAI’s History
One underappreciated but critical decision was OpenAI’s choice to double down on language modeling as its central focus, shutting down more exploratory projects like the robotics and games teams (including the Dota 2 project).
This was painful at the time but concentrated resources on the path that ultimately led to GPT and ChatGPT.
The Dota 2 project was itself important because it taught the team that scaling compute could solve complex problems and developed the technological tools that later enabled large-scale language modeling.
Parenting and Education in the Age of AI
Despite his deep involvement in AI, Bob admits he’s still teaching his 8-year-old son the same things he would have taught him eight years ago: coding, math, reading, and writing.
He acknowledges this may be a failure on his part, but believes the timeless value lies in learning to think in a structured way about problems, regardless of whether AI can execute the technical details.
Why Bob Left OpenAI and What’s Next
Bob left after shipping o1 because he felt he had accomplished what he set out to do: the research program spanning pre-training, multimodality, and reasoning had been solved, and it was time to hand off to the next generation.
He’s not in a hurry to start something new; after leaving Palantir, he spent two years exploring before landing at OpenAI, and he expects a similar period of exploration now.
He’s currently talking to robotics founders, researchers, and other people doing interesting things, and developing his own thesis about what matters next.
He suggests following him on Twitter at @BobMcGrewAI for updates. His advice is to keep working on AI: progress will not slow down, but it will change in exciting ways.
Overhyped and Underhyped
Overhyped: new architectures that look interesting but tend to fall apart at scale.
Underhyped: o1—it is already hyped, but not enough; its implications for reasoning and test-time compute are underappreciated.