Two weeks after OpenAI released a new suite of agent-building tools, Nikunj Handa (Product Lead) and Steve from the OpenAI API platform team joined the Unsupervised Learning podcast to discuss how developers and enterprises should think about building agentic systems. The conversation covers OpenAI’s long-term vision for agents, the new Responses API and Agents SDK, computer use models, reinforcement fine-tuning, the evolving tools ecosystem, and practical advice for companies preparing for an agentic future.
OpenAI’s Vision for Consumer Interaction
Right now, agentic experiences are concentrated in first-party surfaces like ChatGPT, Deep Research, and Operator, but the real shift will come when these capabilities are embedded into the products people already use every day.
Computer use models will automate form-filling, clicking, and research tasks inside browsers and work applications.
Operator-like behavior will appear in everyday workflows rather than requiring users to go to a dedicated AI product.
The API platform’s role is to disperse these capabilities across the web, letting developers embed agents into vertical-specific products far more diverse than OpenAI could build itself.
How Agents Access and Communicate on the Web
In 2024, agents typically did a single turn: decide whether to search the web, retrieve information, and synthesize a response.
In 2025, products like Deep Research represent a shift toward multi-turn, chain-of-thought tool calling, where the model retrieves information, reconsiders its stance, opens pages in parallel, and backtracks when it detects it’s going down the wrong path.
The next evolution is agents communicating with other agents seamlessly, where an endpoint just returns useful information without the calling agent needing to know whether it’s talking to a human, a traditional API, or another AI agent.
This will blur the line between internet data and private data/agent calls within the reasoning process.
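The shift from a single retrieval turn to multi-turn tool calling can be sketched as a loop in which the model repeatedly picks the next action until it decides it has an answer. This is a minimal offline illustration, not OpenAI's implementation: `scripted_model`, `search`, `open_page`, and `run_agent` are hypothetical names, and the "model" follows a fixed script so the loop runs without any API.

```python
from typing import Callable

# Hypothetical tools: stand-ins for web search and page fetching.
def search(query: str) -> str:
    return f"results for {query!r}"

def open_page(url: str) -> str:
    return f"contents of {url}"

TOOLS: dict[str, Callable[[str], str]] = {"search": search, "open_page": open_page}

def scripted_model(history: list[str]) -> tuple[str, str]:
    """Stand-in for a reasoning model: returns (action, argument).

    A real model would choose from the history; this stub follows a fixed
    script indexed by how many tool results it has seen so far.
    """
    script = [
        ("search", "charging network expansion"),
        ("open_page", "https://example.com/report"),
        ("final", "summary based on 2 tool calls"),
    ]
    return script[min(len(history), len(script) - 1)]

def run_agent(model, max_turns: int = 5) -> tuple[str, list[str]]:
    """Call tools until the model emits a final answer or the budget runs out."""
    history: list[str] = []
    for _ in range(max_turns):
        action, arg = model(history)
        if action == "final":
            return arg, history
        history.append(TOOLS[action](arg))
    return "stopped: turn budget exhausted", history

answer, trace = run_agent(scripted_model)
```

A 2024-style single-turn agent corresponds to `max_turns=1`; the turn budget is what keeps a longer-running loop from spinning forever when the model never settles on an answer.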
Building Multi-Agent Systems for Business
Developers are already building multi-agent architectures to solve complex business problems, such as customer support automation with separate agents for refunds, billing, shipping, FAQ retrieval, and human escalation.
OpenAI released the Agents SDK to make it easier to build these multi-agent systems.
Companies should start by building AI agents internally to solve real problems today, and expose them to the public internet when it becomes clear that external agents need to interact with them.
This public exposure layer is expected to emerge naturally in the coming months.
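The customer support example above can be sketched as a triage step that hands each message to a narrowly scoped specialist, with human escalation as the fallback. This is plain Python and deliberately does not use the Agents SDK: the `Agent` dataclass, the keyword lists, and `make_handler` are hypothetical, and keyword matching stands in for the model-driven routing a real system would use.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    keywords: tuple[str, ...]
    handle: Callable[[str], str]

def make_handler(name: str) -> Callable[[str], str]:
    # A real specialist would call a model with its own prompt and tools.
    return lambda message: f"[{name}] handling: {message}"

# Specialists are checked in order; the first keyword hit wins.
SPECIALISTS = [
    Agent("refunds", ("refund", "money back"), make_handler("refunds")),
    Agent("billing", ("invoice", "charge", "billing"), make_handler("billing")),
    Agent("shipping", ("shipping", "delivery", "tracking"), make_handler("shipping")),
    Agent("faq", ("how do i", "what is"), make_handler("faq")),
]
HUMAN = Agent("human_escalation", (), make_handler("human_escalation"))

def triage(message: str) -> Agent:
    """Pick the first specialist whose keywords appear; otherwise escalate."""
    text = message.lower()
    for agent in SPECIALISTS:
        if any(keyword in text for keyword in agent.keywords):
            return agent
    return HUMAN

def route(message: str) -> str:
    return triage(message).handle(message)
```

Because each specialist owns one narrow concern, a change to the refunds logic cannot alter how shipping questions are answered.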
Where Agents Work Today and Where They’re Headed
In 2024, most agentic products used deterministic workflows with fewer than 10–15 tools, carefully orchestrated step by step.
In 2025, models are smart enough to figure out tool-calling paths dynamically within their reasoning process, moving away from rigid workflow design.
The next major unlock is removing the tool count constraint entirely, exposing agents to hundreds of tools and letting the model figure out which ones to call.
Today’s models still struggle with this, but reinforcement fine-tuning (RFT) and better model generations are expected to close the gap.
Increasing agent runtime from minutes to hours or days will also unlock more powerful results, analogous to how a human worker can spend a full day on a task using whatever tools are needed.
Reinforcement Fine-Tuning and Domain Specialization
Reinforcement fine-tuning lets developers create tasks and graders that steer a model’s chain of thought toward domain-specific reasoning, effectively training it to think like a legal scholar or medical doctor.
OpenAI provides flexible graders that let developers compare model outputs against ground truth (e.g., a medical textbook) or execute code to verify mathematical correctness, going beyond simple string matching.
The biggest unsolved problem is productizing task and grader creation so that non-experts can build high-quality evals for their specific domains.
Building good tasks and graders remains extremely challenging and iterative, even for OpenAI’s own teams building products like Operator and Deep Research.
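To make the "beyond simple string matching" point concrete, here is a small sketch that contrasts an exact-match grader with one that checks mathematical equivalence. The function names and the sample tasks are hypothetical and this is not OpenAI's grader interface; parsing answers as rational numbers stands in for the code-execution style of check described above.

```python
from fractions import Fraction

def exact_match_grader(output: str, reference: str) -> float:
    """Score 1.0 only when the two strings are identical after trimming."""
    return 1.0 if output.strip() == reference.strip() else 0.0

def numeric_grader(output: str, reference: str) -> float:
    """Score 1.0 when both strings denote the same rational number."""
    try:
        same = Fraction(output.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answers earn no credit
    return 1.0 if same else 0.0

def mean_score(grader, pairs) -> float:
    """Average a grader over (model_output, reference) pairs."""
    return sum(grader(output, ref) for output, ref in pairs) / len(pairs)

# Hypothetical task set: two correct answers written differently, one wrong.
tasks = [("0.5", "1/2"), ("2/4", "1/2"), ("0.3", "1/3")]
```

On these tasks the exact-match grader gives no credit at all, while the numeric grader credits the two equivalent answers. That gap is why grader design matters so much for the training signal.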
Computer Use Models
Computer use was initially expected to be most valuable for legacy applications without APIs, and it is being used in medical domains for highly manual, multi-application workflows.
Surprisingly, it’s also being used for tasks that do have APIs but are poorly suited to structured data extraction, such as using Google Maps Street View to verify whether a climate tech company has expanded its charging network.
These vision-plus-text tasks, where information doesn’t map cleanly to JSON, are especially well-suited for computer use.
Platform plays around computer use are emerging, such as Browserbase and Scrapybara (a YC startup), which provide hosted virtual machines optimized for computer use models.
During alpha testing, people tried computer use in unexpected environments (iPhone screenshots, Android), suggesting demand for specialized VM providers (e.g., iOS VMs for AI testing).
Computer use models are still at an early stage (described as “GPT-1 or GPT-2” of the paradigm) but are expected to improve rapidly.
The Responses API and Agents SDK
The Responses API is designed as the foundation for multi-turn agentic interactions, supporting multiple model turns and multiple tool turns within a single request.
It combines the best of the Assistants API (tool use, multiple outputs) with the simplicity of the Chat Completions API (easy to get started, no forced context storage).
It follows an “APIs as ladders” philosophy: simple out of the box (four-line quickstart), with progressively more knobs available (chunk size, metadata filtering, re-ranker customization, etc.).
The Agents SDK introduces multi-agent orchestration as a first-class pattern, splitting tasks across specialized agents (analogous to single-processor vs. multi-processor computing), which improves per-task efficacy and makes debugging easier by isolating blast radius.
MCP (Model Context Protocol) and the Responses API are complementary: MCP handles how tools are brought to models, while the Responses API handles multi-turn interaction patterns.
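The "APIs as ladders" idea can be sketched as a request builder whose simplest call needs only a prompt, with optional knobs layered on top. `build_request` and `send` are hypothetical helpers. The field names (`file_search`, `vector_store_ids`, `max_num_results`, `filters`) and the `gpt-4o` model name reflect my reading of the public OpenAI documentation and should be checked against it before use.

```python
from typing import Any, Optional

def build_request(
    prompt: str,
    model: str = "gpt-4o",  # example model name, an assumption
    vector_store_id: Optional[str] = None,
    max_num_results: Optional[int] = None,
    filters: Optional[dict] = None,
) -> dict[str, Any]:
    """Return keyword arguments for a Responses API call.

    Bottom rung: just a model and an input. Higher rungs add a file search
    tool and then tune it.
    """
    request: dict[str, Any] = {"model": model, "input": prompt}
    if vector_store_id is not None:
        tool: dict[str, Any] = {
            "type": "file_search",
            "vector_store_ids": [vector_store_id],
        }
        if max_num_results is not None:
            tool["max_num_results"] = max_num_results
        if filters is not None:
            tool["filters"] = filters
        request["tools"] = [tool]
    return request

def send(request: dict[str, Any]) -> str:
    """Send the request; needs the `openai` package and OPENAI_API_KEY."""
    # Imported lazily so build_request works without the SDK installed.
    from openai import OpenAI

    client = OpenAI()
    return client.responses.create(**request).output_text

simple = build_request("Summarize our refund policy.")
tuned = build_request(
    "Summarize our refund policy.",
    vector_store_id="vs_example",  # hypothetical ID
    max_num_results=5,
    filters={"type": "eq", "key": "region", "value": "us"},
)
```

The simple request stays two fields long; the extra structure only appears once a caller reaches for the knobs.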
AI Infrastructure and the Role of Startups
OpenAI is building more out-of-the-box tools (web search, file search, computer use) because users want a one-stop shop, but there remains a large market for low-level, highly flexible AI infrastructure APIs.
Verticalized AI infrastructure companies will thrive by serving specific niches, such as VMs optimized for coding AI startups (e.g., Runloop) or specialized testing environments for different operating systems.
LLM operations companies that help developers manage prompts, billing, and usage across multiple providers (e.g., OpenRouter) represent another important category.
The tools ecosystem is still early, and figuring out how to help enterprises securely deploy and observe computer use VMs in their own infrastructure is a major open problem.
What’s Still Hard for Developers
The stack-ranked problems that make working with models painful today include:
Building a robust tools ecosystem on top of the foundational Responses API.
Making eval and task creation dramatically easier (the biggest bottleneck).
Helping enterprises deploy computer use VMs securely and observably.
Model progress is expected to accelerate this year, driven by a feedback loop where models help generate better training data.
Smaller, faster models optimized for tool use, classification, and guardrailing (the “workhorse” models that sit alongside frontier models like o1) represent a significant opportunity.
What Sophisticated Developers Are Doing
The most capable users treat individual tools as steps in a larger workflow rather than expecting any single tool call to solve the problem, chaining deterministic and LLM steps together.
Multi-agent architectures make workflows easier to debug and iterate on because each agent has a narrow scope, reducing the blast radius of prompt changes.
Orchestration skill (the ability to combine tools, data, and multiple model calls, then rapidly evaluate and improve the resulting system) is seen as the most important differentiator for AI application builders over the next one to two years.
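Chaining deterministic and LLM steps can be sketched as a pipeline of small functions that each take and return a state dictionary. Everything here is hypothetical: `fake_llm` stands in for a real model call, and the in-memory order table stands in for a real system of record.

```python
import re
from typing import Callable

Step = Callable[[dict], dict]

def fake_llm(prompt: str) -> str:
    # Stand-in for a model call so the pipeline runs offline.
    return f"DRAFT: {prompt}"

def extract_order_id(state: dict) -> dict:
    """Deterministic step: pull an order number like '#1042' from the message."""
    match = re.search(r"#(\d+)", state["message"])
    return {**state, "order_id": match.group(1) if match else None}

def lookup_order(state: dict) -> dict:
    """Deterministic step: look the order up in a hypothetical in-memory table."""
    orders = {"1042": "shipped"}
    return {**state, "status": orders.get(state["order_id"], "unknown")}

def draft_reply(state: dict) -> dict:
    """LLM step: turn the structured facts into a customer-facing reply."""
    prompt = f"Tell the customer order {state['order_id']} is {state['status']}."
    return {**state, "reply": fake_llm(prompt)}

def run_pipeline(steps: list[Step], state: dict) -> dict:
    for step in steps:
        state = step(state)
    return state

result = run_pipeline(
    [extract_order_id, lookup_order, draft_reply],
    {"message": "Where is order #1042?"},
)
```

Keeping the lookup deterministic means the model only phrases facts it was handed, and each step can be tested or swapped on its own.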
Advice for Enterprise and Consumer CEOs
Start exploring frontier models and computer use now by taking a few internal manual workflows and attempting to automate them end to end with multi-agent architectures.
The biggest barrier to automation is often not the LLM itself but getting programmatic access to the tools and applications employees use; computer use can serve as a bridge while that API access is being built.
For individual contributors and managers: ask employees what their least favorite daily tasks are and target those for automation, following the historical pattern of developers automating away the bottom 20% of their work.
Quickfire
Overhyped and underhyped: Agents are both overhyped (two full hype cycles in) and underhyped (companies that figure out fully automated workflows like Deep Research gain enormous leverage).
Changed mind in the last year: The power of reasoning models combined with tool use to create products like Operator and Deep Research was underestimated; the shift from deterministic workflows to fully agentic products that reason through tool use in their chain of thought has been transformative.
Fine-tuning broadly: The ability to add custom knowledge and behavior to models post-training and see significant task-specific improvement was more impactful than expected.
Biggest differentiator for application builders: Orchestration, that is, combining tools, data, and multiple models (via RFT or multi-LLM chaining), then rapidly evaluating and improving the system.
Underexplored applications: Scientific research (where o-series models were expected to drive a step change) and robotics; the right interface for academia in particular remains an open problem.
Model progress this year vs. last: Expected to be greater, driven by a feedback loop where models help generate better training data.
Most excited startup category: AI travel agents, an entrenched industry dominated by a handful of large players where no compelling AI product exists yet.
Favorite AI tool: Granola, an AI meeting note-taker.
Where to learn more: platform.openai.com/docs, the @OpenAIDevs Twitter account, and community.openai.com.