RFT Launch, How OpenAI Improves Its Models & the State of AI Agents Today — Unsupervised Learning

Michelle Pokrass leads the Power Users Research Team at OpenAI, focused on post-training improvements for developers and advanced users. She was instrumental in shipping GPT-4.1, a model explicitly designed for real-world developer utility rather than benchmark performance. This conversation covers how GPT-4.1 was built, the current state of AI agents, the resurgence of fine-tuning (especially RFT), and how companies should position themselves amid rapid model progress.

How GPT-4.1 Was Built for Developers

The core goal was to make a model that is a “joy to use” for developers, prioritizing real-world utility over benchmark scores.
- Previous models often looked great on benchmarks but failed on basic tasks like following instructions, formatting, or handling long context.
- The team spent the first three months just gathering evals and understanding developer pain points before any model training began.
Eval-driven development: The team’s north star was an internal instruction-following eval based on real API usage and direct user feedback.
- They actively sought out “alpha users” and startups to identify what models couldn’t do yet, then hill-climbed on those specific failures.
- Evals have a short shelf life (~3 months) due to rapid progress, so the team is constantly searching for new ones.
Key improvements in 4.1:
- Instruction following: Significantly better at adhering to complex, multi-step instructions.
- Long context: Improved performance over longer inputs, though real-world long-context evals remain hard to build.
- Coding: Much stronger at locally scoped coding tasks; reduced irrelevant edits from 9% (GPT-4o) to 2%.
- UI generation: Improved capabilities for building user interfaces.
- Multimodal: Better native multimodal understanding, credited to improvements in pre-training.
Model family: GPT-4.1 comes in three sizes (standard, mini, and nano), each with a different pre-training approach: a mid-train for standard and new pre-trains for mini and nano.
- Nano is designed to be cheap and fast, opening up use cases where cost and latency are the main constraints.

The Current State of AI Agents

What works: Agents perform remarkably well in well-scoped domains where the tools are clear and user intent is unambiguous.
What doesn’t: The “fuzzy and messy real world” remains challenging—users often don’t know what agents can do, and agents lack awareness of their own capabilities or real-world context.
Key bottlenecks:
- Context injection: Getting the right information into the model is still the hardest part.
- Ambiguity handling: Models need better steerability—knowing when to ask for clarification vs. proceeding with assumptions.
- Robustness: Models can get stuck when APIs fail (e.g., 500 errors); more “grit” is needed for long-horizon tasks.
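The robustness point can be partly addressed in the scaffolding around the model. The following is a minimal sketch of harness-side retries, assuming a hypothetical tool callable that raises a hypothetical `TransientToolError` on a 5xx response; neither name comes from the conversation, and this says nothing about how the model itself is trained to persist.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientToolError(Exception):
    """Hypothetical error standing in for a 5xx response from a tool API."""


def call_with_retries(
    tool: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `tool`, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except TransientToolError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the agent loop
            # Delay doubles on each failed attempt: 0.5s, 1s, 2s, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; the attempt count and delays are illustrative defaults, not recommendations from the episode.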
Benchmark saturation: Many external agentic benchmarks are saturated or misgraded; in practice, failures often trace to the benchmark’s user model (the model simulating the user) rather than to the primary model doing the wrong thing.
Long-term tasks: Progress requires both engineering improvements (better UIs for monitoring and steering agents) and modeling improvements (robustness, reasoning).

Coding Capabilities and Benchmarks

Current state: Models are excellent at locally scoped coding (e.g., changing a library where all files are nearby) but struggle with global context across many files or extremely technical cross-file dependencies.
Front-end coding: Improved significantly, but there’s still room to match the quality and style of a senior front-end engineer.
Benchmarks:
- SWE-bench remains useful for distinguishing between models at different capability levels (e.g., 55% vs. 35%).
- Aider evals are still valuable, but many benchmarks become saturated quickly.
- The three-month shelf life of evals means teams must constantly develop new ones.

Model Family Strategy and Generalization

Targeted vs. general models: GPT-4.1 was a targeted effort for developers, allowing the team to decouple from ChatGPT’s timeline and optimize specifically for coding and instruction following.
- This meant removing some ChatGPT-specific datasets and upweighting coding data.
Future direction: OpenAI aims to simplify the model family and converge toward one general model that serves all use cases.
- The philosophy is to lean into the “G” in AGI—generalization improves capabilities.
- However, there’s room for both targeted and general approaches depending on the problem.

How Companies Should Navigate Rapid Model Progress

Best practices for staying current:
- Build strong evals for your specific use case and run them on new models as they drop.
- Be prepared to switch prompts and scaffolding to tune for particular models.
- Build products that are “just out of reach” of current models—when new models arrive, you’ll be first to market.
Heuristic for what’s “just out of reach”: If fine-tuning improves performance from 10% to 50%, the capability is likely on the cusp and will probably be solved by a future base model in a few months.
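The heuristic above can be written down as a small check over eval pass rates. This is only a sketch: the function name and both thresholds (`low`, `min_gain`) are assumptions chosen so that the 10% to 50% example from the conversation qualifies; they are not numbers given in the episode.

```python
def on_the_cusp(
    base_pass_rate: float,
    tuned_pass_rate: float,
    low: float = 0.2,       # assumed: "weak baseline" means at most 20%
    min_gain: float = 0.3,  # assumed: "large jump" means at least 30 points
) -> bool:
    """True when a weak baseline improves substantially after fine-tuning."""
    for rate in (base_pass_rate, tuned_pass_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("pass rates must be in [0, 1]")
    gain = tuned_pass_rate - base_pass_rate
    return base_pass_rate <= low and gain >= min_gain


# The example from the text: fine-tuning lifts the eval from 10% to 50%.
print(on_the_cusp(0.10, 0.50))
```

A capability that is already strong, or one that barely moves under fine-tuning, would not be flagged by this check.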
Scaffolding: It’s worth building scaffolding to make products work today, but be prepared to remove it as models improve.
- Trends to watch: context windows, reasoning, instruction following, and multimodal capabilities are all improving.
- Don’t over-optimize for current limitations that will soon be resolved.

Fine-Tuning Renaissance

Two camps of fine-tuning:
1. Speed and latency: SFT (supervised fine-tuning) to get faster, cheaper versions of capable models.
2. Frontier capabilities: RFT (reinforcement fine-tuning) to push the frontier in specific domains with limited data.
RFT (Reinforcement Fine-Tuning):
- Uses the same RL process OpenAI uses internally for model improvement.
- Extremely data-efficient—can work with ~100 samples.
- Less fragile than SFT and better for pushing capabilities in niche domains.
- Shipping to general availability soon.
When to use RFT:
- When no model on the market does what you need.
- In domains with verifiable outcomes (e.g., chip design, drug discovery, math, code).
- For teaching agents to pick workflows or decision processes.
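"Verifiable outcomes" means a program can score a model's answer without human judgment. The sketch below shows that idea for a numeric math task, assuming the model is asked to finish with a line of the form `ANSWER: <number>`. The output convention and the `grade` function are illustrative assumptions, not OpenAI's grader interface.

```python
import re

# Assumed output convention: the final answer appears as "ANSWER: <number>".
_ANSWER_RE = re.compile(r"ANSWER:\s*(-?\d+(?:\.\d+)?)")


def grade(model_output: str, reference: float, tol: float = 1e-6) -> float:
    """Return 1.0 if the last stated answer matches the reference, else 0.0."""
    matches = _ANSWER_RE.findall(model_output)
    if not matches:
        return 0.0  # no parseable answer counts as a failure
    predicted = float(matches[-1])
    return 1.0 if abs(predicted - reference) <= tol else 0.0
```

Domains such as code fit the same pattern, with the grader running tests instead of comparing numbers.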
Preference fine-tuning: Better for stylistic adjustments rather than capability improvements.

Multimodal and Domain-Specific Models

Multimodal: GPT-4.1’s multimodal capabilities are significantly improved and underhyped; many tasks that didn’t work in GPT-4o now work.
Domain-specific models: The trend is toward generalization rather than separate foundation models for robotics, biology, etc.
- Combining everything into one model produces better results.
- Robotics remains an open question, but the internal trend favors unified models.

Choosing the Right Model

Consumer use (ChatGPT):
- GPT-4o for general chat and conversation.
- GPT-4.5 for writing and creative tasks.
- o3 for hard math problems or tasks requiring deep reasoning (e.g., taxes).
Enterprise/API use:
- Start with GPT-4.1; if it works, move to mini or nano for speed and cost.
- If 4.1 isn’t sufficient, try o4-mini for reasoning, then o3 for harder problems.
- If none of those work, use RFT with o4-mini to push capabilities in your domain.
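The API guidance above amounts to a small decision ladder. A minimal sketch, assuming a caller-supplied `passes(model)` function that runs your own eval suite and reports whether a model clears your quality bar; the model identifier strings are used here only as labels.

```python
from typing import Callable, Optional

# Cheaper GPT-4.1 siblings, cheapest first (identifiers are illustrative labels).
CHEAPER = ["gpt-4.1-nano", "gpt-4.1-mini"]
# Reasoning models to escalate to when GPT-4.1 is not sufficient.
ESCALATION = ["o4-mini", "o3"]


def choose_model(passes: Callable[[str], bool]) -> Optional[str]:
    """Pick a model following the ladder described in the text.

    Returns None when nothing passes, which is the case where the text
    suggests reinforcement fine-tuning on o4-mini.
    """
    if passes("gpt-4.1"):
        # GPT-4.1 works: see whether a smaller sibling is also good enough.
        for model in CHEAPER:
            if passes(model):
                return model
        return "gpt-4.1"
    for model in ESCALATION:
        if passes(model):
            return model
    return None
```

The ordering inside `CHEAPER` is a choice made for this sketch; the conversation only says to move to mini or nano once GPT-4.1 works.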

Prompting Techniques

Effective prompting for 4.1:
- Use XML or structured formats for prompts.
- Tell the model to “keep going” until it solves the problem—this significantly improves performance.
  - The team is working to make this unnecessary in future models.
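Both tips can be combined in a small prompt builder. This is a sketch under assumptions: the tag names (`instructions`, `context`, `task`) and the exact wording of the persistence reminder are invented for illustration and are not quoted from the episode or from OpenAI's documentation.

```python
from xml.sax.saxutils import escape

# Assumed wording for the "keep going" instruction.
PERSISTENCE_REMINDER = (
    "Keep going until the problem is fully solved before ending your turn."
)


def build_prompt(instructions: str, context: str, task: str) -> str:
    """Wrap each part of the prompt in its own XML-style section."""
    sections = {
        "instructions": f"{instructions}\n{PERSISTENCE_REMINDER}",
        "context": context,
        "task": task,
    }
    # Escaping &, <, > keeps user-supplied text from colliding with the tags.
    return "\n".join(
        f"<{tag}>\n{escape(body)}\n</{tag}>" for tag, body in sections.items()
    )


print(build_prompt("Fix the failing test.", "if a < b: ...", "Edit utils.py"))
```

Escaping is a design choice in this sketch; for code-heavy context, some teams may prefer to leave the text raw and pick delimiters that cannot appear in it.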
Generalization in training: The model was trained on ~12 different diff formats to avoid burning in any specific format, ensuring it works well out of the box.

What Sophisticated Companies Do Well

Granular evals: The best companies break their problem into subcomponents and measure model performance on each part separately (e.g., SQL table selection vs. column selection).
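For the text-to-SQL example, a granular eval might score each stage separately as well as end to end. A minimal sketch, assuming each eval case records the expected and predicted sets of tables and columns; the `SqlCase` structure and metric names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class SqlCase:
    """One eval case for a hypothetical text-to-SQL pipeline."""
    expected_tables: FrozenSet[str]
    predicted_tables: FrozenSet[str]
    expected_columns: FrozenSet[str]
    predicted_columns: FrozenSet[str]


def component_accuracy(cases: List[SqlCase]) -> Dict[str, float]:
    """Exact-match accuracy for table selection, column selection, and both."""
    if not cases:
        raise ValueError("need at least one eval case")
    n = len(cases)
    tables_ok = [c.predicted_tables == c.expected_tables for c in cases]
    columns_ok = [c.predicted_columns == c.expected_columns for c in cases]
    return {
        "tables": sum(tables_ok) / n,
        "columns": sum(columns_ok) / n,
        "end_to_end": sum(t and c for t, c in zip(tables_ok, columns_ok)) / n,
    }
```

Reporting the stages separately shows whether a low end-to-end score comes from picking the wrong tables or from picking the wrong columns within the right tables.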
Modular systems: Building systems that are easy to plug different solutions into, allowing for faster iteration and tuning.

Future Research Directions

Using models to make models better: Leveraging model signals in reinforcement learning to determine if training is on the right track.
Synthetic data: An incredibly powerful trend for improving models.
Speed of iteration: Reducing the number of GPUs and time needed to run experiments, enabling faster research cycles.
Agent scaling: Deep Research and Operator were each trained deeply on specific tools, but the future is training models to be great at all kinds of tools simultaneously.
- o3 already demonstrates this: it can do Deep Research-style work, but faster and more flexibly.

Combining Model Families (GPT-5)

The challenge: Combining the conversational strengths of the GPT-4o series with the reasoning capabilities of the o3 series into one model.
- GPT-4o is great for chat, tone matching, and being a sounding board.
- o3 is great for hard reasoning but not for casual conversation.
Zero-sum trade-offs: Tailoring a model for one use case (e.g., coding) can reduce performance on another (e.g., chat).
The goal: Train a model that is a delightful conversationalist but also knows when to reason deeply.

Personalization and Power Users

Personality: Models are becoming more personalized through memory and custom instructions.
- Enhanced memory allows models to adapt to individual users over time.
- Steerability (e.g., “don’t use capital letters” or “don’t ask follow-up questions”) will become more prominent.
Power users: The team focuses on power users because what they do today, median users will do a year from now.
- Power users include developers, advanced ChatGPT users, and anyone pushing models to their limits.

Michelle’s Journey at OpenAI

Background: Joined OpenAI 2.5 years ago on the API engineering team, with prior experience building high-frequency trading systems at Coinbase.
Transition to research: Moved to the model side to focus on improving models for developers, starting with structured outputs.
Team evolution: Formed and now leads the Power Users Research Team, which focuses on both API developers and advanced ChatGPT users.
Organizational changes: The pace of shipping remains remarkably fast, but the organization has grown so large that it’s impossible to have context on everything happening across the company.

Quickfire

Overhyped: Benchmarks, especially agentic ones, which are often saturated or reported with best-case numbers.
Underhyped: Using your own real-world evals and usage data to measure what’s actually working.
Changed mind on: Fine-tuning—previously skeptical, now believes RFT is worth the time for pushing frontiers in specific domains.
Model progress this year: Will be about the same as last year: fast, but not a fast takeoff.
Excited about outside OpenAI: AI products that extend beyond the digital world, like Levels (health/fitness) and Whoop (health insights).
Where to learn more: OpenAI’s GPT-4.1 blog post; Michelle is on Twitter and welcomes feedback by email.
