RFT Launch, How OpenAI Improves Its Models & the State of AI Agents Today

Unsupervised Learning 47min 7 min #40

Summary

  • Michelle Pokrass leads the Power Users Research Team at OpenAI, focused on post-training improvements for developers and advanced users. She was instrumental in shipping GPT-4.1, a model explicitly designed for real-world developer utility rather than benchmark performance. This conversation covers how GPT-4.1 was built, the current state of AI agents, the resurgence of fine-tuning (especially RFT), and how companies should position themselves amid rapid model progress.

How GPT-4.1 Was Built for Developers

  • The core goal was to make a model that is a “joy to use” for developers, prioritizing real-world utility over benchmark scores.
    • Previous models often looked great on benchmarks but failed on basic tasks like following instructions, formatting, or handling long context.
    • The team spent the first three months just gathering evals and understanding developer pain points before any model training began.
  • Eval-driven development: The team’s north star was an internal instruction-following eval based on real API usage and direct user feedback.
    • They actively sought out “alpha users” and startups to identify what models couldn’t do yet, then hill-climbed on those specific failures.
    • Evals have a short shelf life (~3 months) due to rapid progress, so the team is constantly searching for new ones.
  • Key improvements in 4.1:
    • Instruction following: Significantly better at adhering to complex, multi-step instructions.
    • Long context: Improved performance over longer inputs, though real-world long-context evals remain hard to build.
    • Coding: Much stronger at locally scoped coding tasks; reduced irrelevant edits from 9% (GPT-4o) to 2%.
    • UI generation: Improved capabilities for building user interfaces.
    • Multimodal: Better native multimodal understanding, credited to improvements in pre-training.
  • Model family: GPT-4.1 comes in three sizes (standard, mini, and nano), and they were trained differently: the standard model is a mid-train, while mini and nano are new pre-trains.
    • Nano is designed to be cheap and fast, extending the family to the low end of the cost-latency curve to spur broader adoption.

The Current State of AI Agents

  • What works: Agents perform remarkably well in well-scoped domains where the tools are clear and user intent is unambiguous.
  • What doesn’t: The “fuzzy and messy real world” remains challenging—users often don’t know what agents can do, and agents lack awareness of their own capabilities or real-world context.
  • Key bottlenecks:
    • Context injection: Getting the right information into the model is still the hardest part.
    • Ambiguity handling: Models need better steerability—knowing when to ask for clarification vs. proceeding with assumptions.
    • Robustness: Models can get stuck when APIs fail (e.g., 500 errors); more “grit” is needed for long-horizon tasks.
  • Benchmark saturation: Many external agentic benchmarks are saturated or misgraded; many of the recorded failures trace back to the model that simulates the user rather than to the primary model doing the wrong thing.
  • Long-term tasks: Progress requires both engineering improvements (better UIs for monitoring and steering agents) and modeling improvements (robustness, reasoning).
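
One way to add the "grit" mentioned under Robustness is to handle it in the scaffolding around the agent. The sketch below retries a tool call on 5xx errors with exponential backoff. `ToolError`, `call_with_retries`, and the fake `flaky_search` tool are hypothetical names for illustration; this is not how any OpenAI model or SDK handles failures.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ToolError(Exception):
    """Hypothetical error raised by a tool call, carrying an HTTP-style status."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(f"{status} {message}".strip())
        self.status = status


def call_with_retries(
    tool: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `tool`, retrying on 5xx errors with exponential backoff.

    Other errors are re-raised immediately, since retrying a bad request
    will not help.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except ToolError as err:
            retryable = 500 <= err.status < 600
            if not retryable or attempt == max_attempts:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")


# Usage: a fake tool that fails twice with a 500, then succeeds.
calls = {"n": 0}


def flaky_search() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ToolError(500, "internal error")
    return "results"


delays: list[float] = []
# Record the delays instead of actually sleeping, to keep the example fast.
result = call_with_retries(flaky_search, sleep=delays.append)
```

Passing `sleep` as a parameter is a design choice that keeps the retry policy testable without real waiting.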

Coding Capabilities and Benchmarks

  • Current state: Models are excellent at locally scoped coding (e.g., changing a library where all files are nearby) but struggle with global context across many files or extremely technical cross-file dependencies.
  • Front-end coding: Improved significantly, but there’s still room to match the quality and style of a senior front-end engineer.
  • Benchmarks:
    • SWE-bench remains useful for distinguishing between models at different capability levels (e.g., 55% vs. 35%).
    • Aider evals are still valuable, but many benchmarks become saturated quickly.
    • The three-month shelf life of evals means teams must constantly develop new ones.

Model Family Strategy and Generalization

  • Targeted vs. general models: GPT-4.1 was a targeted effort for developers, allowing the team to decouple from ChatGPT’s timeline and optimize specifically for coding and instruction following.
    • This meant removing some ChatGPT-specific datasets and upweighting coding data.
  • Future direction: OpenAI aims to simplify the model family and converge toward one general model that serves all use cases.
    • The philosophy is to lean into the “G” in AGI—generalization improves capabilities.
    • However, there’s room for both targeted and general approaches depending on the problem.

How Companies Should Navigate Rapid Model Progress

  • Best practices for staying current:
    • Build strong evals for your specific use case and run them on new models as they drop.
    • Be prepared to switch prompts and scaffolding to tune for particular models.
    • Build products that are “just out of reach” of current models—when new models arrive, you’ll be first to market.
  • Heuristic for what’s “just out of reach”: If fine-tuning improves performance from 10% to 50%, the capability is likely on the cusp and will probably be solved by a future base model in a few months.
  • Scaffolding: It’s worth building scaffolding to make products work today, but be prepared to remove it as models improve.
    • Trends to watch: context windows, reasoning, instruction following, and multimodal capabilities are all improving.
    • Don’t over-optimize for current limitations that will soon be resolved.
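
The eval and "just out of reach" advice above can be sketched as a small harness. It assumes a model is any callable from prompt to answer (in practice it would wrap an API call) and that exact-match grading is good enough. `pass_rate` and `on_the_cusp` are hypothetical helpers, and the 10% and 50% defaults simply mirror the example figures in the summary rather than calibrated thresholds.

```python
from typing import Callable, Sequence

# A case is (prompt, expected answer); a model is any prompt -> answer callable.
Case = tuple[str, str]
Model = Callable[[str], str]


def pass_rate(model: Model, cases: Sequence[Case]) -> float:
    """Fraction of cases where the model's answer exactly matches the expected one."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return passed / len(cases)


def on_the_cusp(base: float, tuned: float, low: float = 0.10, high: float = 0.50) -> bool:
    """True when fine-tuning lifts a weak baseline (<= low) to a usable level (>= high)."""
    return base <= low and tuned >= high


# Usage with toy stand-ins for a base model and a fine-tuned model.
cases = [(f"q{i}", f"a{i}") for i in range(10)]


def base_model(prompt: str) -> str:
    # Only knows the answer to the first question.
    return "a0" if prompt == "q0" else "?"


def tuned_model(prompt: str) -> str:
    # Knows the answers to the first five questions.
    i = int(prompt[1:])
    return f"a{i}" if i < 5 else "?"


base_score = pass_rate(base_model, cases)
tuned_score = pass_rate(tuned_model, cases)
cusp = on_the_cusp(base_score, tuned_score)
```

Rerunning `pass_rate` with the same `cases` whenever a new model ships gives a like-for-like comparison across releases.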

Fine-Tuning Renaissance

  • Two camps of fine-tuning:
    1. Speed and latency: SFT (supervised fine-tuning) to get faster, cheaper versions of capable models.
    2. Frontier capabilities: RFT (reinforcement fine-tuning) to push the frontier in specific domains with limited data.
  • RFT (Reinforcement Fine-Tuning):
    • Uses the same RL process OpenAI uses internally for model improvement.
    • Extremely data-efficient—can work with ~100 samples.
    • Less fragile than SFT and better for pushing capabilities in niche domains.
    • Shipping to general availability soon.
  • When to use RFT:
    • When no model on the market does what you need.
    • In domains with verifiable outcomes (e.g., chip design, drug discovery, math, code).
    • For teaching agents to pick workflows or decision processes.
  • Preference fine-tuning: Better for stylistic adjustments rather than capability improvements.
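
RFT depends on outcomes that can be checked automatically. As a rough illustration of a "verifiable outcome", the sketch below scores a free-text answer against a known numeric result. `grade_numeric` is a hypothetical helper written for this summary; it is not OpenAI's grader format or API, only the kind of check a grader might perform.

```python
import math
import re

# Matches integers and simple decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def grade_numeric(output: str, reference: float, rel_tol: float = 1e-6) -> float:
    """Return 1.0 if the last number in `output` matches `reference`, else 0.0.

    The last number is used on the assumption that a model states its final
    answer at the end of its response.
    """
    matches = _NUMBER.findall(output)
    if not matches:
        return 0.0
    answer = float(matches[-1])
    return 1.0 if math.isclose(answer, reference, rel_tol=rel_tol) else 0.0


def mean_reward(outputs: list[str], reference: float) -> float:
    """Average reward over several sampled outputs for the same problem."""
    if not outputs:
        raise ValueError("no outputs to grade")
    return sum(grade_numeric(o, reference) for o in outputs) / len(outputs)


# Usage: four sampled answers to a problem whose correct answer is 42.
samples = [
    "x = 3, so y = 42.",
    "The answer is 41",
    "First I tried 40, but the result is 42",
    "I am not sure.",
]
reward = mean_reward(samples, 42.0)
```

A binary reward keeps the example simple; a real grader might give partial credit or check intermediate steps.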

Multimodal and Domain-Specific Models

  • Multimodal: GPT-4.1’s multimodal capabilities are significantly improved and underhyped; many tasks that didn’t work in GPT-4o now work.
  • Domain-specific models: The trend is toward generalization rather than separate foundation models for robotics, biology, etc.
    • Combining everything into one model produces better results.
    • Robotics remains an open question, but the internal trend favors unified models.

Choosing the Right Model

  • Consumer use (ChatGPT):
    • GPT-4o for general chat and conversation.
    • GPT-4.5 for writing and creative tasks.
    • o3 for hard math problems or tasks requiring deep reasoning (e.g., taxes).
  • Enterprise/API use:
    • Start with GPT-4.1; if it works, move to mini or nano for speed and cost.
    • If 4.1 isn’t sufficient, try o4-mini for reasoning, then o3 for harder problems.
    • If none of those work, use RFT with o4-mini to push capabilities in your domain.
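
The enterprise ladder above can be written as a small decision function. This is a sketch under assumptions: `score` is a hypothetical callable that returns your own eval pass rate for a model name, the returned strings are labels only, and trying nano before mini is a choice made here, since the summary only says to move to mini or nano.

```python
from typing import Callable


def choose_model(score: Callable[[str], float], threshold: float) -> str:
    """Walk the ladder from the summary using a user-supplied eval score."""
    if score("gpt-4.1") >= threshold:
        # 4.1 works: prefer the smallest variant that still clears the bar.
        for smaller in ("gpt-4.1-nano", "gpt-4.1-mini"):
            if score(smaller) >= threshold:
                return smaller
        return "gpt-4.1"
    # 4.1 falls short: escalate to reasoning models, cheaper one first.
    for reasoning in ("o4-mini", "o3"):
        if score(reasoning) >= threshold:
            return reasoning
    # Nothing clears the bar: the task is a candidate for RFT on o4-mini.
    return "o4-mini + RFT"


# Usage with made-up eval scores for one task.
eval_scores = {
    "gpt-4.1": 0.90,
    "gpt-4.1-mini": 0.85,
    "gpt-4.1-nano": 0.60,
    "o4-mini": 0.95,
    "o3": 0.97,
}
choice = choose_model(eval_scores.__getitem__, threshold=0.8)
```

With these made-up scores, nano misses the bar and mini clears it, so `choice` is `"gpt-4.1-mini"`.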

Prompting Techniques

  • Effective prompting for 4.1:
    • Use XML or structured formats for prompts.
    • Tell the model to “keep going” until it solves the problem—this significantly improves performance.
      • The team is working to make this unnecessary in future models.
  • Generalization in training: The model was trained on ~12 different diff formats to avoid burning in any specific format, ensuring it works well out of the box.
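
A minimal sketch of the two prompting tips above: wrap the prompt in XML-style sections and append a persistence reminder. The tag names, the wording of `PERSISTENCE`, and `build_prompt` itself are assumptions made for illustration, not an official prompt template.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical wording for the "keep going" reminder.
PERSISTENCE = (
    "Keep going until the problem is fully solved; "
    "do not stop or hand back control early."
)


def build_prompt(instructions: list[str], context: str, task: str) -> str:
    """Assemble an XML-delimited prompt with a persistence reminder as the last rule."""
    rules = "\n".join(
        f"  <rule>{escape(rule)}</rule>" for rule in instructions + [PERSISTENCE]
    )
    return (
        "<prompt>\n"
        f"<instructions>\n{rules}\n</instructions>\n"
        f"<context>{escape(context)}</context>\n"
        f"<task>{escape(task)}</task>\n"
        "</prompt>"
    )


# Usage: special characters in the context are escaped, so the result parses.
prompt = build_prompt(
    instructions=["Reply in JSON."],
    context="a < b & c",
    task="Fix the failing test.",
)
root = ET.fromstring(prompt)
```

Escaping user-supplied text keeps the section boundaries unambiguous even when the context itself contains angle brackets.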

What Sophisticated Companies Do Well

  • Granular evals: The best companies break their problem into subcomponents and measure model performance on each part separately (e.g., SQL table selection vs. column selection).
  • Modular systems: Building systems that are easy to plug different solutions into, allowing for faster iteration and tuning.
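
To make the granular-evals point concrete, the sketch below scores a text-to-SQL system separately on table selection and column selection, then end to end. The `SqlPrediction` type and the sample data are invented for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SqlPrediction:
    """Tables and columns referenced by a generated (or gold) SQL query."""

    tables: frozenset[str]
    columns: frozenset[str]


def component_accuracy(
    preds: Sequence[SqlPrediction], golds: Sequence[SqlPrediction]
) -> dict[str, float]:
    """Exact-match accuracy per component, plus end-to-end accuracy."""
    if not golds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and the same length")
    n = len(golds)
    pairs = list(zip(preds, golds))
    return {
        "tables": sum(p.tables == g.tables for p, g in pairs) / n,
        "columns": sum(p.columns == g.columns for p, g in pairs) / n,
        "end_to_end": sum(p == g for p, g in pairs) / n,
    }


# Usage: four cases that share the same gold answer.
gold = SqlPrediction(frozenset({"orders"}), frozenset({"id", "total"}))
golds = [gold] * 4
preds = [
    gold,                                                       # fully correct
    gold,                                                       # fully correct
    SqlPrediction(frozenset({"orders"}), frozenset({"id"})),    # wrong columns
    SqlPrediction(frozenset({"users"}), frozenset({"name"})),   # wrong tables and columns
]
report = component_accuracy(preds, golds)
```

Here table selection scores 0.75 while column selection scores 0.5, which points at column selection as the weaker component even though the end-to-end number alone (0.5) would not say why.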

Future Research Directions

  • Using models to make models better: Leveraging model signals in reinforcement learning to determine if training is on the right track.
  • Synthetic data: An incredibly powerful trend for improving models.
  • Speed of iteration: Reducing the number of GPUs and time needed to run experiments, enabling faster research cycles.
  • Agent scaling: Deep Research and Operator were each trained deeply on specific tools, but the future is training models to be great at all kinds of tools simultaneously.
    • o3 already demonstrates this: it can do Deep Research-style work, but faster and more flexibly.

Combining Model Families (GPT-5)

  • The challenge: Combining the conversational strengths of the GPT-4o series with the reasoning capabilities of the o3 series into one model.
    • GPT-4o is great for chat, tone matching, and being a sounding board.
    • o3 is great for hard reasoning but not for casual conversation.
  • Zero-sum trade-offs: Tailoring a model for one use case (e.g., coding) can reduce performance on another (e.g., chat).
  • The goal: Train a model that is a delightful conversationalist but also knows when to reason deeply.

Personalization and Power Users

  • Personality: Models are becoming more personalized through memory and custom instructions.
    • Enhanced memory allows models to adapt to individual users over time.
    • Steerability (e.g., “don’t use capital letters” or “don’t ask follow-up questions”) will become more prominent.
  • Power users: The team focuses on power users because what they do today, median users will do a year from now.
    • Power users include developers, advanced ChatGPT users, and anyone pushing models to their limits.

Michelle’s Journey at OpenAI

  • Background: Joined OpenAI 2.5 years ago on the API engineering team, with prior experience building high-frequency trading systems at Coinbase.
  • Transition to research: Moved to the model side to focus on improving models for developers, starting with structured outputs.
  • Team evolution: Formed and now leads the Power Users Research Team, which focuses on both API developers and advanced ChatGPT users.
  • Organizational changes: The pace of shipping remains remarkably fast, but the organization has grown so large that it’s impossible to have context on everything happening across the company.

Quickfire

  • Overhyped: Benchmarks, especially agentic ones, which are often saturated or reported with best-case numbers.
  • Underhyped: Using your own real-world evals and usage data to measure what’s actually working.
  • Changed mind on: Fine-tuning—previously skeptical, now believes RFT is worth the time for pushing frontiers in specific domains.
  • Model progress this year: About the same pace as last year; fast, but not a fast takeoff.
  • Excited about outside OpenAI: AI products that extend beyond the digital world, like Levels (health/fitness) and Whoop (health insights).
  • Where to learn more: OpenAI’s GPT-4.1 blog post; Michelle is on Twitter and welcomes feedback by email.