Michelle Pokrass leads the Power Users Research Team at OpenAI, focused on post-training improvements for developers and advanced users. She was instrumental in shipping GPT-4.1, a model explicitly designed for real-world developer utility rather than benchmark performance. This conversation covers how GPT-4.1 was built, the current state of AI agents, the resurgence of fine-tuning (especially RFT), and how companies should position themselves amid rapid model progress.
How GPT-4.1 Was Built for Developers
The core goal was to make a model that is a “joy to use” for developers, prioritizing real-world utility over benchmark scores.
Previous models often looked great on benchmarks but failed on basic tasks like following instructions, formatting, or handling long context.
The team spent the first three months just gathering evals and understanding developer pain points before any model training began.
Eval-driven development: The team’s north star was an internal instruction-following eval based on real API usage and direct user feedback.
They actively sought out “alpha users” and startups to identify what models couldn’t do yet, then hill-climbed on those specific failures.
Evals have a short shelf life (~3 months) due to rapid progress, so the team is constantly searching for new ones.
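As a rough illustration of eval-driven development, the sketch below scores any model (represented as a plain function from prompt to reply) against a small set of instruction-following cases. The cases, the `EvalCase` type, and the two stub models are hypothetical placeholders invented for this example; they are not OpenAI's internal eval, and a real case set would come from your own usage data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # programmatic pass/fail on the model's reply


# Invented placeholder cases; real ones would be drawn from actual usage.
CASES: List[EvalCase] = [
    EvalCase("Reply with exactly the word OK.", lambda r: r.strip() == "OK"),
    EvalCase("List three colors, one per line.",
             lambda r: len(r.strip().splitlines()) == 3),
]


def run_eval(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases whose check passes for this model."""
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


def compare(models: Dict[str, Callable[[str], str]],
            cases: List[EvalCase]) -> Dict[str, float]:
    """Run the same cases against every model so results are comparable."""
    return {name: run_eval(fn, cases) for name, fn in models.items()}


# Stub "models" standing in for real API calls (assumption for the demo).
def always_ok(prompt: str) -> str:
    return "OK"


def follows_format(prompt: str) -> str:
    return "OK" if "OK" in prompt else "red\ngreen\nblue"


if __name__ == "__main__":
    print(compare({"always_ok": always_ok, "follows_format": follows_format}, CASES))
```

Keeping the model behind a plain callable means the same cases can be rerun unchanged when a new model ships, and retired cases can be swapped out as they saturate.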
Key improvements in 4.1:
Instruction following: Significantly better at adhering to complex, multi-step instructions.
Long context: Improved performance over longer inputs, though real-world long-context evals remain hard to build.
Coding: Much stronger at locally scoped coding tasks; reduced irrelevant edits from 9% (GPT-4o) to 2%.
UI generation: Improved capabilities for building user interfaces.
Multimodal: Better native multimodal understanding, credited to improvements in pre-training.
Model family: GPT-4.1 comes in three sizes—standard, mini, and nano—each with different pre-training approaches (mid-train for standard, new pre-trains for mini and nano).
Nano is designed to be cheap and fast, with the aim of spurring broader AI adoption at the low-cost, low-latency end of the cost-latency curve.
The Current State of AI Agents
What works: Agents perform remarkably well in well-scoped domains where the tools are clear and user intent is unambiguous.
What doesn’t: The “fuzzy and messy real world” remains challenging—users often don’t know what agents can do, and agents lack awareness of their own capabilities or real-world context.
Key bottlenecks:
Context injection: Getting the right information into the model is still the hardest part.
Ambiguity handling: Models need better steerability—knowing when to ask for clarification vs. proceeding with assumptions.
Robustness: Models can get stuck when APIs fail (e.g., 500 errors); more “grit” is needed for long-horizon tasks.
Benchmark saturation: Many external agentic benchmarks are saturated or misgraded; on inspection, failures often trace to the user model (the model that plays the user in the benchmark) rather than to the primary model doing the wrong thing.
Long-term tasks: Progress requires both engineering improvements (better UIs for monitoring and steering agents) and modeling improvements (robustness, reasoning).
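On the engineering side, one common way to keep an agent from stalling on a transient API failure is to retry the tool call with backoff and, if it still fails, hand the model a structured error it can reason about. The sketch below is one possible shape for that scaffolding, not something described in the conversation; `TransientToolError`, `call_tool_with_retries`, and the result dictionary layout are all hypothetical names.

```python
import time
from typing import Any, Callable, Dict


class TransientToolError(Exception):
    """Hypothetical stand-in for a retryable failure such as an HTTP 500."""


def call_tool_with_retries(
    tool: Callable[[], Any],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call `tool`, retrying transient failures with exponential backoff.

    Always returns a dictionary instead of raising, so the agent loop can
    pass the outcome (success or error) back to the model as an observation.
    """
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"ok": True, "result": tool(), "attempts": attempt}
        except TransientToolError as exc:
            last_error = str(exc)
            if attempt < max_attempts:
                # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    return {"ok": False, "error": last_error, "attempts": max_attempts}
```

The `sleep` parameter is injectable so the wrapper can be exercised in tests without waiting on real delays.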
Coding Capabilities and Benchmarks
Current state: Models are excellent at locally scoped coding (e.g., changing a library where all files are nearby) but struggle with global context across many files or extremely technical cross-file dependencies.
Front-end coding: Improved significantly, but there’s still room to match the quality and style of a senior front-end engineer.
Benchmarks:
SWE-bench remains useful for distinguishing between models at different capability levels (e.g., 55% vs. 35%).
Aider evals are still valuable, but many benchmarks become saturated quickly.
The three-month shelf life of evals means teams must constantly develop new ones.
Model Family Strategy and Generalization
Targeted vs. general models: GPT-4.1 was a targeted effort for developers, allowing the team to decouple from ChatGPT’s timeline and optimize specifically for coding and instruction following.
This meant removing some ChatGPT-specific datasets and upweighting coding data.
Future direction: OpenAI aims to simplify the model family and converge toward one general model that serves all use cases.
The philosophy is to lean into the “G” in AGI—generalization improves capabilities.
However, there’s room for both targeted and general approaches depending on the problem.
How Companies Should Navigate Rapid Model Progress
Best practices for staying current:
Build strong evals for your specific use case and run them on new models as they drop.
Be prepared to switch prompts and scaffolding to tune for particular models.
Build products that are “just out of reach” of current models—when new models arrive, you’ll be first to market.
Heuristic for what’s “just out of reach”: If fine-tuning improves performance from 10% to 50%, the capability is likely on the cusp and will probably be solved by a future base model in a few months.
Scaffolding: It’s worth building scaffolding to make products work today, but be prepared to remove it as models improve.
Trends to watch: context windows, reasoning, instruction following, and multimodal capabilities are all improving.
Don’t over-optimize for current limitations that will soon be resolved.
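The "just out of reach" heuristic can be written down as a simple check. The sketch below encodes the 10% to 50% example from the notes; the specific thresholds (`max_base_pct` and `min_gain_pct`) are assumptions chosen to match that one example, not numbers stated as rules in the conversation.

```python
def is_just_out_of_reach(
    base_pass_pct: int,
    tuned_pass_pct: int,
    max_base_pct: int = 25,   # assumption: the base model mostly fails today
    min_gain_pct: int = 40,   # assumption: fine-tuning gives a large jump
) -> bool:
    """Return True when fine-tuning lifts a mostly failing task by a large margin.

    Pass rates are whole percentages (0 to 100) measured on your own eval.
    """
    gain = tuned_pass_pct - base_pass_pct
    return base_pass_pct <= max_base_pct and gain >= min_gain_pct


if __name__ == "__main__":
    # The example from the notes: 10% before fine-tuning, 50% after.
    print(is_just_out_of_reach(10, 50))
```

A task that already passes most of the time, or that barely moves under fine-tuning, would not be flagged by this check.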
Fine-Tuning Renaissance
Two camps of fine-tuning:
Speed and latency: SFT (supervised fine-tuning) to get faster, cheaper versions of capable models.
Frontier capabilities: RFT (reinforcement fine-tuning) to push the frontier in specific domains with limited data.
RFT (Reinforcement Fine-Tuning):
Uses the same RL process OpenAI uses internally for model improvement.
Extremely data-efficient—can work with ~100 samples.
Less fragile than SFT and better for pushing capabilities in niche domains.
Shipping to general availability soon.
When to use RFT:
When no model on the market does what you need.
In domains with verifiable outcomes (e.g., chip design, drug discovery, math, code).
For teaching agents to pick workflows or decision processes.
Preference fine-tuning: Better suited to stylistic adjustments than to capability improvements.
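RFT depends on outcomes that can be verified automatically, which in practice means a grading function that turns a model output into a score. The sketch below is a generic illustration of such a grader for numeric answers; it is plain Python, not OpenAI's grader API, and the function names and the "last number in the output" convention are assumptions made for the example.

```python
import re
from typing import List, Tuple

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def grade_numeric_answer(model_output: str, reference: float,
                         tolerance: float = 1e-6) -> float:
    """Score 1.0 if the last number in the output matches the reference, else 0.0."""
    numbers = NUMBER.findall(model_output)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - reference) <= tolerance else 0.0


def mean_reward(samples: List[Tuple[str, float]]) -> float:
    """Average score over (model_output, reference) pairs."""
    return sum(grade_numeric_answer(out, ref) for out, ref in samples) / len(samples)


if __name__ == "__main__":
    print(mean_reward([("The answer is 42", 42.0), ("I think it is 41", 42.0)]))
```

Domains such as code or chip design would swap in a different verifier (for example, running tests), but the shape stays the same: output in, score out.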
Multimodal and Domain-Specific Models
Multimodal: GPT-4.1’s multimodal capabilities are significantly improved and underhyped; many tasks that didn’t work in GPT-4o now work.
Domain-specific models: The trend is toward generalization rather than separate foundation models for robotics, biology, etc.
Combining everything into one model produces better results.
Robotics remains an open question, but the internal trend favors unified models.
Choosing the Right Model
Consumer use (ChatGPT):
GPT-4o for general chat and conversation.
GPT-4.5 for writing and creative tasks.
o3 for hard math problems or tasks requiring deep reasoning (e.g., taxes).
Enterprise/API use:
Start with GPT-4.1; if it works, move to mini or nano for speed and cost.
If 4.1 isn’t sufficient, try o4-mini for reasoning, then o3 for harder problems.
If none of those work, use RFT with o4-mini to push capabilities in your domain.
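The enterprise/API guidance above amounts to an escalation ladder, sketched here as a function. It assumes you already have a `score` function that runs your own eval for a given model and a target pass rate you consider good enough; the model strings are used only as labels in this example and are not verified API identifiers.

```python
from typing import Callable


def choose_model(score: Callable[[str], float], target: float) -> str:
    """Walk the ladder from the notes and return the first acceptable choice.

    `score(name)` is assumed to return your eval's pass rate for that model.
    """
    if score("gpt-4.1") >= target:
        # GPT-4.1 works, so see whether a smaller variant is also good enough.
        for cheaper in ("gpt-4.1-nano", "gpt-4.1-mini"):
            if score(cheaper) >= target:
                return cheaper
        return "gpt-4.1"
    # GPT-4.1 is not sufficient: escalate to reasoning models in order.
    for reasoning in ("o4-mini", "o3"):
        if score(reasoning) >= target:
            return reasoning
    # Nothing off the shelf meets the bar: fall back to fine-tuning.
    return "RFT on o4-mini"
```

Trying nano before mini is a choice made for this sketch (cheapest first); the notes only say to move to mini or nano once GPT-4.1 works.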
Prompting Techniques
Effective prompting for 4.1:
Use XML or structured formats for prompts.
Tell the model to “keep going” until it solves the problem—this significantly improves performance.
The team is working to make this unnecessary in future models.
Generalization in training: The model was trained on ~12 different diff formats to avoid burning in any specific format, ensuring it works well out of the box.
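The two prompting tips above (structured XML-style sections and an explicit instruction to keep going) can be combined in a small prompt builder. This is a sketch only: the tag names and the exact wording of the persistence reminder are assumptions, not text taken from OpenAI's documentation.

```python
# Assumed wording; adjust to taste and measure the effect with your own evals.
PERSISTENCE_REMINDER = (
    "Keep going until the problem is fully solved before ending your turn."
)


def build_prompt(instructions: str, context: str, task: str) -> str:
    """Wrap each part of the prompt in its own XML-style section."""
    return (
        "<instructions>\n"
        f"{instructions}\n"
        f"{PERSISTENCE_REMINDER}\n"
        "</instructions>\n"
        "<context>\n"
        f"{context}\n"
        "</context>\n"
        "<task>\n"
        f"{task}\n"
        "</task>"
    )


if __name__ == "__main__":
    print(build_prompt(
        instructions="You are a careful coding assistant.",
        context="def add(a, b): return a - b",
        task="Fix the bug in add().",
    ))
```

The content is inserted verbatim without XML escaping, which keeps code snippets readable; that is a deliberate simplification for the sketch.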
What Sophisticated Companies Do Well
Granular evals: The best companies break their problem into subcomponents and measure model performance on each part separately (e.g., SQL table selection vs. column selection).
Modular systems: Building systems that are easy to plug different solutions into, allowing for faster iteration and tuning.
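A granular eval for the SQL example might score table selection and column selection separately, so a regression in one stage is not hidden by the other. The sketch below assumes each prediction has already been parsed into sets of table and column names; the `SqlPlan` type and the exact-match scoring are illustrative choices, not a method described in the conversation.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class SqlPlan:
    tables: FrozenSet[str]
    columns: FrozenSet[str]


def component_accuracy(predictions: List[SqlPlan],
                       references: List[SqlPlan]) -> Dict[str, float]:
    """Return exact-match accuracy for each sub-task, reported separately."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and aligned")
    n = len(references)
    table_hits = sum(p.tables == r.tables for p, r in zip(predictions, references))
    column_hits = sum(p.columns == r.columns for p, r in zip(predictions, references))
    return {"table_selection": table_hits / n, "column_selection": column_hits / n}


if __name__ == "__main__":
    refs = [
        SqlPlan(frozenset({"orders"}), frozenset({"id", "total"})),
        SqlPlan(frozenset({"users"}), frozenset({"email"})),
    ]
    preds = [
        SqlPlan(frozenset({"orders"}), frozenset({"id", "total"})),
        SqlPlan(frozenset({"users"}), frozenset({"name"})),  # right table, wrong column
    ]
    print(component_accuracy(preds, refs))
```

Because each stage has its own number, a different prompt, model, or fine-tune can be plugged in for just the stage that is failing.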
Future Research Directions
Using models to make models better: Leveraging model signals in reinforcement learning to determine if training is on the right track.
Synthetic data: An incredibly powerful trend for improving models.
Speed of iteration: Reducing the number of GPUs and time needed to run experiments, enabling faster research cycles.
Agent scaling: Deep Research and Operator were each trained deeply on specific tools, but the future is training models to be great at all kinds of tools simultaneously.
o3 already demonstrates this: it can do Deep Research-like work, but faster and more flexibly.
Combining Model Families (GPT-5)
The challenge: Combining the conversational strengths of the GPT-4o series with the reasoning capabilities of the o3 series into one model.
GPT-4o is great for chat, tone matching, and being a sounding board.
o3 is great for hard reasoning but not for casual conversation.
Zero-sum trade-offs: Tailoring a model for one use case (e.g., coding) can reduce performance on another (e.g., chat).
The goal: Train a model that is a delightful conversationalist but also knows when to reason deeply.
Personalization and Power Users
Personality: Models are becoming more personalized through memory and custom instructions.
Enhanced memory allows models to adapt to individual users over time.
Steerability (e.g., “don’t use capital letters” or “don’t ask follow-up questions”) will become more prominent.
Power users: The team focuses on power users because what they do today, median users will do a year from now.
Power users include developers, advanced ChatGPT users, and anyone pushing models to their limits.
Michelle’s Journey at OpenAI
Background: Joined OpenAI 2.5 years ago on the API engineering team, with prior experience building high-frequency trading systems at Coinbase.
Transition to research: Moved to the model side to focus on improving models for developers, starting with structured outputs.
Team evolution: Formed and now leads the Power Users Research Team, which focuses on both API developers and advanced ChatGPT users.
Organizational changes: The pace of shipping remains remarkably fast, but the organization has grown so large that it’s impossible to have context on everything happening across the company.
Quickfire
Overhyped: Benchmarks, especially agentic ones, which are often saturated or reported with best-case numbers.
Underhyped: Using your own real-world evals and usage data to measure what’s actually working.
Changed mind on: Fine-tuning—previously skeptical, now believes RFT is worth the time for pushing frontiers in specific domains.
Model progress this year: Will be about the same as last year: fast, but not a fast takeoff.
Excited about outside OpenAI: AI products that extend beyond the digital world, like Levels (health/fitness) and Whoop (health insights).
Where to learn more: OpenAI’s GPT-4.1 blog post; Michelle is on Twitter and welcomes feedback by email.