Tri Dao is a leading AI researcher whose work has been central to the dramatic drop in AI inference costs over the past few years. He co-authored FlashAttention, a key algorithm that reduced memory bottlenecks in transformer models, and co-developed Mamba, an alternative to the transformer architecture based on state space models. He currently serves as Chief Scientist at Together AI, a major AI inference provider, and is an Assistant Professor at Princeton. In this conversation, he discusses the future of AI hardware, the diversification of inference workloads, the role of AI in writing performant code, and his research directions in robotics and architecture design.
Nvidia’s Dominance and the Competitive Landscape
Nvidia dominates AI hardware (~90% of workloads) due to strong chip design and a mature software ecosystem.
Competitors like AMD, Cerebras, Groq, and SambaNova are gaining traction in specific inference niches:
Low-latency inference: Use cases such as coding assistants favor specialized chips.
High-throughput batch inference: Synthetic data generation or RL training.
Chip design requires betting on the workloads that will matter 2–3 years out. While the transformer architecture has stabilized at a high level, underlying changes (e.g., Mixture of Experts, new attention variants like DeepSeek’s Multi-Head Latent Attention) create uncertainty.
Startups must make bold bets on emerging workloads (e.g., video generation, robotics) rather than trying to match incumbents on general performance.
Inference Cost Declines: Key Drivers
Since ChatGPT’s launch, inference costs have dropped roughly 100x, driven by:
Better models per parameter: Improved architectures and training data.
Quantization: Moving from 16-bit to 8-bit or 4-bit weights (e.g., GPT-OSS uses 4-bit for most layers; at roughly 0.5 bytes per parameter, its 120B parameters fit in ~60GB).
FlashAttention: Reduced memory access bottlenecks by redesigning attention to minimize data movement between GPU high-bandwidth memory and on-chip SRAM.
Mixture of Experts (MoE): Activating only a fraction of model parameters per token (e.g., GPT-OSS uses 4 out of 128 experts per layer).
KV cache compression: Techniques like DeepSeek’s latent projection shrink the history stored for attention.
Hardware-software co-design: Closer collaboration between model designers and chip/kernel developers.
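The data-movement idea behind FlashAttention can be illustrated with a small sketch. The code below is not the actual kernel: it is a pure-Python illustration, for a single query vector, of the running ("online") softmax that lets attention be computed one block of keys and values at a time without ever storing the full list of scores. The function names and the `block_size` parameter are hypothetical, chosen for this example. The real algorithm also tiles the queries and runs as a fused GPU kernel.

```python
import math

def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax, then weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def blockwise_attention(q, keys, values, block_size=2):
    """Same result, but keys/values are visited one block at a time.

    Only a running max, a running softmax denominator, and a running
    output accumulator are kept between blocks.
    """
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (0.0 on the first block).
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        acc = [a * scale + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Illustrative data only.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [3.0, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(blockwise_attention(q, keys, values, block_size=2))
```

The two functions agree up to floating-point rounding. The blockwise version never holds more than `block_size` scores at once, which is what allows a real kernel to keep its working set in fast on-chip memory.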
Future Cost and Speed Improvements
Tri expects another 10x improvement in inference efficiency within a year, from:
Hardware: Better native support for low-precision math, improved networking for multi-chip models (~2–3x).
Model architecture: Continued sparsity (MoE), state-space models for large-batch inference (~2–3x).
Kernel optimization: Community-driven improvements in low-level GPU code (~2x).
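As a quick sanity check on how these estimates add up to roughly 10x, the ranges above can be multiplied together. The sketch assumes the three sources of improvement are independent and compound multiplicatively, which is a simplification made for this example rather than a claim from the conversation.

```python
import math

# (low, high) multipliers taken from the three items above.
gains = {
    "hardware": (2.0, 3.0),
    "model architecture": (2.0, 3.0),
    "kernel optimization": (2.0, 2.0),
}

def combined_range(gains):
    """Multiply the low ends together and the high ends together."""
    low = math.prod(lo for lo, _ in gains.values())
    high = math.prod(hi for _, hi in gains.values())
    return low, high

low, high = combined_range(gains)
print(f"combined improvement: {low:.0f}x to {high:.0f}x")
```

Under that assumption the combined range runs from 8x to 18x, which brackets the 10x figure.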
Diversifying Inference Workloads
Three primary workload patterns are emerging:
Interactive chatbots: Moderate latency needs; balance between speed and cost.
Low-latency agentic tasks: Coding assistants (e.g., Claude Code), where speed directly affects user productivity and users are willing to pay a premium.
High-throughput batch jobs: Synthetic data generation, RL rollouts; prioritize throughput over latency.
Fleet-level optimization allows providers like Together to dynamically allocate GPU resources (e.g., offering 50% discounts for batch API jobs during off-peak hours).
New workloads like real-time video generation (e.g., Pika, HeyGen) and agentic systems (tool use, database access) are creating new optimization challenges beyond pure model inference.
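The tension between the latency-sensitive and throughput-oriented workloads above can be illustrated with a toy roofline-style model of one decoding step. Every number below is hypothetical and the model is deliberately crude: it assumes a dense model whose step time is the larger of the time to read the weights once and the time to do the matrix math for the whole batch, and it ignores KV cache reads, attention cost, and networking.

```python
def decode_step_seconds(batch_size, params, bytes_per_param,
                        mem_bandwidth, peak_flops):
    """Toy estimate of the time to generate one token for every sequence."""
    weight_read = params * bytes_per_param / mem_bandwidth  # paid once per step
    compute = 2 * params * batch_size / peak_flops          # ~2 FLOPs per weight per token
    return max(weight_read, compute)

def tokens_per_second(batch_size, **hw):
    """Aggregate throughput across the whole batch."""
    return batch_size / decode_step_seconds(batch_size, **hw)

# Hypothetical accelerator and model, for illustration only.
HYPOTHETICAL = dict(
    params=8e9,           # 8B-parameter dense model
    bytes_per_param=2,    # 16-bit weights
    mem_bandwidth=3e12,   # bytes per second
    peak_flops=1e15,      # FLOP per second
)

for batch in (1, 256, 1024):
    step_ms = 1000 * decode_step_seconds(batch, **HYPOTHETICAL)
    rate = tokens_per_second(batch, **HYPOTHETICAL)
    print(f"batch={batch:5d}  step={step_ms:6.2f} ms  throughput={rate:9.0f} tok/s")
```

In this toy model, small batches are limited by reading the weights, so adding sequences raises throughput at no latency cost. Past a crossover point the step becomes compute-bound, and further batching buys throughput only by making each user wait longer per token. That is one way to see why batch jobs and interactive coding assistants want different serving configurations.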
AI-Assisted Kernel Development
Fully automatic GPU kernel generation by LLMs is still in its early stages; models struggle to produce kernel code that is both complex and correct, largely because high-quality training data is limited.
However, AI tools like Claude Code are already useful as collaborators:
Tri reports a 1.5x productivity boost using Claude Code for writing Triton kernels.
Models excel at high-level optimization suggestions and boilerplate code, while humans handle design and debugging.
Future milestones include better agentic capabilities: knowing when to consult documentation, compilers, or profilers.
Hardware Portability and Abstractions
True hardware portability is elusive; even on Nvidia hardware, each new GPU generation requires significant software rewrites due to architectural changes.
Triton is a promising abstraction layer, supporting Nvidia, AMD, and Intel GPUs, though with potential performance trade-offs.
New domain-specific languages (e.g., Mojo, ThunderKittens, Mosaic) aim to simplify performant kernel writing, but the field is still iterating rapidly.
Tri emphasizes designing abstractions that work for both humans and AI tools.
The Path to Expert-Level AI
Current models perform at median human level on data-rich tasks (e.g., front-end coding) but lag in expert domains (e.g., hardware design, law) where data is scarce and tool use is critical.
Tri’s key research question: How can AI reach expert-level capability?
This requires models to work alongside humans using specialized tools and workflows.
Architectural innovations (e.g., sparser MoE, hybrid transformer-state space models) may reduce the cost of reaching AGI/ASI by 10x.
Alternative Architectures and Robotics
Tri remains excited about Mixture of Experts (increasing sparsity) and state-space models (e.g., Mamba) for efficient long-context and large-batch inference.
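One way to see why state-space models are attractive for long contexts and large batches is to compare what each approach must remember per sequence. The sketch below is a simplified, fixed-coefficient diagonal linear recurrence, not Mamba itself (Mamba makes its parameters depend on the input). The function names, parameters, and model dimensions are hypothetical, chosen for illustration.

```python
def ssm_scan(inputs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t (elementwise), y_t = sum(c * h_t).

    The state has len(a) entries regardless of how many inputs are seen.
    """
    state = [0.0] * len(a)
    outputs = []
    for x in inputs:
        state = [ai * hi + bi * x for ai, hi, bi in zip(a, state, b)]
        outputs.append(sum(ci * hi for ci, hi in zip(c, state)))
    return outputs, state

def kv_cache_entries(num_tokens, num_layers, num_kv_heads, head_dim):
    """Values a standard attention KV cache stores: keys plus values per token."""
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim

outputs, state = ssm_scan([1.0, 0.0, 0.0, 2.0], a=[0.5, 0.9], b=[1.0, 1.0], c=[1.0, 0.5])
print("outputs:", outputs)
print("state size after 4 tokens:", len(state))

# Hypothetical transformer dimensions, for comparison only.
for tokens in (1_000, 100_000):
    print(tokens, "tokens ->", kv_cache_entries(tokens, 32, 8, 128), "cached values")
```

The recurrent state stays the same size however long the sequence gets, while the KV cache grows linearly with the number of tokens. With many sequences in flight, that per-sequence memory is often what limits batch size.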
Robotics is a new research focus:
Challenges include multi-timescale processing (fast joint control vs. slow route planning) and lack of real-world actuation data.
Potential solution: composite systems initialized from foundation models (language, vision, audio) but stitched together for real-world interaction.
Academia vs. Industry
Tri balances roles at Together (fast execution, product impact) and Princeton (longer-horizon, speculative research).
Academia explores many ideas (e.g., attention, Adam optimizer) with low immediate payoff but high long-term impact; industry exploits and scales the most promising ones.
Current venture funding is blurring this distinction, with well-funded research labs (e.g., the one started by Ilya Sutskever) pursuing long-term bets without near-term monetization.
Quickfire Insights
Changed mind: AI models are surprisingly useful for expert-level daily work (math, coding), boosting his productivity by 1.5x.
Open vs. closed source: Tri expects open-source models to close the gap with closed-source models within a year, as scaling shifts toward RL and tooling rather than pure compute.
Underhyped development: Data, especially synthetic data generation and rephrasing.
Favorite application: Viral TikTok videos generated by Pika and HeyGen using models trained and served on Together.
Where to Follow Tri Dao
Together blog: Technical posts on inference optimization.
Twitter: @tri_dao.
Personal website: tri-dao.me (occasional blog posts).