Why There Won’t be One Model, Will Hyperscalers Win Inference & AI Use-cases with PMF

Unsupervised Learning 55min 5 min #27
Why There Won’t be One Model, Will Hyperscalers Win Inference & AI Use-cases with PMF
Watch on YouTube

Summary

  • Lin Qiao, co-founder and CEO of Fireworks.ai and former leader of Meta’s PyTorch team, discusses the evolution of AI infrastructure, the rise of compound AI systems, and how Fireworks is positioning itself as a specialized inference platform for the next generation of AI applications.
    • Fireworks.ai is a generative AI platform focused on inference—delivering high-quality outputs at low latency and low cost—not just as a single-model API service, but as an orchestration layer for complex, multi-model AI workflows.
    • Lin’s background in PyTorch at Meta gives her deep insight into both model development and production-scale systems, which informs Fireworks’ approach to building infrastructure for real-world AI deployment.

The Shift from Single Models to Compound AI Systems

  • Lin argues that the future of AI lies not in monolithic models, but in compound AI systems: orchestrated workflows combining multiple specialized models, APIs, databases, and tools.
    • Single models are inherently limited by their training data, probabilistic nature, and narrow expertise—even the best models excel in some areas and fail in others.
    • Real-world applications (e.g., customer support, medical coding, financial analysis) require integrating diverse modalities (text, audio, vision), domain-specific logic, and proprietary data sources behind APIs.
    • Fireworks envisions a system where user queries are intelligently routed to the best combination of models and tools, rather than relying on one-size-fits-all inference.

Design Philosophy: Declarative Over Imperative

  • Fireworks leans toward a declarative design for AI systems, inspired by SQL and database query optimization.
    • Developers define what they want (e.g., “summarize this document with high factual accuracy”), and Fireworks handles how—selecting models, chaining steps, managing fallbacks.
    • This contrasts with imperative frameworks (like LangChain), where developers manually code every step of the workflow.
    • The goal is to hide complexity without sacrificing debuggability or control—especially critical for enterprise adoption.

Specialization Over Scale: The Rise of Small Expert Models

  • Lin believes the future belongs to hundreds of small, specialized expert models rather than a few giant general-purpose ones.
    • Training is an optimization process: models must choose which problems to prioritize, leading to uneven capabilities.
    • Smaller models fine-tuned on narrow tasks (e.g., classification, summarization, tool calling) can outperform larger models in their domain while being cheaper and faster.
    • Open-source base models (especially Llama variants) enable this ecosystem by allowing customization and post-training.
    • Enterprises increasingly want steerability and control, not just raw capability—making fine-tuning and customization essential.

Customization: Prompt Engineering First, Fine-Tuning Later

  • Fireworks sees a natural progression in how teams adopt AI:
    1. Start with prompt engineering to test feasibility and steer behavior.
    2. As prompts grow complex (thousands of lines), manageability becomes a challenge.
    3. At that point, fine-tuning absorbs the system prompt into the model itself—improving speed, cost, and consistency.
  • This shift typically happens post-product-market fit, when teams move from experimentation to scaling.
  • Fireworks is building tools to make this transition seamless, including automated fine-tuning pipelines that reduce the operational burden.

Human-in-the-Loop: The Key to Enterprise Adoption

  • Most successful GenAI applications today are human-in-the-loop, not fully autonomous.
    • Examples include AI assistants for doctors (medical scribes), teachers, customer service agents, and coders (e.g., Cursor, Sourcegraph).
    • Lin emphasizes that for production use, AI systems must be debuggable, understandable, and operable by humans—otherwise, trust and adoption suffer.
    • Fully autonomous “human-out-of-the-loop” automation remains rare due to reliability, safety, and regulatory concerns.

Evaluation: From Vibes to Rigorous Evals

  • Enterprise evaluation practices are maturing:
    • Early-stage teams rely on vibe checks (subjective quality assessment).
    • As products scale, they invest in structured evals with curated datasets to measure performance objectively.
    • AB testing is the gold standard for product impact but is slow; good eval datasets allow faster iteration.
    • Fireworks helps customers build and maintain eval datasets that reflect their specific use cases and evolving product focus.

F1: Fireworks’ Compound Reasoning System

  • F1 is Fireworks’ flagship product—a complex logical reasoning system exposed as a simple API.
    • Under the hood, it orchestrates multiple models, executes parallel and sequential function calls, and manages intermediate reasoning steps.
    • It supports advanced capabilities like multi-tool planning (e.g., searching for top cloud providers, fetching stock prices, and generating charts in one query).
    • Building F1 required solving challenges in quality control, inter-model communication, and latency optimization—comparable in complexity to database management systems.
    • Fireworks plans to GA F1 soon and expose developer-facing plugins so users can build their own reasoning systems.

Function Calling: The Orchestration Backbone

  • Function calling is critical for agentic workflows but is more complex than it appears:
    • Models must understand when and how to call tools, often in parallel or sequence, based on conversation context.
    • Tool selection among hundreds of options requires strong contextual understanding and planning.
    • Fireworks has invested over a year in tuning models for robust function calling, including support for parallel execution and complex coordination plans.
    • Demand surged after F1’s launch, with most applicants using it to build agents.

Hardware and Infrastructure Strategy

  • Fireworks abstracts hardware complexity from developers:
    • The platform dynamically routes workloads to optimal hardware (NVIDIA, AMD, etc.) based on workload characteristics (e.g., batch size, latency requirements).
    • Hardware innovation cycles have accelerated (new chips yearly), making it impractical for app teams to optimize manually.
    • Fireworks integrates with GPU clouds rather than competing with them, focusing on the inference stack above raw compute.

Hyperscalers vs. Specialized Inference Providers

  • Hyperscalers (AWS, Google, Azure) aim for vertical integration (like Apple’s iPhone), but Lin sees a role for specialized players like Fireworks.
    • Hyperscalers excel at massive-scale resource problems (data centers, power, hardware deployment).
    • Fireworks focuses on engineering craftsmanship and deep research in inference optimization, compound system orchestration, and customization.
    • Inference is not just about running models—it’s about composing them intelligently, which requires different expertise than pre-training or cloud infrastructure.

On-Device AI: Limited by Power and Capability

  • Running models locally (on desktops or mobile) is often touted for cost and privacy, but Lin is skeptical:
    • Mobile devices have strict power and thermal limits, restricting models to tiny sizes (1B–10B parameters) with limited capability.
    • Desktop offloading makes sense for some consumer apps (e.g., Zoom), but most personal data is already in the cloud, weakening the privacy argument.
    • The gap between mobile and cloud capabilities remains vast, making cloud inference necessary for high-quality AI.

Open Source and Meta’s Role

  • Meta continues to invest heavily in open-source models (Llama series) and standards (Llama Stack).
    • Lin co-designed parts of Llama Stack with Meta, providing feedback from enterprise customers.
    • Pre-training investment will continue until ROI diminishes—likely when data exhaustion hits (internet data fully crawled, synthetic data plateaued).
    • Industry focus is shifting from pre-training to post-training and inference as marginal returns on scale decrease.

Competitive Landscape: Fireworks vs. Others

  • Fireworks competes with companies like Together AI and Databricks but differentiates by:
    • Focusing on compound AI systems, not just cheap GPU access or single-model APIs.
    • Partnering with (not replacing) tools like LangChain, offering higher-level abstractions on top.
    • Avoiding the GPU cloud business—instead building an inference orchestration layer atop existing clouds.
  • The term “compound AI” was coined by UC Berkeley; multiple players are entering this space, validating its importance.

Adoption Curve: Faster Than Expected

  • Lin initially expected AI adoption to follow a sequential path: startups → digital natives → traditional enterprises.
    • Instead, all segments are adopting simultaneously, driven by GenAI’s accessibility.
    • Unlike pre-GenAI ML, which required large ML teams and custom training, GenAI lets product teams build directly on foundation models with minimal data.
    • Sales cycles are shorter, procurement is more flexible, and experimentation is rapid—even in conservative industries.

Overhyped and Underhyped

  • Overhyped: The idea that GenAI is a magical solution to all problems—users expect perfect answers from a single prompt.
  • Underhyped: The need for customization, specialization, and human oversight in real-world AI systems.

Final Thoughts

  • Fireworks offers a self-serve platform at fireworks.ai with access to hundreds of models and a playground for experimentation.
    • Lin encourages developers to share use cases and challenges directly via LinkedIn or the platform.
    • The company’s long-term vision is to make compound AI systems as easy to build and operate as traditional software.
Back to Unsupervised Learning