The 1-Year Old AI Startup That’s Rivaling OpenAI

Unsupervised Learning 40min 6 min #10

Summary

  • Mistral AI, co-founded by CEO Arthur Mensch, has rapidly become one of the most prominent open-source LLM companies, positioning itself as a credible challenger to closed-source incumbents like OpenAI, Anthropic, and Google. In this episode of Unsupervised Learning, Arthur discusses Mistral’s strategy, the future of AI policy, the next frontiers for large language models, and what it’s like building a world-class AI startup from Paris.

Origins and Brand Identity

  • The company’s name has two origin stories. The official one is a play on “IA,” the French abbreviation for “intelligence artificielle,” combined with “Mistral,” a cold wind from southern France: “Mistral AI” as a “wind of change.” The real story is that the founders struggled to pick a name and settled on Mistral because it sounded good.
  • The now-iconic logo was created hastily when someone hijacked Mistral’s original Twitter account. Arthur was at a weekend wedding, co-founder Guillaume was busy, and they threw together a WordArt-style logo as a temporary fix — it stuck and became a defining brand element.
  • Mistral’s decision to release early models via torrent was a deliberate nod to how LLaMA first spread, and it resonated strongly with the developer community.

Open-Source vs. Closed-Source Strategy

  • Arthur believes open source will ultimately prevail because LLMs are infrastructure technology that customers should be able to own and modify. Mistral currently maintains both open-source and commercial offerings, with the business model designed to sustain open-source development.
  • The best-performing models are released commercially (e.g., Mistral Small, Large, Embed), while slightly older models are open-sourced. However, Arthur emphasizes this is tactical and evolving — there’s no fixed commitment, as competitive and commercial pressures shift constantly.
  • Mistral Large is offered as a “portable” solution: available on Mistral’s cloud but also shipped with model weights to enterprise customers, giving them the same flexibility as open-source deployments.

Enterprise Platform and Partnerships

  • Mistral’s core competency is training models. Specialization and fine-tuning are growing strengths. Running inference is necessary but not their strongest suit, so they’ve built their own inference pipeline while also leveraging partnerships.
  • Recent partnerships with Microsoft Azure, Snowflake, Databricks, and Nvidia are driven by a distribution-first strategy: meet developers and enterprises where they already operate. If a developer on Azure wants Mistral, they should find it in Azure AI Studio without friction.
  • Smaller companies and digital natives tend to go directly to Mistral (getting dedicated Slack support), while larger enterprises — especially in Europe — prefer accessing Mistral through existing cloud procurement channels like Azure credits.

The Next Frontiers for LLMs

  • Efficiency: Mistral’s 7B model was already highly compressed, but Arthur believes there’s significant room to improve further. Chinchilla-style scaling laws suggest models can be much smaller for equivalent performance, and Mistral has demonstrated additional compression beyond that.
  • Scaling: Arthur doesn’t believe scaling has saturated. There’s still room to make models better with more compute, though data limitations and diminishing returns are open questions.
  • Controllability: Making models follow instructions precisely remains an active research area. Beyond pre-training compute, fine-tuning and alignment techniques can make models substantially stronger.
  • Architecture: Arthur is skeptical of claims that non-Transformer architectures will replace Transformers soon. Everything — training algorithms, debugging tools, hardware — has co-adapted to Transformers over seven years, making it extremely hard for a new architecture to catch up. Mistral has proposed improvements like sparse attention for memory efficiency, but expects the Transformer backbone to persist.
  • Speed and latency: Making models “think faster” opens up applications where LLMs are used as basic building blocks for planning and exploration, not just single-turn responses.
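
The efficiency point above can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes two widely quoted rules of thumb that are not stated in the episode: roughly 20 training tokens per parameter as the Chinchilla compute-optimal ratio, and training compute of about 6 × parameters × tokens FLOPs. The 1.4-trillion-token figure is purely hypothetical and is not Mistral's actual training budget.

```python
# Rough Chinchilla-style arithmetic. Both constants are rules of thumb
# (assumptions), not figures from the episode.
TOKENS_PER_PARAM = 20       # approximate compute-optimal tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6   # common approximation: C ~ 6 * N * D


def chinchilla_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def overtraining_ratio(n_params: float, n_tokens: float) -> float:
    """How far past the compute-optimal token count a training run goes."""
    return n_tokens / chinchilla_tokens(n_params)


if __name__ == "__main__":
    n = 7e9                       # a 7B-parameter model
    optimal = chinchilla_tokens(n)
    hypothetical = 1.4e12         # hypothetical token budget, for illustration
    print(f"compute-optimal tokens: {optimal:.2e}")
    print(f"FLOPs at optimum:       {training_flops(n, optimal):.2e}")
    print(f"overtraining ratio:     {overtraining_ratio(n, hypothetical):.1f}x")
```

Training a small model on far more tokens than the compute-optimal count costs more up front but yields a model that is cheaper to serve, which is one way to read "compression beyond Chinchilla."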

Compute and Competition with Meta

  • Meta’s announcement of 600,000 GPUs dwarfs Mistral’s compute, but Arthur argues that efficiency matters more than raw scale. Mistral has achieved state-of-the-art results with roughly 1,500 H100s, with plans to scale up in coming months.
  • The key challenge is unit economics: ensuring every dollar spent on training compute generates more than a dollar in revenue. Being efficient with training compute is essential for a viable business model.
  • On catching up to ChatGPT: Arthur is cautious about timelines, noting many unknowns around saturation and data. But he’s confident Mistral can push efficiency further, having already demonstrated significant compression beyond Chinchilla-optimal scaling.

AI Regulation and the EU AI Act

  • Arthur’s core position is that AI safety should be addressed through a product safety lens, not a technology regulation lens. LLMs are like programming languages — general-purpose tools that can be used for anything — so regulating the model itself doesn’t ensure the products built on top are safe.
  • The EU AI Act introduced technology-level regulation (e.g., FLOP thresholds that trigger mandatory evaluations) as a result of lobbying, which Arthur sees as a misdirected burden. Mistral already evaluates and documents its models, so compliance is manageable, but it doesn’t solve the real problem.
  • The harder problem is building tools for application-level evaluation: helping developers verify that their AI products perform as expected. This is a technological and product challenge, not primarily a regulatory one.
  • Arthur supports pressuring application makers to verify their tasks are well-solved (analogous to crash testing for cars), which creates second-order pressure on model makers to provide controllable, evaluable models.
  • He warns that regulating the technology layer directly favors big players who can send armies of lawyers to capture regulators, creating unhealthy competition.
  • On training data transparency: Arthur supports the principle but cautions that training data is competitive “secret sauce,” so disclosure requirements need to protect trade secrets.
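
As a rough illustration of the application-level evaluation described above, here is a minimal sketch of a "crash test" for an AI product: a set of task-specific cases, a pass rate, and a release gate. Every name in it (`Case`, `pass_rate`, `release_gate`, the 0.95 threshold) is hypothetical and chosen for illustration; it does not reflect any Mistral tool or any regulatory requirement.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Case:
    """One application-level test: a prompt and a check on the output."""
    prompt: str
    check: Callable[[str], bool]


def pass_rate(app: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Fraction of cases for which the application's output passes its check."""
    if not cases:
        raise ValueError("at least one test case is required")
    passed = sum(1 for case in cases if case.check(app(case.prompt)))
    return passed / len(cases)


def release_gate(app: Callable[[str], str], cases: Sequence[Case],
                 threshold: float = 0.95) -> bool:
    """True if the application meets the (hypothetical) pass-rate threshold."""
    return pass_rate(app, cases) >= threshold


if __name__ == "__main__":
    # Stand-in for a real product; a real app would call a model here.
    def stub_app(prompt: str) -> str:
        return "Paris" if "France" in prompt else "I am not sure."

    cases = [
        Case("What is the capital of France?", lambda out: "Paris" in out),
        Case("What is the capital of Peru?", lambda out: "Lima" in out),
    ]
    print("pass rate:", pass_rate(stub_app, cases))
    print("release allowed:", release_gate(stub_app, cases))
```

The checks are written by the application maker for their own task, which is the point of the product-safety framing: the test targets the product's behavior, not the underlying model in the abstract.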

Foundation Models Around the World

  • Arthur sees the proliferation of national LLMs (India, Japan, etc.) as more political than technological. The optimal outcome is a few global companies providing portable, multilingual models that any country can deploy and modify.
  • Language is a key frontier: current models are far better in English than other languages. Mistral is prioritizing multilingual capability, starting with French (where they claim best-in-class performance), with plans to expand globally.
  • Sovereignty concerns are real but solvable: if companies like Mistral ship weights and allow local deployment, countries can control the technology. The problem arises only if a few companies offer only SaaS APIs, preventing local control.

Starting Mistral and Early Confidence

  • Arthur, Timothée Lacroix, and Guillaume Lample came from leading industry labs: Arthur from DeepMind, where he worked on Chinchilla, and Timothée and Guillaume from Meta, where they worked on LLaMA. That experience gave them confidence they could ship world-class models quickly.
  • Paris offered a strong talent pool, and the rise of French AI awareness (partly driven by projects like BLOOM) helped attract VC interest. They secured early team members and moved fast.

Le Chat and Application Strategy

  • Mistral released Le Chat and an enterprise assistant (“Entreprise”) as entry points for organizations unsure how to get started with generative AI. The strategy is to show value through productivity gains before enterprises figure out their core AI strategy.
  • These applications also serve to solidify the underlying developer platform APIs (e.g., moderation tools) and gather end-user feedback. The platform remains the primary product.

Data and Fine-Tuning in Enterprises

  • Arthur advises enterprises not to fine-tune on all their data immediately. Instead, they should start with retrieval-augmented generation (RAG), connecting models to databases and tools.
  • The most valuable data for fine-tuning is demonstration data — traces of what users actually do — which many enterprises don’t yet have. This represents a new data strategy challenge, and companies that acquire this data quickly will have an advantage regardless of their existing data moat.
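
To make the "start with RAG" advice concrete, here is a minimal sketch of the retrieval step, assuming a tiny in-memory document list and simple bag-of-words cosine similarity. Real systems typically use embedding models and a vector store instead; the function names and prompt wording are illustrative assumptions, and the call to an actual model is left out.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(documents,
                    key=lambda doc: cosine(q, Counter(tokenize(doc))),
                    reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Assemble a prompt that grounds the model in the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")


if __name__ == "__main__":
    docs = [
        "Refunds are processed within 14 days of the return.",
        "Our office is closed on public holidays.",
        "Shipping to Europe takes 3 to 5 business days.",
    ]
    print(build_prompt("How long do refunds take?", docs, k=1))
```

Nothing here changes model weights: the enterprise's data stays in its own store and is injected at query time, which is why this is a lower-risk first step than fine-tuning.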

Quick Takes

  • Overhyped: Synthetic data — Arthur questions what it even means, as the term is poorly defined.
  • Underhyped: Optimization techniques for training efficiency.
  • Biggest surprise: Mistral gained attention much faster than expected, creating prioritization challenges from overwhelming inbound interest.
  • Grok’s latest model: At 314 billion parameters, Arthur thinks it’s too big for its performance level. It does not yet sit on the Pareto frontier between size and performance, though that may improve.
  • Most exciting AI startups: Dust, a Paris-based knowledge management startup with a sleek UI.
  • If not building models: Arthur would build a foundation model for material science — for example, accelerating the synthesis of ammonia, a carbon-intensive process. He sees material science as lacking a foundational model and ripe for exploration-driven AI.

Debrief Highlights

  • Jacob and Jordan reflect on Mistral’s remarkable momentum, Arthur’s humility and speed of decision-making, and the broader trend of domain-specific foundation models (biology, material science, robotics).
  • They note the recurring theme that evaluation remains the unsolved core problem in AI safety, and appreciate Arthur’s flexible, transparent approach to platform build-vs-partner decisions.
  • The Paris AI ecosystem, anchored by Mistral, is producing world-class research and startups from a remarkably small, concentrated team.