The Company Leading the AI Music Revolution

Unsupervised Learning 1h2 4 min #13

Summary

  • Mikey Schulman, CEO of Suno, discusses the AI music platform that has attracted over 10 million users generating songs and recently raised $125 million in funding. The conversation covers Suno’s product vision, the social and creative potential of AI-generated music, infrastructure challenges at scale, and the broader future of audio in AI.

Suno’s core product and user experience

  • Suno lets anyone generate songs from text prompts in seconds, with two main user types emerging:
    • Casual users who “soundtrack their life” — making songs about everyday moments like a Starbucks name mix-up or unexpected visitors, treating music as a form of storytelling and communication
    • Power users who spend hours crafting specific sounds and stories, treating the platform as a creative outlet to get the music in their heads into reality
  • The platform’s speed is a deliberate design priority: songs stream while they are still being generated, which the autoregressive Transformer architecture makes possible, and every additional 100 ms of latency increases the chance that a user clicks away
  • Suno uses Modal for GPU infrastructure deployment and benefits from open-source tooling (like continuous batching) originally developed for text and image models
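
The streaming behavior described above can be illustrated with a minimal sketch. It is not Suno’s implementation: `generate_tokens` is a hypothetical stand-in for an autoregressive model, and `stream_chunks` and the chunk size are assumptions made for illustration. The point is only that an autoregressive generator produces output in order, so early chunks can be handed to playback before the rest of the song exists.

```python
import random
from typing import Iterator, List


def generate_tokens(prompt: str, n_tokens: int, seed: int = 0) -> Iterator[int]:
    """Hypothetical stand-in for an autoregressive model.

    Each token is computed from the tokens before it, so tokens can only
    be produced one at a time, in order.
    """
    rng = random.Random(seed)
    history: List[int] = [len(prompt)]
    for _ in range(n_tokens):
        # A real model would run a Transformer forward pass over `history` here.
        token = (history[-1] * 31 + rng.randrange(1024)) % 1024
        history.append(token)
        yield token


def stream_chunks(tokens: Iterator[int], chunk_size: int) -> Iterator[List[int]]:
    """Group tokens into chunks and emit each chunk as soon as it is full."""
    chunk: List[int] = []
    for token in tokens:
        chunk.append(token)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final, possibly shorter, chunk
        yield chunk


if __name__ == "__main__":
    tokens = generate_tokens("a song about driving a Zamboni", n_tokens=10)
    for i, chunk in enumerate(stream_chunks(tokens, chunk_size=4)):
        print(f"chunk {i}: {len(chunk)} tokens ready for playback")
```

Because both functions are generators, the first chunk is available after only `chunk_size` tokens have been produced, which is what keeps the time to first sound short regardless of total song length.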

The social and multiplayer future

  • Schulman sees music as a conversation, and envisions Suno evolving into a multiplayer experience:
    • Synchronous jam sessions where users riff on each other’s musical ideas in real time, similar to how friends jam together regardless of skill level
    • Asynchronous collaboration where users pass half-finished songs or musical ideas back and forth
  • A Twitch streamer recently drew a football-stadium-sized audience by streaming himself creating Suno songs live, with viewers able to micro-pay to interact — an early signal of the live, social potential
  • Schulman’s personal favorite Suno creations are songs he made with his three-year-old son (including one about the son driving a Zamboni), because the joy is in the shared creative process, not just the final product

Product philosophy and the blank canvas problem

  • Suno still struggles with the “blank canvas” problem — new users often don’t know what to prompt
  • The solution is not better prompting guidance but rather more intuitive entry points: humming a melody, uploading a photo, capturing ambient sounds, or mood-boarding — moving beyond text as the sole interface
  • A Valentine’s Day feature demonstrated the principle: giving people an obvious reason to make a song (for a loved one) and guiding them through the process lowers the barrier to entry
  • Schulman believes music should be treated as a “first class citizen” of communication, on par with text messages and photos

Model evaluation and improvement

  • Evaluating music models is harder than text because there are no objective benchmarks — quality is inherently subjective
  • Suno relies on a combination of:
    • Automatic metrics for audio quality (acknowledged as flawed)
    • Human evaluators who deeply love music and can make aesthetic judgments
    • Implicit user signals (which model users choose when given options, how much they use it)
    • Explicit feedback from an active Discord community of 300,000+ users
  • Model fixes are case-dependent — for example, one model had unreliable song endings with overly long outros, which was diagnosable and fixable, but many issues require manual data inspection
  • Key areas for improvement include iterative control (letting users say “do that but change X”) and precise attribute control like BPM
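
One way to turn the implicit signal mentioned above (which model users choose when given options) into a number is a simple win rate. The sketch below is an assumption about how such a signal could be summarized, not a description of Suno’s evaluation pipeline; the function name, the record format, and the model labels are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def win_rates(choices: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Summarize side-by-side choices as a per-model win rate.

    Each record is (model_a, model_b, chosen), where `chosen` is the model
    whose output the user picked. The win rate is the fraction of the
    comparisons a model appeared in that it won.
    """
    shown: Counter = Counter()
    won: Counter = Counter()
    for model_a, model_b, chosen in choices:
        if chosen not in (model_a, model_b):
            raise ValueError(f"{chosen!r} was not one of the options shown")
        shown[model_a] += 1
        shown[model_b] += 1
        won[chosen] += 1
    return {model: won[model] / shown[model] for model in shown}


if __name__ == "__main__":
    # Hypothetical records, invented purely for illustration.
    records = [
        ("model_old", "model_new", "model_new"),
        ("model_old", "model_new", "model_new"),
        ("model_old", "model_new", "model_old"),
        ("model_old", "model_new", "model_new"),
    ]
    print(win_rates(records))
```

A win rate like this only captures preference between the options shown; it says nothing about why one output was preferred, which is where the human evaluators and community feedback listed above come in.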

Pricing and business model

  • Suno currently offers a free tier with limited songs and paid tiers for heavier usage, but Schulman is deliberately not innovating on pricing yet
  • His reasoning:
    • The market is too early; how people will enjoy and pay for AI music is still unknown
    • SaaS pricing models don’t fit well because AI music has real marginal compute costs (unlike traditional SaaS)
    • He partially blames VCs for importing SaaS-era pricing expectations into a fundamentally different cost structure
    • He expects pricing to look different across music, text, and video

Audio as a first-class AI modality

  • Schulman is excited that audio is gaining recognition as a primary interface for AI, not just an add-on to text-based LLMs
  • Current audio models (including GPT-4o and ElevenLabs) are impressive but still function primarily as interfaces into text-based reasoning layers
  • He believes true multimodal integration — where audio is natively part of the reasoning — is further away than people think, which means specialized audio models will coexist with general-purpose ones for longer
  • On consolidation, he expects many specialized models to persist, both because applications are so diverse (robotics, healthcare, chip design, etc.) and because each major tech company needs its own offering

The $125 million fundraise

  • The capital is being deployed to accelerate Suno’s vision of the future of music, not allocated to a single initiative
  • Key uses include:
    • Training larger, more capable music models (music modeling is far less settled than text — the right architecture isn’t even known yet)
    • Research into specialized data and control axes
    • Hiring top talent
  • Schulman notes that music models won’t require the same compute as the largest text models in the near term, but research costs, specialized data, and control mechanisms make training increasingly expensive

IP, artist partnerships, and the competitive landscape

  • Suno is working with the music industry and some artists, but Schulman has deliberately avoided “artist partnership” features that let users generate songs in the style of specific artists (e.g., new Charlie Puth songs)
  • His view: these are viral moments that fade quickly, analogous to early GPT users writing Shakespeare sonnets — fun once, not the lasting use case
  • He sees the market as large enough for many players:
    • Professional artist tools (not Suno’s focus)
    • Background music for content creators (a huge market given YouTube’s scale)
    • Consumer experiences for average people (Suno’s focus)
  • He draws a Spotify vs. Napster analogy: some companies will work with the industry, others against it

Broader reflections

  • Open source is overhyped (Schulman’s hot take): the compute barriers to state-of-the-art models make it hard for open source to keep up without a sustainable business model, though Meta’s resources show it’s possible with sufficient financial backing
  • Music is underhyped: even outside AI, music is underused in people’s daily lives relative to its potential
  • Biggest surprise: how much users want to feel proud of what they created — evidenced by users editing their song titles to include their names when songs hit the trending page
  • Biggest mistake: underestimating how quickly users would prefer a web app over Discord; within five days of launching a thin web app, 90% of usage migrated there, because Discord is a messaging platform, not a music experience platform
  • Schulman’s personal dream product: a Vision Pro app where he can play air guitar with a band, with the music responding in real time to his movements — essentially turning music creation into an intuitive, game-like experience