The Company Leading the AI Music Revolution — Unsupervised Learning

Mikey Schulman, CEO of Suno, discusses the AI music platform, which has attracted more than 10 million users and recently raised $125 million in funding. The conversation covers Suno’s product vision, the social and creative potential of AI-generated music, infrastructure challenges at scale, and the broader future of audio in AI.

Suno lets anyone generate songs from text prompts in seconds, with two main user types emerging:
- Casual users who “soundtrack their life” — making songs about everyday moments like a Starbucks name mix-up or unexpected visitors, treating music as a form of storytelling and communication
- Power users who spend hours crafting specific sounds and stories, treating the platform as a creative outlet to get the music in their heads into reality
The platform’s speed is a deliberate design priority: songs start streaming while they are still being generated (which the autoregressive Transformer architecture makes possible), and every additional 100 ms of latency increases the chance that a user clicks away
Suno uses Modal for GPU infrastructure deployment and benefits from open-source tooling (like continuous batching) originally developed for text and image models
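
The streaming behavior described above can be illustrated with a minimal sketch. It is not Suno's implementation: the function names, chunk sizes, and the toy next-token rule are all assumptions made for illustration. It only shows why an autoregressive generator can hand the first chunk of audio to a listener before the rest of the song exists.

```python
import random
from typing import Iterator, List


def generate_chunks(prompt: str, n_chunks: int = 8, chunk_size: int = 4,
                    seed: int = 0) -> Iterator[List[int]]:
    """Hypothetical stand-in for an autoregressive audio model.

    Each new token depends only on tokens already produced, so a chunk
    can be yielded as soon as it is complete.
    """
    rng = random.Random(seed)
    history: List[int] = [len(prompt) % 256]  # toy conditioning on the prompt
    for _chunk_index in range(n_chunks):
        chunk: List[int] = []
        for _token_index in range(chunk_size):
            # Toy next-token rule (an assumption, not a real model):
            # the next token is a function of the previous one plus noise.
            next_token = (history[-1] * 31 + rng.randrange(256)) % 256
            history.append(next_token)
            chunk.append(next_token)
        yield chunk  # available to the listener before later chunks exist


def play_streaming(prompt: str) -> List[int]:
    """Consume chunks as they arrive; a real client would start playback here."""
    played: List[int] = []
    for chunk in generate_chunks(prompt):
        played.extend(chunk)
    return played
```

Calling `next(generate_chunks("a song about rain"))` returns the first chunk without computing the remaining ones, which is the property that keeps time-to-first-audio low.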

Schulman sees music as a conversation, and envisions Suno evolving into a multiplayer experience:
- Synchronous jam sessions where users riff on each other’s musical ideas in real time, similar to how friends jam together regardless of skill level
- Asynchronous collaboration where users pass half-finished songs or musical ideas back and forth
A Twitch streamer recently drew a football-stadium-sized audience by streaming himself creating Suno songs live, with viewers able to micro-pay to interact — an early signal of the live, social potential
Schulman’s personal favorite Suno creations are songs he made with his three-year-old son (including one about the son driving a Zamboni), because the joy is in the shared creative process, not just the final product

Suno still struggles with the “blank canvas” problem — new users often don’t know what to prompt
The solution is not better prompting guidance but rather more intuitive entry points: humming a melody, uploading a photo, capturing ambient sounds, or mood-boarding — moving beyond text as the sole interface
A Valentine’s Day feature demonstrated the principle: giving people an obvious reason to make a song (for a loved one) and guiding them through the process lowers the barrier to entry
Schulman believes music should be treated as a “first class citizen” of communication, on par with text messages and photos

Evaluating music models is harder than evaluating text models because there are no objective benchmarks; quality is inherently subjective
Suno relies on a combination of:
- Automatic metrics for audio quality (acknowledged as flawed)
- Human evaluators who deeply love music and can make aesthetic judgments
- Implicit user signals (which model users choose when given options, how much they use it)
- Explicit feedback from an active Discord community of 300,000+ users
Model fixes are case-dependent — for example, one model had unreliable song endings with overly long outros, which was diagnosable and fixable, but many issues require manual data inspection
Key areas for improvement include iterative control (letting users say “do that but change X”) and precise attribute control like BPM
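
One of the implicit signals mentioned above, which model users choose when given options, can be summarized as a per-model win rate. The sketch below is an assumption about how such a log might be tallied, not a description of Suno's pipeline; the record layout and the model names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def win_rates(choices: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute how often each model was picked when it was offered.

    Each record is (model_a, model_b, chosen), a hypothetical log format
    where `chosen` is the model whose output the user kept.
    """
    shown: Counter = Counter()
    won: Counter = Counter()
    for model_a, model_b, chosen in choices:
        if chosen not in (model_a, model_b):
            raise ValueError(f"chosen model {chosen!r} was not offered")
        shown[model_a] += 1
        shown[model_b] += 1
        won[chosen] += 1
    # Fraction of appearances in which each model was the one selected.
    return {model: won[model] / shown[model] for model in shown}
```

For example, a log in which a hypothetical "v2" is chosen over "v1" in three of four comparisons gives "v2" a win rate of 0.75 and "v1" a win rate of 0.25. A tally like this says nothing about why users preferred one output, which is why it sits alongside human evaluators and explicit feedback rather than replacing them.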

Suno currently offers a free tier with limited songs and paid tiers for heavier usage, but Schulman is deliberately not innovating on pricing yet
His reasoning:
- The market is too early; how people will enjoy and pay for AI music is still unknown
- SaaS pricing models don’t fit well because AI music has real marginal compute costs (unlike traditional SaaS)
- He partially blames VCs for importing SaaS-era pricing expectations into a fundamentally different cost structure
- He expects pricing to look different across music, text, and video

Schulman is excited that audio is gaining recognition as a primary interface for AI, not just an add-on to text-based LLMs
Current audio models (including GPT-4o and ElevenLabs) are impressive but still function primarily as interfaces into text-based reasoning layers
He believes true multimodal integration — where audio is natively part of the reasoning — is further away than people think, which means specialized audio models will coexist with general-purpose ones for longer
On the question of consolidation: he expects many specialized models to persist because applications are so diverse (robotics, healthcare, chip design, etc.) and because major tech companies each need their own offerings

The $125 million raise is being deployed to accelerate Suno’s vision of the future of music, not allocated to a single initiative
Key uses include:
- Training larger, more capable music models (music modeling is far less settled than text — the right architecture isn’t even known yet)
- Research into specialized data and control axes
- Hiring top talent
Schulman notes that music models won’t require the same compute as the largest text models in the near term, but research costs, specialized data, and control mechanisms make training increasingly expensive

Suno is working with the music industry and some artists, but Schulman has deliberately avoided “artist partnership” features that let users generate songs in the style of specific artists (e.g., new Charlie Puth songs)
His view: these are viral moments that fade quickly, analogous to early GPT users writing Shakespeare sonnets — fun once, not the lasting use case
He sees the market as large enough for many players:
- Professional artist tools (not Suno’s focus)
- Background music for content creators (a huge market given YouTube’s scale)
- Consumer experiences for average people (Suno’s focus)
He draws a Spotify vs. Napster analogy: some companies will work with the industry, others against it

Open source is overhyped (Schulman’s hot take): the compute barriers to state-of-the-art models make it hard for open source to keep up without a sustainable business model, though Meta’s resources show it’s possible with sufficient financial backing
Music is underhyped: even outside AI, music is underused in people’s daily lives relative to its potential
Biggest surprise: how much users want to feel proud of what they created — evidenced by users editing their song titles to include their names when songs hit the trending page
Biggest mistake: underestimating how quickly users would prefer a web app over Discord; within five days of launching a thin web app, 90% of usage migrated there, because Discord is a messaging platform, not a music experience platform
Schulman’s personal dream product: a Vision Pro app where he can play air guitar with a band, with the music responding in real time to his movements — essentially turning music creation into an intuitive, game-like experience
