The AI English Tutor Taking the World by Storm — Unsupervised Learning

Connor Zwick, CEO and cofounder of Speak, joins the podcast to discuss building an AI-powered English fluency platform that has grown to over 10 million users across 40+ countries since launching in South Korea in 2019. Speak recently raised at a $500 million valuation with backing from OpenAI. The conversation covers Speak’s product evolution, specialized AI models, defensibility, the future of audio-first interfaces, and how AI will transform education broadly.

Connor’s Background and Path to Speak

In high school, Connor built a flashcard app with millions of users that was eventually sold, giving him early exposure to the challenge of structuring knowledge for learning.
He later audited a Berkeley AI course around 2015, becoming convinced that deep learning models would improve dramatically over time.
Early ideas explored computer vision applications (automated meter maids, custom clothing measurement, weather prediction), but he and cofounder Andrew were ultimately drawn to speech because it felt like building a relationship with technology — a persona behind the interface.
The original flashcard app had hundreds of millions of linked knowledge pairs, which Connor imagined could become an “omniscient tutor” — a vision he says he’s now pursuing with Speak.

What Speak Does

Speak is a full fluency solution focused on helping people speak and have real conversations in another language, as opposed to grammar-focused or vocabulary-memorization approaches.
The pedagogy centers on teaching high-frequency word chunks that appear together in everyday speech, having users repeat them until automatic, then practicing in simulated conversations tied to the learner’s personal motivation (e.g., traveling to Mexico City).
Everything is individualized to the user’s goals, interests, and level.

Product Evolution and Long-Term Orientation

From the beginning, Connor and team took a long-term view: they expected models to improve over 5–10 years and eventually surpass humans on key tasks, so they made product decisions aligned with that north star rather than optimizing for short-term capabilities.
The early unlocks came from accurate speech recognition, which let the app reliably understand what learners were saying; the team then added phoneme recognition, basic language understanding, and progressively more sophisticated capabilities.
This step-by-step climb allowed Speak to build a significant head start in AI-based learning.

Specialized Models vs. General Foundation Models

Speak builds its own in-house models for specific tasks where general-purpose models fall short:
- Accented speech recognition: Understanding non-native speakers and detecting specific pronunciation mistakes, optimized for low latency and reliability.
- Phoneme recognition: Detecting learner-specific pronunciation and prosodic errors using proprietary training data.
Connor believes general foundation models will eventually subsume these tasks, but in the short to medium term (several years), building specialized models is a worthwhile investment that enables the product experience and generates valuable data.
He draws an analogy to Apple in the 1980s using Intel processors while building valuable technology on top (the OS, firmware) — Speak’s equivalent is what they call “ML scaffolding”: orchestrating models, managing data pipelines, evaluation loops, inference infrastructure, and building representations of language proficiency.
At least 50% of Speak’s product development effort currently touches these ML scaffolding systems, which Connor considers a bigger long-term technological moat than the models themselves.

Evaluation as a Core Competency

Connor considers evaluation one of the most underrated and difficult aspects of building AI products.
For speech tasks, it’s not just word error rate — it involves nuanced judgments like whether a word is unintelligible to a human, whether the model should catch specific learner mistakes, and calibrating model performance to human-level understanding.
A good evaluation framework essentially distills the problem you’re optimizing for, which then makes the optimization straightforward.
When new models like GPT-4o are released, Speak runs them against dozens of internal eval loops across major tasks, plus human evaluations, all distilled into a playbook to avoid organizational thrash.
They also A/B test with real users, tracking guardrail metrics to see if new models actually improve the product.
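The point about word error rate can be made concrete with a small sketch. The function below is the standard word-level edit-distance definition of WER, not Speak's internal metric, and the example sentences are invented for illustration. It shows why WER alone is ambiguous for a tutoring product: a recognizer that silently "autocorrects" a learner's mistake scores perfectly against the intended sentence, yet it hides the very error the app should catch.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the learner makes a grammar mistake,
# and the recognizer outputs the corrected sentence instead.
verbatim = "i goed to the store"    # what the learner actually said
intended = "i went to the store"    # what they meant to say
recognized = "i went to the store"  # recognizer output

wer_vs_intended = word_error_rate(intended, recognized)
wer_vs_verbatim = word_error_rate(verbatim, recognized)
print(wer_vs_intended, wer_vs_verbatim)
```

Scored against the intended sentence the recognizer looks flawless, while scored against the verbatim utterance it has one substitution in five words. Which reference is "right" depends on what the product is trying to do, which is the kind of judgment an evaluation framework has to encode.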

User Experience and Interface Design

Speak’s design philosophy minimizes user education — any tooltip or explanation signals insufficient design.
The onboarding experience is intentionally minimal: a single microphone button and the question “Why are you learning English?” — users just start talking, which is unfamiliar but powerful.
Familiarity with AI interfaces (via ChatGPT) has meaningfully shifted user comfort with these paradigms.
Connor envisions a hybrid future where users fluidly switch between speaking, typing, and tapping depending on context — speech isn’t always better, but it’s better some of the time.
He’s also excited about proactive/push interfaces: GPUs thinking about users in the background, running analysis overnight, and surfacing personalized insights before the user even opens the app.

Curriculum: Structured Paths and Individualized Learning

Connor resists framing curriculum as either rigidly structured or fully individualized — he thinks both can coexist.
There is a scientifically sound sequence to language learning (e.g., the 100 most common words appear 20% of the time, the top 500 cover 80%), but the specific ordering and selection within those tiers can be personalized.
Humans will remain in the loop for high-level curriculum strategy and artistic creation, but increasingly ML teams are taking over execution, requiring deep cross-functional understanding.
He cites Neal Stephenson’s The Diamond Age as an inspiration: the novel centers on an AI-powered book that takes a unique, creative direction for each individual learner.
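The frequency-coverage idea mentioned in this section (a small set of common words accounts for a large share of everyday language) can be measured directly on any text. The sketch below is a minimal illustration; the function name and the toy sentence are assumptions made for the example, and a toy corpus will not reproduce the percentages quoted in the episode.

```python
from collections import Counter


def coverage_of_top_n(tokens: list[str], n: int) -> float:
    """Fraction of all tokens accounted for by the n most frequent words."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    # most_common(n) returns up to n (word, count) pairs, highest count first.
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)


# Toy corpus, invented for illustration only.
tokens = "the cat sat on the mat the end".split()

for n in (1, 2, 6):
    print(n, coverage_of_top_n(tokens, n))
```

Run on a large corpus of real conversational text, the same function gives the coverage curve a curriculum designer could use to decide which tiers of vocabulary to teach first, leaving the ordering within a tier free to be personalized.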

Pricing Strategy

Speak doesn’t have a free tier and doesn’t feel constrained by model costs at its current subscription price point.
Connor thinks about pricing in terms of two extremes:
- Radical accessibility: Software’s near-zero marginal cost means Speak could serve hundreds of millions of people.
- Premium pricing: Millions of consumers already pay hundreds of dollars per month for offline tutoring or classes, so there’s room to charge dramatically more for a truly differentiated experience.
Pricing is counterintuitive and always in flux.

Defensibility and the Duolingo Question

Connor argues AI is a sustaining innovation for incumbents only if they’re solving the same problem, but Speak and Duolingo solve fundamentally different problems.
Duolingo’s core users are primarily native English speakers in the US/UK/Australia learning casually; many weren’t learning a language before Duolingo. It’s a brain-training-style app.
Speak’s users are typically non-native English speakers who’ve spent 10+ years trying to achieve conversational fluency but lack access to human conversation partners.
AI clearly helps Speak’s use case far more than Duolingo’s casual learning experience.
On real-time translation potentially obviating language learning: Connor argues that even the best translation has inherent latency (word order differences across languages) and imperfection, and more fundamentally, his users want human connection — not just information transfer. Translation solves tourist needs but not the deeper motivation for fluency.
He sees a rising tide effect: more people using ChatGPT for language learning raises awareness and drives users toward specialized solutions like Speak for serious learning.

Audio-First Applications and GPT-4o

Connor is extremely excited about multimodal audio (speech-to-speech) as a holy grail for language learning — a single continuous model that understands nuance, tone, emotion, confidence, and mistakes without the lossy intermediate step of converting speech to text and back.
He still sees room for audio-only or specialized speech models for niche use cases (security, on-premise, unusual vocabularies) that large general-purpose models won’t serve well.
He doesn’t track the broader startup ecosystem closely but notes that OpenAI Startup Fund portfolio companies have incredible deal flow.

Expansion Beyond Language Learning

Speak is focused on language learning for the foreseeable future but sees three major sectors long-term:
- Schools: People spend enormous time learning in schools.
- Businesses/professional skills: English speaking is a certifiable professional skill in many markets; Speak is building an enterprise version for companies like Samsung and SK in South Korea.
- Personal learning: Connor believes this will be one of the biggest shifts in human activity — reading books, listening to podcasts, watching YouTube, reading articles are all learning-adjacent behaviors that people don’t currently categorize as “learning.” He compares it to how search engines emerged as a recognized behavior category.
He envisions a highly individualized personal learning companion, one with long-term memory and knowledge of your interests and personality, that proactively provides the right information, similar to the AI in the movie Her.

The Broader Impact on Education

Connor believes education is one of the biggest and most exciting areas for AI-driven change.
Despite Chromebooks in classrooms and digital tools, the fundamental quality of education hasn’t changed much: the same activities have simply moved to screens, with quizzes taken on laptops instead of paper, video lectures instead of live ones, and digital flashcards instead of physical ones.
The best way to learn 2,000 years ago (one-on-one tutoring, like Aristotle teaching Alexander the Great) is still the best way, and AI will finally make that accessible at scale.
He expects the pattern common in technology: overhyped in the short term with disappointments, but over a decade, more change than anyone predicts.
He has a concern that the research community may be over-obsessed with the Transformer architecture and hopes people are investing in alternative approaches.

Other Subjects Beyond Language

Connor thinks language learning has a unique advantage: the status quo (classroom model with 1 teacher and 30 students) is particularly bad for it, so the bar for AI to be dramatically better is lower.
For subjects like math, existing solutions work better, so AI has to clear a higher bar to be dramatically better, and adoption is harder because you need to sell to schools or parents.
He’s not convinced we need fundamentally new technology to teach other subjects well — it may be more of a product and market problem than a technology problem.

Quick Fire Round

Overhyped: Everything. There is a lot of funding without real product-market fit or genuine usage.
Underhyped: Non-Transformer research directions.
Biggest surprise in building AI features: New technology is never as good as you think and never a panacea; changing user behavior is extremely hard. Open-ended lessons with GPT-4-level transcription were good but not a game changer.
Changed his mind on: Initially thought they’d do all modeling themselves; now recognizes some models cost hundreds of millions to build and alternatives may be needed.
Speak’s website: speak.com. The company is hiring across all roles at speak.com/careers.
