Mercor CEO: Evals Will Replace Knowledge Work, AI x Hiring Today & the Future of Data Labeling

Unsupervised Learning 44min 5 min #44
Mercor CEO: Evals Will Replace Knowledge Work, AI x Hiring Today & the Future of Data Labeling
Watch on YouTube

Summary

  • Brendan Foody, co-founder and CEO of Mercor, discusses how AI is transforming talent evaluation, data labeling, and the future of knowledge work. Mercor builds infrastructure for AI-native labor markets, already used by top AI labs to label data, screen candidates, and evaluate both human and AI performance. The company recently raised $100 million and sits at the intersection of recruiting, evaluation, and foundation model improvement.

AI in Talent Evaluation

  • AI is approaching superhuman performance in evaluating candidates through text-based signals like resumes, interview transcripts, and written assessments.
    • Earlier models (pre-2023) struggled with hallucinations and inconsistency, but reasoning models have dramatically improved context handling and judgment depth.
    • Multimodal evaluation (e.g., detecting passion or authenticity in video) remains harder due to limited focus from labs and challenges with reinforcement learning (RL) on video data.
  • Humans still outperform models on nuanced “vibe checks”—assessing cultural fit, passion, or genuineness—but even these are being partially captured via eval development.
  • AI enables far deeper preparation for interviews (e.g., analyzing a candidate’s podcast, papers, or blog posts), making assessments harder to game and more personalized.

How AI and Humans Collaborate in Hiring

  • The hiring process splits into two functions: assessment (predicting performance) and selling (convincing candidates to join).
    • AI will soon dominate assessment, making human-only evaluation obsolete for predictive accuracy.
    • Humans will remain critical in the selling phase—building relationships, explaining roles, and fostering excitement.
  • AI allows recruiters to focus only on top-tier candidates, eliminating time wasted on unqualified applicants.
  • To prevent gaming, assessments must be dynamic—either changing problems frequently or probing deep into specific background details.

Explainability and Trust in AI Predictions

  • Explainability matters for two reasons:
    1. Building customer trust through transparent reasoning chains.
    2. Ensuring models select candidates for valid, job-relevant reasons.
  • Long-term, the economy may shift toward simple APIs that output a confidence interval for job performance, reducing human intermediation.

Challenges in Evaluating Complex Roles

  • Stack-ranking is easy when many people do the same job (e.g., 20 account executives), but extremely hard for unique roles (e.g., founders) with high variability and confounding factors.
  • A major bottleneck is ensuring all relevant context reaches the model—humans often omit informal signals (e.g., word-of-mouth feedback, interpersonal dynamics).
  • Future solutions may involve AI conducting exit interviews or analyzing recorded meetings to capture tacit knowledge.

The Evolving Data Labeling Market

  • The data labeling landscape has shifted dramatically since ChatGPT’s release:
    • Early models needed massive volumes of simple SFT/RLHF data, enabling crowdsourcing.
    • Now, as models improve, only highly skilled experts can create data that meaningfully improves them—especially in open-ended domains like medicine or strategy.
  • Mercor’s platform excels because it sources exceptional global talent quickly, aligning with this shift toward quality over volume.
  • Legacy crowdsourcing players face churn; new entrants focused on elite talent will capture market share.

Will Humans Always Be Needed in Data Labeling?

  • As long as humans can do things models cannot, there will be demand for human-generated evals to teach models those tasks.
  • Domains like math or coding may soon require little human data due to verifiability, but most knowledge work (e.g., evaluating founders, consultants) remains open-ended and hard to verify.
  • This implies an orders-of-magnitude growth in the human eval market over time.

Expanding Beyond Coding

  • Mercor started by connecting undervalued global coders with U.S. companies—a natural fit given coding’s clear evaluation criteria.
  • Expanding into fuzzier domains (e.g., doctors, consultants) requires domain experts to co-design assessments and evals.
    • Example: Evaluating doctors involves identifying proxies like diagnostic reasoning patterns or learning agility.
  • Researchers increasingly rely on subject-matter experts to interpret model failures and build accurate error taxonomies.

Company Vision and Strategic Sequencing

  • Mercor’s long-term vision is a global unified labor market where every candidate applies once and every company hires from the same pool.
  • Their strategy follows a wedge approach:
    1. Start with high-demand, short-cycle data labeling (proven product-market fit).
    2. Expand into high-volume contractor hiring (competing with Accenture/Deloitte).
    3. Move into full-time hiring across all roles.
  • Even in Year 1, they hired contractors for non-data tasks—showing continuity toward end-to-end labor automation.

Key Inflection Point: The xAI Meeting

  • In August 2023, a customer introduced Mercor to xAI’s co-founders at Tesla’s office.
    • Within two days, Mercor met the entire xAI founding team (except Elon).
    • This revealed explosive, unmet demand for elite global engineering talent—just as the market shifted toward needing high-quality human data.
  • Though xAI wasn’t ready for human data yet, this signaled the coming wave. Six months later, Mercor scaled rapidly with frontier labs.

Using Mercor’s Own Product

  • Mercor uses its AI interviewer for all non-executive roles.
    • For executives, humans still conduct first interviews (for relationship-building, not vetting).
  • In one case, replacing human case studies with AI pre-screening increased on-site conversion rates due to better standardization and objectivity.
  • They also use marketplace experts to build internal evals, mirroring customer workflows.

Multimodal and Future Capabilities

  • Video and multimodal analysis are key frontiers.
    • RL could help models search vast video token spaces for signals like excitement or cheating.
    • Success depends on creating targeted data that trains models to attend to relevant cues.

The Future of Knowledge Work

  • Foody believes a huge portion of knowledge work will shift toward creating evals—defining what good looks like in domains models can’t yet handle.
    • This won’t necessarily look like today’s annotation tools; it may involve dynamic conversations with AI about problem-solving approaches.
  • People should optimize for fast learning rates and deep familiarity with AI’s strengths/weaknesses in their domain.
  • Working extensively with AI builds crucial judgment about when to use it—and when not to.

Skills and Career Advice

  • Focus on elastic-demand domains where AI amplifies need (e.g., software engineering) rather than fixed-demand roles (e.g., accounting).
    • Example: Making engineers 10x more productive likely increases total demand for software work.
  • Avoid over-optimizing for today’s “hot” skills; instead, develop adaptability and AI collaboration fluency.

Why End-to-End Over Copilot?

  • Mercor chose to automate hiring end-to-end rather than build copilots for recruiters.
    • Copilots assume the job stays the same—but many roles will be restructured or eliminated.
    • End-to-end systems learn from feedback loops and improve predictions autonomously.
  • This model was validated by the data labeling market’s structure, which allowed full automation early.

Fundraising and Growth Strategy

  • Mercor raised its seed round primarily to justify founders dropping out of college.
  • Series A and B were preemptive, keeping dilution low (~5%) to fund:
    • Supply-side growth (referral incentives, consumer-style product features).
    • Post-training data and eval development to improve prediction models.
  • Their ML team’s main bottleneck is creating more evals and RL environments—a challenge aligned with their core business.

Foundation Model Landscape

  • Foody sees OpenAI as a product company, not an API provider—most API capabilities will commoditize.
  • The market is large enough for multiple labs to dominate verticals (e.g., one building an AI hedge fund).
  • Customization at the application layer will drive value, especially for specialized use cases requiring sophisticated labeling.

Biggest Unknown: What Will Humans Do?

  • The central question shaping Mercor’s mission: What roles will humans play in 5–10 years?
    • Many jobs will be automated, but new opportunities will emerge around defining, training, and overseeing AI systems.
  • Policymakers are overly focused on China competition and existential safety, neglecting near-term workforce disruption.
    • Regulators must proactively manage public expectations and guide education/training for the next generation of jobs.

Quickfire Takes

  • Underhyped: Evals—they’re critical and still underestimated.
  • Overhyped: Legacy SFT/RLHF data—companies are overspending on outdated data types.
  • Changed mind: AI software engineers will arrive sooner than expected—likely late 2025 or early 2026.
    • Impact depends on role elasticity: software engineering will evolve, not disappear.
  • Most excited about: OpenAI’s coding agents and stealth startups building custom vertical agents.
  • If starting today: Would pick a high-impact knowledge work vertical (e.g., finance) and build custom agents to automate it—focusing on societal value over pure profit.
Back to Unsupervised Learning