CTO Unveils Lessons from Building an AI Coding Assistant

Unsupervised Learning 1h36 7 min #12

Summary

  • Sourcegraph CTO and co-founder Beyang Liu joins the podcast to discuss AI coding tools, how Sourcegraph built Cody (an AI coding assistant), practical lessons from deploying RAG and AI features at scale, and what the future of software engineering looks like as AI becomes more capable.
    • Sourcegraph’s two main products are a code search engine (helping developers understand large codebases) and Cody (an AI coding assistant for inline completion, chat, and automated commands like writing tests or docstrings).
    • Cody’s key differentiator is its deep integration with Sourcegraph’s code search, giving it context-awareness about a user’s specific codebase, frameworks, and production environment — going beyond generic Stack Overflow-style answers.

Advice for the Next Generation of Coders

  • Despite fears that AI will replace programmers, Beyang believes there has never been a better time to get into software.
    • AI will expand the horizons of what’s possible in software creation, but the essence of the job — connecting end-user value to underlying technical primitives — remains critical.
    • The “middleware” layer between user-facing applications and low-level computing will need to be rethought, but the fundamental skill of technical thinking stays essential.

Where AI Coding Works Today

  • The current state of AI coding is best described as “inner loop acceleration” — tools that speed up the daily iterative cycle of writing, testing, and refining code.
    • Inline code completion and codebase-aware chat are in heavy daily use and reliably helpful, especially for commonplace functions and boilerplate.
    • Full multi-step autonomous “bot-driven development” (like Cognition’s Devin) shows promise but still requires a human watching and checking; no one has fully cracked reliable end-to-end automation yet.
    • Moving from inner-loop assistance to full automation requires: a virtual execution environment for trial-and-error, excellent context retrieval from the codebase, and a feedback loop where the model learns from its attempts.
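
The three requirements above can be sketched as a small loop. This is a minimal illustration, not how Cody or Devin is implemented: `propose` stands in for a model call, `execute` for a sandboxed run of the candidate, and these names and the toy arithmetic task are assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple


def solve_with_feedback(
    propose: Callable[[str, List[str]], str],
    execute: Callable[[str], Tuple[bool, str]],
    context: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Propose a candidate, run it, and feed failures into the next attempt."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        candidate = propose(context, feedback)   # "model" sees context + past failures
        ok, message = execute(candidate)         # trial run in the execution environment
        if ok:
            return candidate
        feedback.append(message)                 # failure becomes input to the next try
    return None                                  # give up; a human takes over


# Toy stand-ins (hypothetical): the "model" gets it wrong once, then corrects itself.
def toy_propose(context: str, feedback: List[str]) -> str:
    return "a - b" if not feedback else "a + b"


def toy_execute(candidate: str) -> Tuple[bool, str]:
    # Evaluate the expression with no builtins and fixed inputs a=2, b=3.
    value = eval(candidate, {"__builtins__": {}}, {"a": 2, "b": 3})
    return (True, "ok") if value == 5 else (False, f"expected 5, got {value}")


result = solve_with_feedback(toy_propose, toy_execute, context="add a and b")
```

The `max_attempts` cap reflects the point that a human still has to watch and check: when the loop runs out of attempts it returns `None` instead of guessing.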

How Model Improvements Unlock New Capabilities

  • GPT-4 vs. GPT-3.5: The jump to GPT-4 significantly improved Cody’s ability to integrate retrieved codebase context into working code examples, making chat interactions much more reliable.
    • A demo that was impossible with GPT-3.5 — zero-shot generation of a stock-tracking app using specific libraries (React, Alpha Vantage API, Recharts) — became feasible with GPT-4 combined with Sourcegraph’s context.
  • Context windows: Larger context windows help, but simply stuffing an entire codebase into the window is not as effective as tailored retrieval. The best architectures combine large context windows with purpose-built RAG.
  • When GPT-5 (or any new model) launches, Sourcegraph’s plan is to make it available in Cody immediately, let users choose it, and observe real-world product metrics rather than gatekeeping behind internal evals.
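
Pairing a large context window with tailored retrieval still means deciding which retrieved snippets go into the prompt. A minimal sketch, assuming snippets already carry relevance scores from some retriever and that token cost can be approximated by whitespace word count (a real system would use the model's tokenizer). `pack_context` and the sample snippets are hypothetical.

```python
from typing import List, Tuple


def pack_context(snippets: List[Tuple[float, str]], budget_tokens: int) -> List[str]:
    """Greedily keep the highest-scored snippets that fit within the token budget."""
    chosen: List[str] = []
    used = 0
    # Sort by score only, highest first.
    for score, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = len(text.split())  # crude token estimate (assumption)
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen


snippets = [
    (0.9, "def fetch_quote(symbol): ..."),
    (0.2, "filler " * 50),                 # large, low-relevance snippet
    (0.7, "import recharts wrapper"),
]
packed = pack_context(snippets, budget_tokens=10)
```

With a budget of 10 estimated tokens, the two small, high-scoring snippets are kept and the 50-word filler is dropped, which is the behavior that plain "stuff everything in" does not give you.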

Who Benefits Most from AI Coding Tools

  • There is a rough (not absolute) segmentation:
    • Junior developers tend to get more value from inline completions, which act as a pedagogical tool showing them the “median” way to do something.
    • Senior engineers tend to prefer chat and often dislike inline completions because incorrect suggestions disrupt their flow and they already know exactly what they want to write.
  • Sourcegraph has partially addressed senior developer complaints by improving context quality, making completions more accurate and less disruptive.

RAG, Search, and Context Engineering

  • Sourcegraph’s RAG pipeline evolved from simple keyword search to a more sophisticated system, but keyword search remains a core component.
    • They started with classical keyword search because it’s easy to implement, understand, and iterate on — and for a long time this alone put them at the frontier of context quality.
    • Naive vector embeddings + nearest-neighbor retrieval often performed worse than keyword search in practice, producing noisy results and missing obvious answers.
    • The best results come from combining multiple retrieval strategies (keyword search, semantic search, reference graph walking, usage example finding) tailored to the specific codebase.
  • They built their RAG engine in-house rather than using external vendors because the technology was so early that no external solution was clearly better, and building in-house avoided abstraction boundaries that would slow down learning and differentiation.
  • Search and RAG are treated as two parallel threads within the company — one focused on end-user search quality, the other on context retrieval for AI — with the expectation that improvements in each will benefit the other.
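
One common way to combine several retrieval strategies is reciprocal rank fusion (RRF), which merges ranked lists without needing comparable scores. The episode does not say which fusion method Sourcegraph uses, so treat this as an illustrative sketch: the file names and the three input rankings are made up, and `k = 60` is simply a conventional default.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of three retrievers for the same query.
keyword_hits = ["auth.py", "login.py", "util.py"]
semantic_hits = ["session.py", "auth.py", "login.py"]
graph_hits = ["auth.py", "session.py"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits, graph_hits])
```

A file that several retrievers agree on (here `auth.py`) rises to the top even if no single retriever is fully trusted, which matches the observation that no single strategy, including embeddings alone, is reliable by itself.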

Model Evaluation Philosophy

  • Sourcegraph relies primarily on product metrics (acceptance rate, engagement, usage volume) rather than offline benchmarks to evaluate models.
    • Acceptance rate for completions and engagement for explicitly triggered features (chat, generate test) are the key metrics.
    • They have found cases where a model that benchmarks lower actually produces better day-to-day user results.
  • They do maintain offline benchmarks for new models, especially for retrieval quality, but the ultimate test is whether users find the output valuable enough to accept and use.
  • Rather than internally evaluating every model in every context, they give users freedom to choose among frontier models (Claude 3, GPT-4, Mixtral, etc.) and learn from the aggregate signal.
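
The acceptance-rate signal described above is simple to compute once completion events are logged per model. A minimal sketch, assuming each event is a `(model, accepted)` pair; the event shape and the model names are hypothetical, not Sourcegraph's telemetry schema.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def acceptance_rates(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return accepted / shown for each model that produced at least one suggestion."""
    shown: Counter = Counter()
    accepted: Counter = Counter()
    for model, was_accepted in events:
        shown[model] += 1
        if was_accepted:
            accepted[model] += 1
    return {model: accepted[model] / shown[model] for model in shown}


events = [
    ("model-a", True), ("model-a", False), ("model-a", True),
    ("model-b", True), ("model-b", False),
]
rates = acceptance_rates(events)
```

Comparing these per-model rates across real usage is what lets a model that benchmarks lower still win in practice.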

Fine-Tuning and Model Strategy

  • Sourcegraph initially avoided fine-tuning, following the principle of “do the dumb thing first” — RAG was faster to iterate on and didn’t require gathering training data or paying for compute.
    • They began fine-tuning when they identified specific gaps that RAG alone couldn’t solve, particularly language-specific code generation (e.g., Rust, Ruby, MATLAB) where frontier models underperform due to limited training data.
  • Cody Ignore lets users exclude certain files from being used as context. It was originally built for sensitive files and is now evolving into a lever for engineering leaders to control code quality by keeping low-quality code out of the AI's context.
  • They use open-source models extensively; StarCoder is their primary code completion model. Open-source models offer advantages such as fine-tunability and the ability to inspect attention weights during inference.
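
The idea behind excluding files from context can be illustrated with glob-style filtering. This is a sketch of the concept only: the pattern syntax below is Python's `fnmatch`, not Cody Ignore's actual configuration format, and the paths and patterns are invented.

```python
from fnmatch import fnmatchcase
from typing import List


def filter_context(paths: List[str], ignore_patterns: List[str]) -> List[str]:
    """Drop any path matching an ignore pattern before it can be used as context."""
    return [
        path
        for path in paths
        if not any(fnmatchcase(path, pattern) for pattern in ignore_patterns)
    ]


candidate_paths = [
    "src/billing/invoice.py",
    "secrets/prod/key.pem",          # sensitive file
    "legacy/old_parser.py",          # low-quality code a team may not want imitated
    "api/client.generated.go",
]
ignore_patterns = ["secrets/*", "legacy/*", "*.generated.go"]

allowed = filter_context(candidate_paths, ignore_patterns)
```

Note that in `fnmatch` a `*` also matches `/`, so `secrets/*` covers nested directories; other ignore formats may behave differently.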

Inference Cost and Pricing

  • Inference cost is not a primary concern in the short term because costs are expected to decrease significantly over time. Sourcegraph focuses on adding value rather than optimizing for cheapest inference.
    • Rate limiters exist mainly to prevent abuse of free tiers, not to control costs from legitimate users.
  • Pricing model: per active user per month, a variant of seat-based pricing where a customer pays for a user only if that user actually logs in and uses the product in a given month.
    • This aligns incentives — customers pay in proportion to value received, and low usage signals Sourcegraph to improve.
    • Cody can be used standalone (free or Pro tier) without Sourcegraph’s code search product; enterprises can optionally connect Cody to a Sourcegraph instance for better context.
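
Active-user pricing can be worked through with a small calculation. A minimal sketch, assuming usage is logged as `(user, date)` pairs and prices are kept in integer cents; the price and the users are hypothetical and do not reflect Sourcegraph's actual rates.

```python
from datetime import date
from typing import Iterable, Tuple


def monthly_active_user_bill(
    usage_events: Iterable[Tuple[str, date]],
    year: int,
    month: int,
    price_cents_per_active_user: int,
) -> int:
    """Bill only for distinct users with at least one usage event in the given month."""
    active_users = {
        user
        for user, day in usage_events
        if day.year == year and day.month == month
    }
    return len(active_users) * price_cents_per_active_user


usage_events = [
    ("alice", date(2024, 5, 1)),
    ("alice", date(2024, 5, 20)),   # repeat use by the same user is not billed twice
    ("bob", date(2024, 4, 30)),     # used the product in April only
    ("carol", date(2024, 5, 15)),
]
may_bill = monthly_active_user_bill(usage_events, 2024, 5, price_cents_per_active_user=900)
```

Here three seats exist but only two users were active in May, so the bill covers two users. A purely seat-based plan would have charged for all three.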

Team Structure

  • Sourcegraph’s org has three main buckets: a model layer team (focused on fine-tuning and quantitative benchmarks), a code search team, and a Cody team.
    • The Cody client team started with 2 people, grew to 5–6, and is still under 10 direct engineers, though many more support it through context-fetching infrastructure.
    • Code search and Cody teams are expected to converge over time as synergies between the two products deepen.

Getting Ahead of the Agentic Future

  • Sourcegraph maintains a portfolio of experiments targeting problems amenable to full automation, building execution loops on top of their existing context providers.
    • Working on outer-loop automation (end-to-end issue resolution) improves inner-loop tools too — for example, test generation is critical for agents to verify their work, and it’s also valuable as a standalone inner-loop feature.
    • Their philosophy: high-quality context providers are cross-cutting infrastructure that will be valuable no matter what new UI or agent paradigm emerges.
    • They are increasingly thinking in terms of building blocks (APIs, custom commands, context engines) that external developers and customers can assemble, not just first-party applications.

Milestones That Would Be Meaningful

  • A significant milestone would be automatically resolving 80% of simple bugs, using production logs and stack traces as input.
  • Model quality and controllability remain key drivers — more powerful and tunable models would reduce the need for engineering workarounds and guardrails.

The Future of Engineering

  • In aggregate, the number of engineers will grow, but the day-to-day experience of being a developer will change drastically.
    • Most developers spend little of their time actually producing software; most time goes to context acquisition, understanding existing codebases, and communication overhead.
    • AI will steal the toil and drudgery, leaving more time for the “magic” parts of development — the creative, high-value work.
    • The job will be different but better.

The AI Coding Market

  • The market is huge and growing because software literacy is expanding and AI puts software creation in more people’s hands.
    • There will be room for many applications serving different domains and development styles.
    • Sourcegraph’s role is to ensure the ecosystem remains open, preserving freedom of choice for individual developers and enterprises across models, code hosts, and technology stacks.
    • They are wary of large players using AI disruption to vertically integrate and force developers onto proprietary platforms.

Over-Hyped and Under-Hyped

  • Over-hyped: AGI as a mystical concept — the idea that simply scaling Transformers will lead to superintelligence that either saves or destroys humanity is, in Beyang’s view, “not even wrong.”
  • Under-hyped: Formal specifications and formal languages as complements to AI. Natural language is not precise enough for describing what you want; programming languages and formal systems will become increasingly important, not obsolete. Math exists because natural language was too imprecise — the same logic applies to code.

Surprises Along the Way

  • Harder than expected: Context engineering — finding models that effectively integrate retrieved context and formatting that context so the model actually uses it correctly. This has been a deeper rabbit hole than anticipated.
  • Easier than expected: The pace of improvement in model output quality and efficiency. Each new generation of models has moved the needle significantly, eliminating the need for some of the hacks and workarounds Sourcegraph had built.

Cognition and Devin

  • Beyang respects the Cognition team and thinks Devin offers a compelling vision of what a fully automated AI coding UI could look like. The next step is shipping a product people use every day.

Open-Source Models

  • Open-source models will become widely used, though proprietary models will retain a distinct advantage through proprietary data, training regimes, or architecture.
    • Advantages of open-source: fine-tunability, the ability to inspect attention weights, and no dependence on a vendor's API.
    • StarCoder (open source) is Cody’s primary code completion model.

Inference Provider Choice

  • Sourcegraph chose Fireworks as their inference provider because of their customer focus and business orientation — they helped Sourcegraph spin up quickly and provided practical advice on improving inference quality and speed.
    • The three dimensions that matter for inference providers: cost, speed, and quality.

What Else Would Be Exciting to Build

  • Vertical knowledge work operating systems: AI enables rethinking how work is done in specific domains (financial services, healthcare, consulting) — building domain-specific operating systems for knowledge work.
  • Consumer “concierge as a service”: A Google-like service that actually does things for you rather than just answering queries — moving beyond GUIs to a paradigm where you tell the system what to do without scrolling, tapping, or getting distracted.