Chip design from the bottom up – Reiner Pope

Dwarkesh Podcast 1h20 6 min #120

Summary

  • This episode is a deep dive into how AI chips actually work, from the smallest logic gates up to full chip architecture. Reiner Pope, CEO of MatX, walks through the physical and computational primitives that make AI chips function, explaining why certain design choices dominate and what tradeoffs keep chip architects up at night.

The fundamental computation: multiply-accumulate

  • The core operation in AI chips is the multiply-accumulate (MAC): multiply two numbers, then add the result to a running sum.
    • This is the natural primitive because matrix multiplication, which dominates AI workloads, is just a nested loop of MAC operations.
    • In a matrix multiply output[i,k] += input[i,j] * other[j,k], every single step is a MAC.
    • The accumulator needs higher precision than the inputs because rounding errors accumulate across many additions, while each multiplication only introduces error once. This is why a 4-bit multiply is paired with an 8-bit accumulate.
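
As a minimal sketch of the point above, the loop below writes a matrix multiply as nothing but multiply-accumulate steps. The function name `matmul_mac` and the small integer matrices are illustrative assumptions, not something from the episode.

```python
def matmul_mac(a, b):
    """Matrix multiply written as nested loops of multiply-accumulate steps."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    macs = 0
    for i in range(rows):
        for k in range(cols):
            acc = 0  # the running sum (accumulator)
            for j in range(inner):
                acc += a[i][j] * b[j][k]  # one MAC: multiply, then add
                macs += 1
            out[i][k] = acc
    return out, macs


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result, mac_count = matmul_mac(a, b)
print(result)     # [[58, 64], [139, 154]]
print(mac_count)  # 12

# The largest product of two unsigned 4-bit values already needs 8 bits.
print((15 * 15).bit_length())  # 8
```

The last line shows one reason the accumulator is wider than the inputs: a single 4-bit by 4-bit product can already fill 8 bits.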

How a MAC circuit is built from logic gates

  • The simplest case is multiplying two 4-bit numbers and adding an 8-bit accumulator.
    • Partial products: Each bit of one number is ANDed with every bit of the other, producing 16 partial products (4×4). Each partial product is just an AND gate.
    • Summation: All partial products plus the accumulator must be added together. This is done with full adders (also called 3→2 compressors), which take three single-bit inputs and produce two outputs (sum and carry) representing the count in binary.
    • The summation proceeds column by column, repeatedly applying full adders until a single number remains. This column-compression scheme is the basis of the Dadda multiplier.
    • For a p-bit by q-bit multiply-accumulate, you need p×q AND gates and approximately p×q full adders. The total circuit area scales quadratically with bit width.
      • This quadratic scaling is the single biggest reason low-precision arithmetic (FP4, INT4) is so effective for neural networks: halving bit width doesn't just double throughput; it cuts the MAC circuit to roughly a quarter of the area, so about four times as many MACs fit in the same silicon.
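
The construction above can be simulated bit by bit. This is a rough sketch using a simple column-by-column reduction, not an optimized Dadda schedule; `full_adder` and `mac_bits` are hypothetical names, inputs are assumed unsigned, and a two-input addition is counted as a full adder with one input tied to zero.

```python
def full_adder(a, b, c):
    """3->2 compressor: three input bits in, (sum bit, carry bit) out."""
    total = a + b + c
    return total & 1, total >> 1


def mac_bits(x, y, acc, p=4, q=4):
    """Compute x*y + acc from AND gates and full adders only.

    x is p bits, y is q bits, acc is p+q bits (all unsigned).
    Returns (result, and_gate_count, full_adder_count).
    """
    width = p + q + 1  # columns needed to hold product plus accumulator
    columns = [[] for _ in range(width + 1)]
    and_gates = 0
    # Partial products: bit i of x AND bit j of y has weight 2**(i+j).
    for i in range(p):
        for j in range(q):
            columns[i + j].append(((x >> i) & 1) & ((y >> j) & 1))
            and_gates += 1
    # Accumulator bits join the column that matches their weight.
    for k in range(p + q):
        columns[k].append((acc >> k) & 1)
    # Compress each column to one bit, pushing carries to the next column.
    adders = 0
    for k in range(width):
        col = columns[k]
        while len(col) > 1:
            a, b = col.pop(), col.pop()
            c = col.pop() if col else 0  # third input tied to 0 if absent
            s, carry = full_adder(a, b, c)
            col.append(s)
            columns[k + 1].append(carry)
            adders += 1
    result = sum(bit << k for k, col in enumerate(columns) for bit in col)
    return result, and_gates, adders


result, ands4, adders4 = mac_bits(13, 11, 200)
print(result == 13 * 11 + 200, ands4)  # True 16

_, ands8, adders8 = mac_bits(200, 100, 5000, p=8, q=8)
print(ands8 / ands4)  # 4.0
```

Doubling the bit width from 4 to 8 quadruples the AND-gate count, which is the quadratic scaling described above; the adder counts here depend on this particular reduction order.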

Data movement is the dominant cost

  • In a traditional processor (CPU or CUDA core), the multiply-accumulate unit is tiny compared to the infrastructure around it.
    • A register file stores values, and muxes (multiplexers) select which registers feed into the ALU.
    • An n-input mux on p-bit values requires n×p AND gates and (n-1)×p OR gates.
    • A MAC needs three inputs (two multiplicands and one accumulator), so there are three muxes. For a small 8-entry register file, their AND gates alone come to 3×8×p = 24×p gates, already far more than the 4×p gates in the 4-bit multiply-accumulate circuit itself. Roughly 7/8 of the circuit area is spent on data movement, not computation.
    • This was the state of play before Tensor Cores: most of the chip was overhead, not useful work.
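
A back-of-the-envelope check of these figures, assuming every gate costs the same area (a simplification) and taking the text's 4×p gates for the MAC. The function names are hypothetical. Counting only the AND gates of the three muxes gives a lower estimate; counting the OR gates as well gives a higher one.

```python
from fractions import Fraction


def mux_gate_counts(n, p):
    """(AND gates, OR gates) for an n-input mux over p-bit values."""
    return n * p, (n - 1) * p


def data_movement_share(n, p, mac_gates, count_or_gates):
    """Fraction of gates spent on three input muxes rather than the MAC."""
    ands, ors = mux_gate_counts(n, p)
    per_mux = ands + (ors if count_or_gates else 0)
    movement = 3 * per_mux  # one mux per MAC input
    return Fraction(movement, movement + mac_gates)


p = 4
mac_gates = 4 * p  # the text's figure for the multiply-accumulate circuit
low = data_movement_share(8, p, mac_gates, count_or_gates=False)
high = data_movement_share(8, p, mac_gates, count_or_gates=True)
print(low)   # 6/7
print(high)  # 45/49
```

Under these assumptions the quoted "roughly 7/8" sits between the two estimates.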

Systolic arrays: baking the loop into hardware

  • The key insight behind Tensor Cores and systolic arrays is to move up one level of abstraction and hardwire an entire matrix-vector multiply, not just a single MAC.
    • Instead of reading a full matrix from the register file every cycle, the weight matrix is stored locally in registers physically adjacent to the MAC units. This exploits the fact that in matrix multiplication, the same weights are reused across many different input vectors.
    • The matrix is loaded slowly (one row per clock cycle, daisy-chained across the array), which reduces the wiring bandwidth needed at the boundary of the systolic array to O(n) rather than O(n²).
    • Inputs flow in one direction, partial sums flow in the perpendicular direction, and results emerge at the bottom. Each column performs a dot product spatially.
    • This dramatically improves the ratio of compute to communication. A 128×128 systolic array (as in older TPUs) amortizes the register file and data movement costs across 16,384 MAC units.
    • The same principle—maximizing compute per unit of communication—shows up at every level of the stack, from gate-level precision choices to data-center-scale inference across chips.
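
The dataflow described above can be sketched as a small cycle-by-cycle simulation. This is a simplified weight-stationary model under stated assumptions: weights are already loaded (the daisy-chained load is not modeled), inputs enter from the left with a one-cycle skew per row, and partial sums move down one cell per cycle. The name `systolic_stream` is hypothetical.

```python
def systolic_stream(w, xs):
    """Stream input vectors through an n x n weight-stationary array.

    Cell (r, c) holds w[r][c]. Column c produces sum_r w[r][c] * x[r].
    Returns (outputs, cycles_used).
    """
    n, num_vecs = len(w), len(xs)
    x_reg = [[0] * n for _ in range(n)]  # input value latched in each cell
    p_reg = [[0] * n for _ in range(n)]  # partial sum latched in each cell
    ys = [[0] * n for _ in range(num_vecs)]
    total_cycles = num_vecs + 2 * n - 2
    for cycle in range(total_cycles):
        new_x = [[0] * n for _ in range(n)]
        new_p = [[0] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                if c == 0:
                    t = cycle - r  # row r is fed r cycles late (skew)
                    x_in = xs[t][r] if 0 <= t < num_vecs else 0
                else:
                    x_in = x_reg[r][c - 1]  # from the left neighbor
                p_in = p_reg[r - 1][c] if r > 0 else 0  # from above
                new_x[r][c] = x_in
                new_p[r][c] = p_in + w[r][c] * x_in  # one MAC per cell
        x_reg, p_reg = new_x, new_p
        for c in range(n):  # results emerge from the bottom row
            t = cycle - c - (n - 1)
            if 0 <= t < num_vecs:
                ys[t][c] = p_reg[n - 1][c]
    return ys, total_cycles


w = [[1, 2], [3, 4]]
xs = [[1, 0], [0, 1], [1, 1]]
ys, cycles = systolic_stream(w, xs)
print(ys)      # [[1, 2], [3, 4], [4, 6]]
print(cycles)  # 5
```

Each cycle only n input values cross the array boundary while up to n² cells do a MAC, which is the compute-to-communication gain the section describes.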

Clock cycles, pipelining, and the speed-throughput tradeoff

  • Chips synchronize billions of transistors using a global clock signal. Every nanosecond or so, all registers simultaneously capture their inputs and the chip advances one step.
    • The clock speed is limited by the longest combinational path (critical path) between any two registers. If logic isn’t finished by the next clock edge, the result is wrong.
    • Pipeline registers can be inserted to split long paths in half, roughly doubling the achievable clock frequency, but at the cost of extra register area and extra cycles of latency.
    • There's a fundamental tradeoff: a higher clock speed shortens each cycle, which helps latency, but it can lower throughput per unit of area, because more of the chip is spent on pipeline registers instead of compute logic. Throughput = work per clock × clocks per second.
    • Loops in logic (e.g., an accumulator that feeds back into itself) are the hardest constraint. You can’t insert a pipeline register in the middle without changing the computation (splitting one running sum into two interleaved sums). These loops often set the maximum clock speed.
    • Two chips at the same process node can have different maximum clock speeds, both because of manufacturing variance and because of how well their critical paths were optimized.
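
A small sketch of two ideas from this section, with hypothetical function names and made-up path delays: the clock period must cover the critical path, and pipelining an accumulator loop changes the computation into several interleaved running sums.

```python
def max_clock_ghz(critical_path_ns):
    """The clock period must cover the slowest register-to-register path."""
    return 1.0 / critical_path_ns


def accumulate_single(values):
    """One running sum: every add depends on the previous add (a loop)."""
    acc = 0
    for v in values:
        acc += v
    return acc


def accumulate_interleaved(values, lanes=2):
    """Pipelined variant: `lanes` independent running sums, merged at the end."""
    partial = [0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sum(partial), partial


# Halving the critical path (e.g. with a pipeline register) doubles the clock.
print(max_clock_ghz(1.0), max_clock_ghz(0.5))  # 1.0 2.0

values = [1, 2, 3, 4, 5]
print(accumulate_single(values))       # 15
print(accumulate_interleaved(values))  # (15, [9, 6])
```

With integers the two forms agree. With floating-point values the reordering can change rounding, which is one reason the interleaved form counts as a different computation.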

FPGAs vs. ASICs

  • FPGAs (Field-Programmable Gate Arrays) and ASICs use the same conceptual model—gates, registers, wires—but FPGAs are reprogrammable.
    • An FPGA consists of lookup tables (LUTs), registers, and a swarm of muxes connecting everything. A LUT with 4 inputs stores a 16-entry truth table and can emulate any 4-input logic function.
    • Programming an FPGA means configuring all the muxes and LUTs to create the desired circuit.
    • An FPGA is roughly 10x less area-efficient than an ASIC because a LUT that implements a simple 4-input AND function still requires roughly 32 gates (a 16:1 mux), versus 3 two-input AND gates in an ASIC.
    • The tradeoff: an FPGA costs ~$10,000 per unit but can be reprogrammed; an ASIC costs ~$30 million for the first tape-out but is far cheaper and more efficient at scale.
    • FPGAs are used when workloads change frequently (e.g., monthly) and deterministic latency is critical, as in high-frequency trading.
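
To make the LUT idea concrete, here is a minimal sketch of a 4-input lookup table as a 16-entry truth table, plus the gate comparison using the mux formula from earlier in these notes. The helper names are hypothetical.

```python
def make_lut4(fn):
    """Build the 16-entry truth table for any 4-input boolean function."""
    return [fn((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) & 1
            for i in range(16)]


def lut4_eval(table, a, b, c, d):
    """Evaluate the LUT: the four inputs select one stored bit (a 16:1 mux)."""
    return table[(a << 3) | (b << 2) | (c << 1) | d]


def mux_gates(n, p=1):
    """n*p AND gates plus (n-1)*p OR gates for an n-input, p-bit mux."""
    return n * p + (n - 1) * p


# "Programming" the LUT just means filling in a different table.
and4 = make_lut4(lambda a, b, c, d: a & b & c & d)
xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)

print(lut4_eval(and4, 1, 1, 1, 1), lut4_eval(and4, 1, 0, 1, 1))  # 1 0
print(lut4_eval(xor4, 1, 0, 1, 1))  # 1
print(mux_gates(16))  # 31
```

By that formula the 16:1 selection mux costs 31 gates, against 3 two-input AND gates for a hardwired 4-input AND, which is consistent with the roughly 10x figure above.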

Deterministic vs. non-deterministic latency

  • CPUs have non-deterministic latency primarily because of caches. Whether a memory access hits or misses the cache depends on ambient conditions (other programs, recent history), making execution time unpredictable.
  • AI chips and FPGAs use a scratchpad model instead: software explicitly manages which data is in fast on-chip memory (scratchpad) versus slow off-chip memory (HBM/DDR). There is no hidden cache lookup.
    • This gives deterministic latency: the programmer knows exactly when data will be available.
    • It’s possible to build a deterministic CPU, but the design choices required (no cache, no branch prediction) make them uncompetitive in general-purpose markets.
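
The contrast can be sketched with a toy model. All latencies (1 cycle for a hit or scratchpad access, 100 cycles for a miss or explicit load) and the 4-line direct-mapped cache are made-up assumptions for illustration, and the function names are hypothetical.

```python
def cached_latency(addresses, warm=(), lines=4, hit=1, miss=100):
    """Total cycles with a direct-mapped cache whose starting state is `warm`."""
    cache = {}
    for a in warm:  # whatever ran before left this state behind
        cache[a % lines] = a
    total = 0
    for a in addresses:
        slot = a % lines
        if cache.get(slot) == a:
            total += hit
        else:
            total += miss
            cache[slot] = a
    return total


def scratchpad_latency(addresses, load=100, access=1):
    """Software loads each distinct address once, then every access is fast."""
    return len(set(addresses)) * load + len(addresses) * access


program = [0, 1, 2, 3, 0, 1, 2, 3]
print(cached_latency(program))                     # 404
print(cached_latency(program, warm=(0, 1, 2, 3)))  # 8
print(cached_latency(program, warm=(4, 5, 6, 7)))  # 404
print(scratchpad_latency(program))                 # 408
```

The same program takes 8 or 404 cycles on the cached model depending on what ran before it, while the scratchpad model gives the same answer every time.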

GPU vs. TPU architecture

  • A GPU tiles many small, nearly identical streaming multiprocessors (SMs) across the die, each with its own registers, schedulers, and a small Tensor Core. This is like having many tiny TPUs.
    • Advantages: flexible, fine-grained data movement between many units, good for irregular workloads.
    • Constraints: each SM is small, so the systolic array within it is small, and the overhead (register files, warp schedulers) is amortized over fewer MAC units.
  • A TPU uses a coarse-grained design: a few large systolic arrays (matrix units) with a separate vector unit, all on one chip.
    • Advantages: larger systolic arrays amortize overhead better, more compute per unit of communication.
    • Disadvantages: data must move between the vector unit and matrix units through limited perimeter wiring, creating a bottleneck.
  • MatX’s approach involves a “splittable systolic array”—large systolic arrays that can also operate as smaller ones, trying to get the best of both designs.
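
A rough way to see the amortization argument: an n×n systolic array has n² MAC units but only about n lanes of data crossing each edge. The sketch below compares one large array with many small ones holding the same total number of MACs. The 16×16 tile size is an illustrative assumption, not the dimensions of any real Tensor Core, and `tiling` is a hypothetical helper.

```python
def tiling(total_macs, n):
    """Split total_macs into n x n arrays.

    Returns (number of arrays, total boundary lanes per edge direction,
    MACs per boundary lane).
    """
    arrays = total_macs // (n * n)
    boundary_lanes = arrays * n
    return arrays, boundary_lanes, total_macs / boundary_lanes


print(tiling(16384, 128))  # (1, 128, 128.0)
print(tiling(16384, 16))   # (64, 1024, 16.0)
```

Under this simple model the single large array does 8 times more compute per lane of boundary traffic, at the price of coarser granularity.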

Brains vs. chips

  • Key differences between brains and silicon:
    • Clock speed: The brain operates at ~100 Hz; chips at ~1 GHz. Higher clock speed drives throughput (the brain runs batch size 1; a GPU runs batch size ~1,000).
    • Energy: Chip energy consumption is dominated by dynamic power (charging and discharging capacitors when bits toggle). Running a chip 1,000× slower cuts power by ~1,000× but doesn't fundamentally change the energy used per operation.
    • Sparsity: The brain has unstructured sparsity (any neuron can connect to any other); chips use structured sparsity to save area.
    • Memory and compute: Both brains and AI chips co-locate memory and compute to reduce data movement, though the brain does this far more extensively.
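
The energy point can be written with the standard first-order model for dynamic power in CMOS logic. The symbols are not from the episode: \alpha is the fraction of capacitance that toggles per cycle, C the switched capacitance, V the supply voltage, and f the clock frequency. Voltage is assumed fixed.

```latex
P_{\text{dyn}} = \alpha \, C \, V^{2} f
\qquad\Longrightarrow\qquad
E_{\text{cycle}} = \frac{P_{\text{dyn}}}{f} = \alpha \, C \, V^{2}
```

Power is proportional to f, so a 1,000× slower clock draws about 1,000× less power, but the energy per cycle (and so per operation) does not depend on f in this model.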

Why Nvidia’s FP4 is 3× faster than FP8 (not 2× or 4×)

  • Historically, halving precision doubled FLOP throughput (linear scaling). But because circuit area scales quadratically with bit width, halving precision should in principle quadruple throughput for the same area.
  • Nvidia’s B300 and beyond report FP4 as 3× faster than FP8, acknowledging the quadratic scaling effect but not fully reaching 4×, likely due to data movement and memory bandwidth constraints that don’t scale as favorably.
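
The three scaling regimes can be compared with a few lines of arithmetic. This sketch treats a format as just a bit width, which glosses over how floating-point formats split bits between exponent and mantissa; `relative_throughput` is a hypothetical helper.

```python
import math


def relative_throughput(bits_from, bits_to, exponent):
    """Speedup from narrowing operands if cost scales as bits**exponent."""
    return (bits_from / bits_to) ** exponent


linear = relative_throughput(8, 4, 1)     # cost proportional to bit width
quadratic = relative_throughput(8, 4, 2)  # cost proportional to multiplier area
print(linear, quadratic)  # 2.0 4.0

# Exponent that would produce the reported 3x going from 8 to 4 bits.
implied_exponent = math.log(3, 2)
print(round(implied_exponent, 2))  # 1.58
```

An effective exponent of about 1.58 sits between the linear and quadratic cases, in line with the idea that part of the cost (the multipliers) scales quadratically while data movement does not.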