Reiner Pope is co-founder and CEO of MatX, a startup building specialized chips optimized specifically for running Large Language Models (LLMs). He previously spent years at Google as a TPU architect and worked on the JAX team, giving him deep expertise in both the hardware and software sides of AI infrastructure. MatX is designing chips that aim to deliver both high throughput (economics, tokens per dollar) and low latency (response speed) simultaneously — a combination that current chips struggle to achieve.
Google’s AI comeback and the TPU legacy
A year ago, many believed Google was “canceled” in AI: LLMs would eat search and Google had no answer. That narrative has shifted dramatically, driven by Gemini 3’s quality and speed, both of which are powered by Google’s custom TPU hardware.
Google’s AI success rests on foundational decisions: the Transformer architecture originated there, Google Brain attracted enormous talent, and the TPU program gave Google the option to design chips specifically for neural networks rather than being constrained by graphics-oriented GPU architectures like NVIDIA’s.
TPU v1 was announced in 2016 and was built by a skeleton team of roughly 20–30 people on about a 12–18 month timeline. It was a minimum viable product: one large systolic array with memory next to it. It predated the Transformer and was originally designed with LSTMs in mind, but its parallel architecture happened to be a natural fit for Transformers.
The broader lesson of both TPUs and Transformers is the importance of parallelization. Hardware is fundamentally massively parallel: a chip holds tens of billions of transistors, and a signal takes about 100 clock cycles to cross it, so work has to be spread across the chip rather than funneled through one place. Matrix multiplication is naturally parallel, and both TPUs and Transformers were designed to exploit this.
The term “mechanical sympathy” (thinking about what the machine wants) was popularized in software by high-frequency trading engineers and applies to AI hardware: the goal is to maximize the percentage of peak performance achieved. That is a meaningful question on GPUs/TPUs but is almost never asked of CPUs, because software on CPUs is so far from peak.
Why GPUs beat CPUs for AI
The key intuition is the ratio of control overhead to useful work. A CPU is like a motorcycle — most of the cost is in reading and processing instructions (“what do I do next?”). A GPU is like a truck with many trailers — the same instruction controls a much larger payload. GPUs use wide vector instructions that process many data elements per instruction, shifting cost from control to actual computation.
CPUs are optimized for complex instruction sets and fine-grained branching (steering through an obstacle course). GPUs go in a straight line for a long time. This makes GPUs far more efficient for the regular, parallel mathematical workloads in AI.
NVIDIA was uniquely positioned because GPUs were originally built for gamers — mathematically intensive graphics workloads — which happened to map perfectly onto AI computation when the demand emerged.
MatX’s approach: combining the best of both memory worlds
MatX was founded by Reiner and Mike (Google’s former chief chip architect) after they tried to push TPUs toward larger matrices and lower precision but found Google’s chips constrained by other workloads like ads. A startup could take a focused bet on LLMs.
The two metrics that matter for LLM chips are throughput (tokens per second, determining dollars per token) and latency (how fast a response comes). Historically, chips have faced an uncomfortable trade-off between the two:
HBM-based chips (Google TPUs, NVIDIA, Amazon) achieve high throughput by keeping many inferences in flight simultaneously, but latency is constrained by HBM read time (~20ms per token).
SRAM-based chips (Groq, Cerebras) achieve very low latency (~1ms per token) because SRAM is fast, but throughput and dollars per token are poor because SRAM is small and expensive.
MatX’s core idea is to put both HBM and SRAM on the same chip: weights live in SRAM (fast access, low latency) while inference data lives in HBM (large capacity, high throughput). This hits a sweet spot of low latency and high throughput simultaneously.
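The trade-off above can be sketched with a bandwidth-only model of one decode step. This is a rough illustration, not MatX's actual performance model: the function name, the assumption that each step reads all weights once plus each in-flight request's state once, and every number below are hypothetical and chosen only to show the shape of the argument.

```python
def decode_step_ms(weight_bytes, state_bytes_per_request, batch,
                   weight_bw, state_bw):
    """Bandwidth-only lower bound on one decode step, in milliseconds.

    Assumes each step reads all weights once and each request's state once,
    and ignores compute time and interconnect entirely.
    """
    seconds = (weight_bytes / weight_bw
               + batch * state_bytes_per_request / state_bw)
    return seconds * 1e3


# Hypothetical values for illustration only; not MatX or vendor specifications.
WEIGHT_BYTES = 20e9             # bytes of weights read per step (assumed)
STATE_BYTES_PER_REQUEST = 0.1e9  # per-request inference data (assumed)
BATCH = 32                      # requests in flight (assumed)
HBM_BW = 3e12                   # bytes/s (assumed)
SRAM_BW = 60e12                 # bytes/s (assumed)

# Everything in HBM versus weights in SRAM and per-request data in HBM.
all_hbm_ms = decode_step_ms(WEIGHT_BYTES, STATE_BYTES_PER_REQUEST, BATCH,
                            HBM_BW, HBM_BW)
hybrid_ms = decode_step_ms(WEIGHT_BYTES, STATE_BYTES_PER_REQUEST, BATCH,
                           SRAM_BW, HBM_BW)
print(f"all-HBM: {all_hbm_ms:.2f} ms/step, weights-in-SRAM: {hybrid_ms:.2f} ms/step")
```

Under these assumed numbers the weight read dominates the all-HBM step, so moving only the weights to faster memory removes most of the latency while the large per-request data stays in the cheaper, larger memory.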
MatX raised a $500 million Series B led by Jane Street and Leopold Aschenbrenner’s Situational Awareness fund. The capital is needed not for chip design but for ramping manufacturing and supply chain — the actual production of chips at scale requires massive investment in wafers, HBM, rack manufacturing, and data center infrastructure.
Reiner estimates it costs roughly $100 million to produce a chip in small volumes, but the real customers (OpenAI, Anthropic, Google) are buying multi-gigawatt clusters costing tens of billions, all to be deployed within about a year. MatX aims to ship multiple gigawatts of chips per year.
The AI supply chain crunch
The AI buildout faces real bottlenecks across the entire supply chain:
Logic wafers from TSMC or Samsung
HBM from the big three vendors (Hynix, Samsung, Micron)
Rack manufacturing — sheet metal, cables, connectors with high signal integrity requirements for high-speed interconnect
Data centers — primarily power availability and infrastructure
Racks are “sneaky hard”: they must deliver enormous power, extract enormous heat, and maintain signal integrity across dense high-speed cabling that can’t bend too much.
As a startup, MatX’s approach to securing supply is to show up with ironclad customer contracts. The $500 million round also helps signal to suppliers that MatX is well-capitalized and will be around. Some parts of the supply chain (like logic wafers) are fungible, while others require custom manufacturing setups that MatX can now fund.
MatX chip architecture in detail
Three key architectural decisions:
Memory system: Combine HBM and SRAM on the same chip (weights in SRAM, inference data in HBM).
Systolic array: Use a very large systolic array — the gold standard for area- and power-efficient matrix multiplication. The key insight is that inefficiencies appear when you leave the systolic array, so making it bigger means you leave it less often. The challenge is that the attention mechanism in Transformers doesn’t map well onto large systolic arrays (mixture-of-experts layers do). MatX’s solution is a large systolic array that can be split into pieces without losing efficiency.
Low-precision arithmetic: Number formats have gotten progressively narrower (from Float32 down to 16-bit, 8-bit, and below). MatX supports a range of precisions, likely centered around 4-bit (16 possible values), with mixed precision across layers. MatX has an in-house ML team that trains small LLMs from scratch to research numerics, allowing them to make “sloppy” choices on rounding modes and corner cases that would be risky without experimental validation.
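To make "4-bit (16 possible values)" concrete, here is a minimal sketch of symmetric integer quantization. The scheme and the function names are assumptions for illustration; the source does not say which 4-bit format or rounding mode MatX uses.

```python
def quantize_int4(values):
    """Symmetric per-tensor quantization to signed 4-bit codes in [-8, 7]."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude onto code 7; guard against an all-zero input.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Python's round() rounds ties to the nearest even integer. That is one
    # possible rounding mode, and exactly the kind of choice that has to be
    # validated experimentally.
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate real values from 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.0, 7.0, -7.0, 3.0, 0.2]
codes, scale = quantize_int4(weights)
approx = dequantize(codes, scale)
```

Every value ends up as one of only 16 codes, so small weights such as 0.2 collapse to the nearest representable step. How much of that error a model tolerates is an empirical question, which is why training small models to test numerics matters.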
The ML team is unusual for a chip company. Rather than just writing kernels (ML engineering), they do actual ML research — training models to validate numerical design choices. This co-optimization of hardware and model numerics is a key differentiator.
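Returning to the systolic array decision above, the following is a small cycle-level sketch of a textbook output-stationary systolic array. It is a generic model, not MatX's design; the function names and the dataflow choice are assumptions. It illustrates one source of inefficiency at the edges of a systolic computation: the array sits partly idle while it fills and drains, and longer streams amortize that cost.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    PE (i, j) owns C[i][j]. Row i of A enters from the left delayed by i
    cycles, and column j of B enters from the top delayed by j cycles, so
    PE (i, j) sees the pair (A[i][k], B[k][j]) at cycle t = i + j + k.
    """
    n, depth, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    cycles = n + m + depth - 2  # the last PE does its last product at cycles - 1
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                k = t - i - j
                if 0 <= k < depth:  # otherwise this PE is idle (fill or drain)
                    C[i][j] += A[i][k] * B[k][j]
    return C, cycles


def utilization(n, m, depth):
    """Fraction of PE-cycles spent on useful multiply-accumulates."""
    return depth / (n + m + depth - 2)


C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

In this model, `utilization(128, 128, 128)` is much lower than `utilization(128, 128, 4096)`: the longer the array keeps streaming before results are pulled out, the smaller the share of cycles lost to filling and draining.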
How chips are actually designed
Chip design uses Verilog, a hardware description language that is inherently parallel. The mechanics are similar to software (Git, CI, code review), but the workflow is more waterfall than modern software development.
The process: architects define the organization of the chip (how many cores, systolic arrays, vector units), then logic designers write Verilog, then design verification, then physical design.
Before writing any Verilog, architects spend a long time on a performance simulator (written in Python or similar) mapping target applications (Transformers of specific shapes) onto the proposed architecture. This is where most architecture work happens.
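A performance simulator of this kind can start out very simple. The sketch below is a roofline-style estimate for a single matrix multiplication, assuming the step takes as long as the slower of compute and memory traffic. The `Chip` class, its fields, and the example numbers are hypothetical and do not describe MatX's simulator or any real chip.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    peak_flops: float     # FLOP/s (hypothetical)
    mem_bandwidth: float  # bytes/s (hypothetical)


def matmul_time(chip, m, k, n, bytes_per_elem=2):
    """Estimate the time for an (m x k) @ (k x n) matmul on `chip`.

    Returns (seconds, limiting_resource). Assumes each operand and the
    result cross the memory interface exactly once.
    """
    flops = 2 * m * k * n  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    compute_s = flops / chip.peak_flops
    memory_s = bytes_moved / chip.mem_bandwidth
    if compute_s >= memory_s:
        return compute_s, "compute"
    return memory_s, "memory"


chip = Chip(peak_flops=1e15, mem_bandwidth=1e12)
# One token at a time against an 8192 x 8192 weight matrix.
single_time, single_bound = matmul_time(chip, 1, 8192, 8192)
# 4096 tokens at once against the same weight matrix.
batched_time, batched_bound = matmul_time(chip, 4096, 8192, 8192)
```

With these assumed numbers the single-token case is limited by memory bandwidth and the 4096-token case by compute. Sweeping shapes like this across candidate architectures is the kind of question such a simulator is meant to answer before any Verilog is written.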
EDA (Electronic Design Automation) tools from Synopsys and Cadence synthesize Verilog into logic gates, then into physical polygons (P-type semiconductor here, N-type there, polysilicon there), essentially a 3D layout of the chip.
Tape-out (sending the design to the fab) costs roughly $30 million. The ideal is that the first tape-out is the production version. About 50% of the time it works; the other 50% requires re-spinning some or all layers (metal layer fixes cost ~$100K; full re-spin costs another $30M).
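The figures above imply a range for the expected cost of getting to working silicon. The sketch below works through the arithmetic under two assumptions that are not in the source: at most one re-spin is ever needed, and the split between cheap metal-layer fixes and full re-spins is unknown, so it is left as a parameter.

```python
TAPE_OUT = 30_000_000     # first tape-out, from the text
METAL_FIX = 100_000       # metal-layer fix, from the text
FULL_RESPIN = 30_000_000  # full re-spin, from the text
P_FIRST_PASS = 0.5        # chance the first tape-out works, from the text


def expected_cost(p_metal_fix_given_failure):
    """Expected total cost, assuming at most one re-spin (an assumption).

    p_metal_fix_given_failure is the hypothetical share of failures that
    can be repaired with a metal-layer fix instead of a full re-spin.
    """
    respin = (p_metal_fix_given_failure * METAL_FIX
              + (1 - p_metal_fix_given_failure) * FULL_RESPIN)
    return TAPE_OUT + (1 - P_FIRST_PASS) * respin


best_case = expected_cost(1.0)   # every failure is a metal-layer fix
worst_case = expected_cost(0.0)  # every failure needs a full re-spin
```

Under these assumptions the expected cost lies between $30.05 million and $45 million, so whether a bug can be fixed in metal matters far more than the 50% failure rate alone suggests.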
Bugs fall into two classes: physical implementation errors (gates too close together, reliability issues) and logical specification errors (the design itself is wrong). Despite extensive verification, some bugs ship — similar to software companies shipping bugs to production.
MatX follows a tick-tock release model: even-numbered years for new transistor/memory/interconnect technology, odd-numbered years for architecture overhauls. This keeps different teams occupied and avoids massive-risk releases every two years.
The CUDA moat and why it matters less for AI
NVIDIA’s defensibility comes not just from chips but from CUDA — a mature software ecosystem refined over more than a decade. This matters enormously in markets with thousands of applications (like gaming), where each game must be programmed against CUDA.
But there are not thousands of LLMs. There are roughly five frontier labs, each with one main model. The economics are different: a frontier lab buys a $10 billion compute cluster and hires ~50 top engineers to write optimized software for that specific chip. This can easily double effective performance.
This means the “CUDA moat” is less relevant for AI chips. Frontier labs are already planning to substantially rewrite software for every new chip generation. MatX doesn’t need to build a broad software ecosystem — it just needs to be compelling enough for a lab to dedicate a team to optimizing for it.
TSMC’s durable dominance
TSMC is effectively the monopoly provider of leading-edge chip fabrication. Despite this, it doesn’t charge monopoly prices: a deliberate strategy, rooted in the industry’s cyclicality and TSMC’s cultural conservatism, that discourages competitors from entering.
Leading-edge nodes matter primarily for power efficiency (less so for area/density, which has stopped scaling as aggressively). AI chips and mobile phone chips benefit most from leading-edge nodes because they are extremely power-sensitive.
TSMC actively courts startup customers in order to diversify its own customer pool. MatX works with an ASIC vendor that handles the backend interface with TSMC.
The cost asymmetry is enormous: a fab costs ~$10 billion, while chip development (tape-out) costs ~$30 million. This 300x difference is a major barrier to entry.
Why labs don’t all design their own chips
Google does design its own chips (TPUs). OpenAI is starting. But there’s a trade-off: vertical integration (designing for your exact model) vs. concentration of R&D (five labs buying from one vendor means 5x the R&D budget for that chip).
The multi-year delay from chip design to production means labs can’t design for their current model — they must predict what their model will look like in 2–3 years and hedge. A specialized vendor like MatX can concentrate R&D on the chip while the lab focuses on the model.
Space data centers
Elon Musk has proposed data centers in space. The two main objections are cooling and repair.
Repair: In a cluster of 100,000 chips, some chips are always down. NVIDIA builds in redundancy: 8 spare chips per rack of 64. This works with a ~10% reliability tax if someone can service failed parts within a day. If servicing is impossible (space), you might need 100% redundancy (deploy twice as many chips), making it a trade-off between chip capital cost and power savings.
Cooling: The rack-level challenge is getting heat out quickly. Getting heat out of a spaceship is harder and may be the more fundamental objection.
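The redundancy arithmetic in the repair objection can be sketched as follows. The reading that a rack holds 64 active chips plus 8 spares is an assumption (the text could also mean 64 chips including spares), and the function names are hypothetical.

```python
import math


def deployed_chips(needed_active, active_per_rack, spares_per_rack):
    """Total chips to deploy so that `needed_active` chips can be in service."""
    racks = math.ceil(needed_active / active_per_rack)
    return racks * (active_per_rack + spares_per_rack)


def overhead(active_per_rack, spares_per_rack):
    """Extra chips bought per active chip."""
    return spares_per_rack / active_per_rack


serviced = deployed_chips(100_000, 64, 8)     # spares replaced within a day
unserviced = deployed_chips(100_000, 64, 64)  # 100% redundancy, no servicing
```

Under this reading, 8 spares per 64 active chips is a 12.5% overhead, in the neighborhood of the ~10% tax quoted above, while full redundancy roughly doubles the chip count for the same usable capacity.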
AI predictions for 2026 and beyond
AI for chip design: Current models are excellent at Rust and Python (lots of RL training data) but weak at Verilog and essentially nonexistent at “write me a chip architecture description.” The labs are broadening RL to fill gaps between domains. MatX would love a custom model for chip design but labs prefer to fold improvements into their mainstream models rather than license proprietary versions.
Recursive self-improvement (weak version): AI-assisted chip design is already happening. Writing Verilog, running tests, and CI are big fractions of chip development time (9–15 months) and are well-suited to AI assistance. Physical design (turning Verilog into gates and polygons) involves graphical/layout work and is harder to compress, so it remains the bottleneck on the way to the goal of taping out a chip in one month.
Context length: Context size has struggled to grow because every generated token must read through all previous tokens, and memory bandwidth is constraining. The most effective current solution is application-level compaction — when you hit the context limit, have the model summarize/compact the conversation (what OpenClaw does with markdown files). This is primitive but controllable — you can iterate on compaction prompts in seconds vs. months to train a new model architecture.
Parameter count vs. context: Parameter count should grow much faster than context length due to underlying physics of what’s available. This would be a reacceleration after a recent leveling off focused on RL improvements.
MatX timeline: Tape-out in under a year, chips available end of 2027. Early users will see the impact in A/B tests.
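The application-level compaction described under context length above can be sketched in a few lines. This is a generic illustration, not how OpenClaw or any particular product implements it: `count_tokens` is a crude whitespace stand-in for a real tokenizer, and `summarize` is a hypothetical callback that would normally call a model with a compaction prompt.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def compact_if_needed(messages, limit, summarize, keep_recent=2):
    """Replace older messages with a summary once the history exceeds `limit`.

    `summarize` is a hypothetical callable (str -> str). The result is not
    guaranteed to fit under `limit`; a real system would check again.
    """
    total = sum(count_tokens(m) for m in messages)
    if total <= limit or len(messages) <= keep_recent:
        return messages
    split = len(messages) - keep_recent
    old, recent = messages[:split], messages[split:]
    summary = summarize("\n".join(old))
    return ["[summary] " + summary] + recent


# Stub summarizer for demonstration: keep the first three words.
def first_three_words(text):
    return " ".join(text.split()[:3])


history = ["one two three four five", "six seven eight", "nine ten", "eleven"]
compacted = compact_if_needed(history, limit=6, summarize=first_three_words)
```

The appeal is that everything interesting lives in the `summarize` prompt and the `keep_recent` policy, both of which can be changed and re-tested in seconds.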
MatX culture and team
MatX is about 100 people with hardware, software, and ML teams. The ML team is unusual — they do real ML research (training small LLMs from scratch) rather than just kernel optimization.
The pitch to join MatX: if you love optimization — fitting something into the smallest possible budget, whether software, hardware, or math — it’s an exciting place. Hardware companies offer an unusually broad range of skills and problems on one team.
Impact: a 20% higher throughput chip means 20% more AI happening in the world, either enabling more applications or smarter models.
Rust vs. Go and Haskell
Reiner previously loved Haskell (principled, functional) but now prefers Rust. Rust offers functional programming features (traits/type classes, rich type system) while allowing direct memory mutation when needed.
Rust is particularly valuable for hardware-adjacent software because you care about every single bit: integers of 17, 18, or 19 bits are natural to model in Rust’s type system. MatX has built an ecosystem of rich hardware data types in Rust.
Rust vs. Go: Rust’s real differentiator isn’t just “safe without garbage collection” — it’s the richer type system. Go’s GC adds header overhead to every allocation, which matters when you’re designing data structures to use exact amounts of memory.
Cuckoo hashing and optimization as a hobby
Reiner’s optimization hobby extends beyond chips. He studied Google’s internal implementations of memory allocators, mutexes, and HashMaps, benchmarked them, and tried to make them faster by examining assembly and eliminating unnecessary memory moves.
He explored hash table optimization through the question: what would the optimal CPU for hash tables look like? This led to combining SIMD vector instructions with cuckoo hashing — a technique from the literature that hasn’t been practical because traditional cuckoo hashing doesn’t use vector instructions. Combining them yields better performance than standard implementations, even on existing Intel CPUs.
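Cuckoo hashing works by hashing each key to two candidate buckets and inserting it into the less-full one; if both are full, an existing entry is evicted and moved to its own alternate bucket. Adding SIMD allows scanning multiple slots or buckets at once. This could benefit any hash-table-intensive workload (like JavaScript engines).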
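A minimal sketch of bucketized cuckoo hashing, to make the mechanism concrete. This is a generic textbook-style version, not Reiner's implementation: the class name, the bucket and slot sizes, the eviction policy, and the grow-on-failure strategy are all assumptions. Pure Python has no SIMD, so the inner loop over a bucket's slots stands in for what a vector compare would do in a single instruction.

```python
import random


class CuckooTable:
    def __init__(self, num_buckets=16, slots=4, max_kicks=32, seed=0):
        self.num_buckets = num_buckets
        self.slots = slots
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    def _candidates(self, key):
        # Two hash functions, derived here by salting Python's built-in hash.
        return (hash(("a", key)) % self.num_buckets,
                hash(("b", key)) % self.num_buckets)

    def get(self, key, default=None):
        for b in self._candidates(key):
            for k, v in self.buckets[b]:  # slot scan: the SIMD-friendly part
                if k == key:
                    return v
        return default

    def put(self, key, value):
        for b in self._candidates(key):  # overwrite if already present
            bucket = self.buckets[b]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)
                    return
        item = (key, value)
        for _ in range(self.max_kicks):
            b1, b2 = self._candidates(item[0])
            less_full = b1 if len(self.buckets[b1]) <= len(self.buckets[b2]) else b2
            if len(self.buckets[less_full]) < self.slots:
                self.buckets[less_full].append(item)
                return
            # Both buckets are full: swap with a random resident and retry
            # with the evicted item.
            victims = self.buckets[self.rng.choice((b1, b2))]
            idx = self.rng.randrange(self.slots)
            item, victims[idx] = victims[idx], item
        self._grow(item)  # too many evictions: double the table and rehash

    def _grow(self, pending):
        items = [kv for bucket in self.buckets for kv in bucket] + [pending]
        self.num_buckets *= 2
        self.buckets = [[] for _ in range(self.num_buckets)]
        for k, v in items:
            self.put(k, v)


table = CuckooTable(num_buckets=4, slots=2)
for i in range(200):
    table.put(i, i * i)
```

A lookup touches at most two buckets regardless of table size, which is what makes the scheme attractive for vectorization: the work per lookup is a small, fixed number of wide comparisons.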
Unexplored model architectures
There is still room for novel model architectures. As hardware changes, the optimal model shape should change too. Current constraints that could be lifted:
Pre-fill and decode use the same model, but they are fundamentally different workloads (pre-fill is parallel, decode is sequential). Using different models for each could be more efficient.
Training and serving use the same model, but training is compute-intensive while serving is memory-bandwidth-intensive. A model that does more computation at inference time to better use available resources could be more efficient.
These are “off the wall” but within the Transformer family — relaxing artificial constraints rather than inventing entirely new paradigms.