Ashutosh Agrawal was a software architect and principal engineer at JioCinema (now Disney+ Hotstar), where he helped architect the system that set a world record in May 2023 with 32 million concurrent live streams during the Indian Premier League (IPL) cricket finale. The episode covers how large-scale live streaming works behind the scenes, the engineering trade-offs involved, and the operational discipline required to sustain such massive scale over a 70-day tournament.
How live streaming works at an architectural level
Source feed originates at the venue: Multiple cameras at the stadium feed into a Production Control Room (PCR), where a director selects angles and produces a single broadcast feed, similar to movie production.
Contribution encoder compresses the feed: The raw source feed (which can be hundreds of Mbps) is compressed down to a standard profile (e.g., 40 Mbps) for transport to the cloud over private links, not the public internet.
Distribution encoder creates multiple formats: In the cloud, the feed is encoded into HLS, DASH, or other formats tailored to different devices (mobile, TV, etc.), each with different bitrate and resolution combinations.
Hundreds of stream variants are produced: With 13+ language feeds, multiple platform targets (Apple, Android, TV), and various resolution/bitrate combinations, the system generated 500+ output stream variants to the CDN.
An orchestrator manages the workflow: This internal engineering system controls contribution encoding, cloud infrastructure, distribution encoding, and CDN endpoints. It generates playback URLs and pushes them to the Content Management System (CMS) that users interact with.
Playback involves a separate, more complex system: When a user hits play, a playback system handles authorization, DRM encryption, and CMS lookups before returning an encrypted URL the client uses to stream.
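The multiplication behind the variant count described above can be sketched in a few lines. The platform names come from the text; the language codes and the six-rung bitrate ladder are placeholders chosen for illustration, not the production configuration.

```python
from itertools import product

# Illustrative assumptions only: real language codes and ladders are not given in the source.
LANGUAGES = [f"lang{i:02d}" for i in range(1, 14)]   # 13 language feeds
PLATFORMS = ["apple", "android", "tv"]               # platform targets named in the text
LADDER = [                                           # hypothetical (resolution, kbps) rungs
    ("240p", 300), ("360p", 600), ("480p", 1200),
    ("720p", 2500), ("1080p", 5000), ("2160p", 12000),
]

def build_variants(languages, platforms, ladder):
    """Return one output stream description per language/platform/rung combination."""
    return [
        {"language": lang, "platform": plat, "resolution": res, "bitrate_kbps": kbps}
        for lang, plat, (res, kbps) in product(languages, platforms, ladder)
    ]

variants = build_variants(LANGUAGES, PLATFORMS, LADDER)
print(len(variants), "variants")
```

Because the count is a product of the three dimensions, adding a single rung to the ladder adds one stream per language and platform pair, which is how the total climbs past 500.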
How HLS and CDNs enable live streaming at scale
HLS uses a manifest-based pull model: A master manifest (.m3u8) points to child manifests for each resolution layer (240p, 480p, 720p, etc.). Each child manifest lists video segments (typically 4–6 seconds each). The player polls the child manifest every segment duration to discover new segments and downloads them from the CDN.
CDNs operate on a 4–6 second granularity, not milliseconds: Because the unit of delivery is a multi-second segment, CDNs work well for live streaming even though they are caching infrastructure. The engineering challenge is tuning TTLs: too short means constant cache misses, too long risks serving stale data.
Latency is inherent and intentional: Every encoding stage adds latency because compression works better with a look-ahead period (Group of Pictures). Additionally, clients maintain a buffer of 5–10 seconds behind the live point to ensure smooth playback and avoid rebuffering. This explains the familiar experience of hearing neighbors cheer before seeing the action.
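The manifest-based pull model described above can be illustrated with a small parser. This is a minimal sketch, not a full HLS client: the two playlists are made-up examples, the function names are hypothetical, and only the BANDWIDTH attribute and the EXT-X-MEDIA-SEQUENCE tag are read.

```python
import re

# Made-up playlists for illustration; URIs and bitrates are assumptions.
MASTER = """#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=426x240
240p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1200000,RESOLUTION=854x480
480p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=1280x720
720p/index.m3u8
"""

CHILD = """#EXTM3U
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:1001
#EXTINF:4.0,
seg1001.ts
#EXTINF:4.0,
seg1002.ts
#EXTINF:4.0,
seg1003.ts
"""

def parse_master(text):
    """Return [(bandwidth_bps, child_manifest_uri)] from a master playlist."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    variants = []
    for i, line in enumerate(lines[:-1]):
        if line.startswith("#EXT-X-STREAM-INF:"):
            attrs = line.split(":", 1)[1]
            match = re.search(r"(?:^|,)BANDWIDTH=(\d+)", attrs)
            if match:
                # The URI of the child manifest is the line after the tag.
                variants.append((int(match.group(1)), lines[i + 1]))
    return variants

def parse_child(text):
    """Return (first_sequence_number, [segment_uri, ...]) from a child playlist."""
    sequence, segments = 0, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#EXT-X-MEDIA-SEQUENCE:"):
            sequence = int(line.split(":", 1)[1])
        elif line and not line.startswith("#"):
            segments.append(line)
    return sequence, segments

def new_segments(last_seen_sequence, child_text):
    """Segments the player has not downloaded yet, as (sequence, uri) pairs."""
    first, segments = parse_child(child_text)
    return [(first + i, uri) for i, uri in enumerate(segments)
            if first + i > last_seen_sequence]

print(parse_master(MASTER))
print(new_segments(1001, CHILD))
```

A real player would call something like new_segments once per segment duration after re-fetching the child manifest from the CDN, then download whatever it returns.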
Trade-offs between latency, smoothness, and scale
Shorter segments increase request volume: Reducing segment duration from 4 to 2 seconds doubles the number of manifest calls and CDN requests, even though the same amount of video data is transferred. This directly increases compute load on CDN infrastructure.
There is a fundamental trade-off between liveness and reliability: Keeping users closer to the live point risks rebuffering on a complex distributed system; adding buffer improves smoothness but increases latency. The team found a sweet spot that balanced these factors.
Internet streaming is fundamentally different from TV/broadcast: Traditional TV pushes a real-time signal—if you miss a packet, you miss it. Internet streaming is a pull-based, buffered, adaptive experience where the client manages quality, buffering, and network variability independently.
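The effect of segment duration on request volume is simple arithmetic, sketched below. The model assumes each player makes one manifest poll and one segment download per segment duration, and it ignores separate audio segments, retries, and ABR switches, so treat the absolute numbers as rough.

```python
def cdn_requests_per_second(concurrent_viewers, segment_seconds):
    """Approximate CDN request rate under a simplified one-poll, one-download model."""
    requests_per_viewer_per_second = 2 / segment_seconds  # manifest poll + segment fetch
    return concurrent_viewers * requests_per_viewer_per_second

viewers = 32_000_000
for seconds in (6, 4, 2):
    rate = cdn_requests_per_second(viewers, seconds)
    print(f"{seconds}s segments: {rate:,.0f} requests/second")
```

Halving the segment duration doubles the request rate while the bytes delivered stay the same, which is why the cost shows up as CDN compute rather than bandwidth.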
Adaptive bitrate streaming
The client player drives adaptive bitrate selection: The player measures download speed by timing how long each segment takes to download. If bandwidth drops, it switches to a lower-resolution layer; if bandwidth improves, it switches up.
Server-side controls complement client-side logic: The server can limit available layers to degrade experience under load. Parameters like switching thresholds, starting bitrate, and buffer thresholds are fine-tuned through engineering.
Mobile tower congestion adds a third variable: Even if the streaming service and client are working perfectly, mobile towers have finite capacity. As more users connect to the same tower, natural throttling occurs, which the streaming service cannot control or directly observe.
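The client-driven selection described above, together with a server-side cap on available layers, might look like the sketch below. The bitrate ladder, the 0.8 safety factor, and the function names are assumptions for illustration; production players use more elaborate logic that also considers buffer level and switching thresholds.

```python
LADDER_KBPS = [400, 1200, 3000, 6000]  # hypothetical bitrate ladder

def measured_throughput_kbps(segment_bytes, download_seconds):
    """Estimate bandwidth from how long one segment took to download."""
    return segment_bytes * 8 / 1000 / download_seconds

def choose_layer(throughput_kbps, ladder=LADDER_KBPS, safety=0.8, server_cap_kbps=None):
    """Pick the highest layer that fits within a safety margin of measured bandwidth.

    server_cap_kbps models the server removing higher layers under load.
    """
    allowed = [b for b in ladder if server_cap_kbps is None or b <= server_cap_kbps]
    if not allowed:
        allowed = [min(ladder)]
    budget = throughput_kbps * safety
    affordable = [b for b in allowed if b <= budget]
    return max(affordable) if affordable else min(allowed)

fast = measured_throughput_kbps(segment_bytes=1_500_000, download_seconds=2.0)
slow = measured_throughput_kbps(segment_bytes=1_500_000, download_seconds=10.0)
print(choose_layer(fast), choose_layer(slow), choose_layer(fast, server_cap_kbps=1200))
```

Tower congestion appears to this logic only as a lower measured throughput, which matches the point that the service cannot observe it directly.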
Monitoring, metrics, and observability
Leading indicators vs. trailing indicators: Leading indicators (e.g., buffer time per minute, play failure rate) are prioritized for real-time alerting and processed within 30–60 seconds. Trailing indicators provide detailed post-incident analysis via dashboards.
Client-side metrics must be uploaded, creating a scaling trade-off: With 30+ million concurrent users, uploading metrics every second is infeasible. The team uses sampling and adjusts collection frequency dynamically—more frequent during low load, throttled during peak.
A degradation framework governs data collection: During high-traffic events, non-critical data collection is reduced or sampled to prioritize playback and content delivery systems. This is planned in advance through tiered service priorities.
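One way to picture the dynamic collection frequency and sampling described above is the sketch below. Every number here (intervals, peak concurrency, sample rates) and both function names are hypothetical; the source does not describe the actual mechanism.

```python
import zlib

def upload_interval_seconds(concurrency, base_interval=10, max_interval=120,
                            peak_concurrency=30_000_000):
    """Stretch the client metrics upload interval linearly as load approaches peak."""
    load = min(concurrency / peak_concurrency, 1.0)
    return base_interval + (max_interval - base_interval) * load

def is_sampled(device_id, sample_rate):
    """Deterministically keep roughly sample_rate of devices using a stable hash."""
    bucket = zlib.crc32(device_id.encode("utf-8")) % 10_000
    return bucket < sample_rate * 10_000

for viewers in (1_000_000, 15_000_000, 30_000_000):
    print(viewers, round(upload_interval_seconds(viewers), 1))

devices = [f"device-{i}" for i in range(10_000)]
kept = sum(is_sampled(d, 0.05) for d in devices)
print(f"{kept} of {len(devices)} devices report non-critical metrics")
```

Hashing the device identifier, rather than drawing a random number on each upload, keeps the same devices in the sample for the whole event, so their time series stay continuous.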
Capacity planning as a core engineering discipline
Capacity planning starts a year in advance: Physical infrastructure (data center space, power, network links, server procurement and import) takes months to scale. Providers need advance notice, especially for year-on-year growth.
Planning is pessimistic, not optimistic: The team models traffic based on previous years plus platform growth, then plans for a worst-case number. They work backward from the target concurrency to determine requirements for every system tier.
Resources planned include compute, RAM, disk, and network: Network is particularly complex for streaming—video networks are designed for high cache offload, while API networks carry mixed traffic with security layers, firewalls, and PII data, requiring different capacity treatment.
Cloud and CDN capacity is finite: Despite the perception of infinite cloud resources, at JioCinema’s scale, the team hits real limits. Adding capacity requires providers to acquire real estate, procure servers, and onboard them—processes that cannot happen overnight.
Geographic constraints matter: CDN capacity is hierarchical (edge → regional → backbone). Overflow traffic from one city (e.g., Bangalore) cannot simply be rerouted to another (e.g., Delhi) without sufficient backbone capacity, which is also finite.
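Working backward from a target concurrency to network requirements can be sketched as below. The previous peak, growth factor, margin, average bitrate, and cache offload ratio are all placeholder inputs, not figures from the episode.

```python
def planned_concurrency(previous_peak, growth_factor, worst_case_margin):
    """Pessimistic target: last year's peak scaled by growth, then by a safety margin."""
    return previous_peak * growth_factor * worst_case_margin

def edge_egress_tbps(concurrency, avg_bitrate_mbps):
    """Total video traffic leaving CDN edges, in terabits per second."""
    return concurrency * avg_bitrate_mbps / 1_000_000

def origin_egress_tbps(edge_tbps, cache_offload):
    """Traffic that misses the cache and must travel back toward the origin."""
    return edge_tbps * (1.0 - cache_offload)

# Hypothetical planning inputs.
target = planned_concurrency(previous_peak=25_000_000, growth_factor=1.2,
                             worst_case_margin=1.25)
edge = edge_egress_tbps(target, avg_bitrate_mbps=2.0)
origin = origin_egress_tbps(edge, cache_offload=0.99)
print(f"plan for {target:,.0f} viewers: {edge:.1f} Tbps at the edge, "
      f"{origin:.2f} Tbps toward origin")
```

The same backward calculation would be repeated per city and per CDN tier, since edge, regional, and backbone capacity are each finite.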
APAC-specific engineering challenges
Mobile-first audience: India largely skipped the desktop/laptop era—users went straight to mobile devices. This means the entire experience must be optimized for mobile constraints.
Mobility and tower switching: Many users watch while commuting (taxi drivers, office workers heading home). Devices switch between 5G, 4G, and 3G towers mid-stream, causing network quality fluctuations.
Battery consumption is a first-order concern: Evening matches mean users haven’t charged their phones all day. The team considers codec complexity (H.264 vs. H.265), download frequency, screen brightness, and color intensity when designing the client experience. These decisions are made at feature-build time, not at runtime, to minimize variables during live events.
Scaling strategy: concurrency as the golden metric
Auto-scaling doesn’t work for live events: Standard auto-scaling responds to current traffic metrics (RPS, CPU) with cooldown periods. During an innings break, millions of users drop off and then return simultaneously when play resumes. Auto-scaling would scale down during the break and be unable to handle the sudden surge back.
Custom scaling systems use concurrency as the single scaling metric: The entire organization aligns on concurrent user count as the golden metric. Models translate concurrency into expected requests per service based on user journey analysis (e.g., X% of users go to the playback page, Y% trigger certain APIs).
User journeys create unexpected traffic patterns: When a key batsman gets out, many users press the back button, suddenly spiking homepage API traffic that was previously quiet. These behavioral patterns must be modeled in advance.
Concurrency accuracy requires triangulation: Since the concurrency number drives all scaling, it must be validated through multiple proxy signals. If it’s wrong, every system scales incorrectly.
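Using concurrency as the single scaling input, and cross-checking it against several signals, could look like the following sketch. The service model, the per-replica throughput figures, the signal names, and the 15% tolerance are invented for illustration.

```python
import math
from statistics import median

# Hypothetical model: service -> (fraction of concurrent users calling it,
#                                 requests per such user per second,
#                                 requests per second one replica sustains)
MODEL = {
    "playback": (1.00, 0.02, 500),
    "homepage": (0.10, 0.05, 800),
}

def replicas_needed(concurrency, model=MODEL, headroom=1.3):
    """Translate a concurrency figure into a replica count per service."""
    plan = {}
    for service, (user_fraction, rps_per_user, rps_per_replica) in model.items():
        expected_rps = concurrency * user_fraction * rps_per_user
        plan[service] = math.ceil(expected_rps * headroom / rps_per_replica)
    return plan

def triangulated_concurrency(estimates, tolerance=0.15):
    """Take the median of independent estimates and flag signals that disagree with it."""
    mid = median(estimates.values())
    outliers = [name for name, value in estimates.items()
                if abs(value - mid) > tolerance * mid]
    return mid, outliers

signals = {  # hypothetical proxy signals for the same moment
    "player_heartbeats": 20_100_000,
    "cdn_sessions": 19_800_000,
    "analytics_events": 14_000_000,
}
concurrency, suspicious = triangulated_concurrency(signals)
print(concurrency, suspicious)
print(replicas_needed(concurrency))
```

Because scaling is keyed to concurrency rather than to current RPS or CPU, capacity stays in place during an innings break, when request rates fall but viewers are about to return.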
Operational discipline: game day drills
Game day is a full simulation of a live event: The team runs synthetic traffic (using tools like Flood.io) that mimics real match conditions—including the full operating protocol with timelines and checklists. Teams don’t know the traffic pattern in advance, just as in production.
The drill starts whether teams are ready or not: Just as a real match starts at 7:30 PM regardless of preparation, the simulation starts on schedule. This forces teams to build systems that can self-report health without human intervention.
These drills surface hidden failures: Ashutosh describes breaking systems that teams were confident were ready, exposing database connection pool limits, queue depth issues, and overlooked bottlenecks that only appear under realistic load.
Key principles for engineers building large-scale systems
Assume everything will fail (Murphy’s Law): Overconfidence is the biggest risk. Every configuration, every connection pool size, every queue depth must be questioned and tested.
Measure deeply and understand internals: Don’t rely on load balancer-level metrics alone. Understand queuing behavior, database connection limits, and end-to-end latency including time spent in queues. Know how your tools work internally.
Avoid over-reliance on APM tools: Ashutosh deliberately avoids APM (Application Performance Management) tools because he believes they make engineers lazy. Not having them forces deeper understanding of every corner of the system and more careful measurement.
Learn from the IPL’s 70-day marathon: The record-setting finale was actually calmer than the opening week. The real challenge is sustaining high scale day after day, learning traffic patterns, and improving operational protocols year over year—eventually automating enough that 20–30 million concurrent streams can run without manual intervention.
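As one example of questioning a connection pool size, Little's Law (items in flight = arrival rate × time in system) gives a quick sanity check. This technique is not mentioned in the episode; it is offered here as one assumption-laden way to apply the principle, and the traffic figures are made up.

```python
def connections_in_flight(requests_per_second, avg_query_seconds):
    """Little's Law: average number of busy connections at a steady request rate."""
    return requests_per_second * avg_query_seconds

def pool_is_sufficient(pool_size, requests_per_second, avg_query_seconds, headroom=1.5):
    """True if the pool covers the expected busy connections plus headroom."""
    needed = connections_in_flight(requests_per_second, avg_query_seconds) * headroom
    return pool_size >= needed

# Hypothetical service: 2,000 queries/second with a 50 ms average query time.
busy = connections_in_flight(2000, 0.05)
print(f"about {busy:.0f} connections busy on average")
print("pool of 80 ok?", pool_is_sufficient(80, 2000, 0.05))
print("pool of 200 ok?", pool_is_sufficient(200, 2000, 0.05))
```

If query latency doubles under load, the number of busy connections doubles too, so a pool that looks adequate at normal latency can be exhausted exactly when the system is stressed. Requests then wait in a queue, which is why end-to-end latency has to include queue time.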
Rapid fire
Most productive language: Java for building systems and services.
Favorite language: Ruby for scripting—described as “like writing English.”
Stays current via: Hacker News and LinkedIn (curated through his network of connections).