Operating a real-time bidding engine at 1.4 billion daily auctions is not a software problem. It is a distributed systems problem with a 100-millisecond window to get the answer right. This is a detailed look at the infrastructure decisions, trade-offs, and ML optimisations that power our auction engine at sub-18ms p99 latency globally.
In programmatic advertising, the entire value chain compresses into a narrow timing window. When a user's browser loads a page, the publisher's ad server issues a bid request. DSPs typically have 80 to 150 milliseconds (depending on publisher timeout settings) to return a bid. Miss that window and you have lost the auction regardless of how high your bid would have been.
At the infrastructure level, that 80 to 150ms contains multiple sequential operations:

- network transit from publisher to DSP endpoint (10 to 30ms, depending on geography)
- bid request parsing and user lookup (5 to 15ms)
- ML model inference for bid price prediction (the variable we control most)
- bid response construction (2 to 5ms)
- network transit back to the publisher (10 to 30ms)
The math is unforgiving. At global scale, geography alone can consume 50 to 60ms of your budget before your application even sees the request. That leaves 20 to 40ms for everything else. Every millisecond of model inference latency is a millisecond you cannot spend on other computation, and every millisecond of excess latency translates directly into lost auctions.
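To make that arithmetic concrete, here is a minimal sketch of how the inference budget falls out of the fixed costs. The numbers are illustrative values drawn from the ranges above, not measurements:

```python
def inference_budget_ms(timeout_ms: int, transit_ms: int, overhead_ms: int) -> int:
    """Milliseconds left for model inference after fixed costs.

    transit_ms: round-trip network transit (publisher -> DSP -> publisher)
    overhead_ms: parsing, user lookup, and response construction
    """
    return timeout_ms - transit_ms - overhead_ms

# Generous publisher timeout, favourable geography: plenty of room.
roomy = inference_budget_ms(150, 20, 7)   # 123 ms to spare
# Tight timeout, worst-case geography: almost nothing left.
tight = inference_budget_ms(80, 60, 15)   # 5 ms for inference
```

The second case is why a 6ms inference budget is not a luxury but a hard requirement for the tightest publisher windows.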
The single most impactful architectural decision we made was deploying bidding nodes at the network edge, co-located with or adjacent to major exchange points of presence. We operate 28 bidding endpoints across North America, Europe, and Asia Pacific, selected specifically for proximity to the major SSPs' data centres.
This design reduces network transit latency by 15 to 25ms for the majority of our bid volume. At scale, that reduction directly translates to higher win rates (because we are bidding within windows rather than at their edge) and lower timeout rates (because our responses arrive reliably within the publisher's patience threshold).
User lookups, frequency caps, campaign budget availability, and targeting segment membership must all be evaluated for every bid request. At 16,000 queries per second per node, any latency in these lookups compounds catastrophically. We store all real-time state in distributed in-memory data structures (built on a modified Redis cluster architecture) with write-through persistence to backing storage for durability.
Cold lookups (cache misses) are handled with a probabilistic miss strategy: rather than waiting for a backing store lookup and risking timeout, we bid conservatively on cache-miss requests using population-level defaults and reconcile the accuracy cost against the timeout cost. At our scale, the population defaults are accurate enough that this trade-off is net positive.
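The miss strategy amounts to a lookup that can never block in-path. A minimal sketch, with hypothetical default values (real defaults come from offline population aggregates):

```python
# Hypothetical population-level defaults; real values come from offline aggregates.
POPULATION_DEFAULTS = {"expected_ctr": 0.002, "segments": frozenset()}

def user_state(cache: dict, user_id: str) -> dict:
    """In-path user lookup with a probabilistic-miss fallback.

    On a cache miss we never wait on the backing store: we bid with
    population-level defaults now, and an async warmer repopulates the
    cache so subsequent requests for this user hit.
    """
    return cache.get(user_id, POPULATION_DEFAULTS)
```

The async cache warmer is elided here; the point is that the request path contains exactly one in-memory read and no fallback I/O.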
Our bid price prediction model sits in the critical path of every auction. It must return a recommendation in under 6ms at p99 to stay within our latency budget. This constraint shapes every model architecture decision.
Deep learning models with the highest prediction accuracy are too slow for auction inference. A 3-layer neural network with 256-unit hidden layers requires 12 to 18ms for inference on our hardware, which is more than our entire latency budget for this stage. Instead, we use gradient-boosted decision trees (specifically a highly optimised XGBoost implementation with custom CUDA kernels for GPU inference) that produce predictions in 2 to 4ms at p99.
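The reason tree inference is so cheap: each prediction is a short chain of feature comparisons rather than a sequence of dense matrix multiplies. A toy pure-Python sketch of ensemble scoring (illustrating the mechanism, not our production CUDA path):

```python
def score_tree(node, x):
    """Walk one decision tree.

    Internal node: (feature_index, threshold, left_child, right_child).
    Leaf: a float. A depth-d tree costs at most d comparisons per prediction.
    """
    while isinstance(node, tuple):
        feat, thresh, left, right = node
        node = left if x[feat] < thresh else right
    return node

def score_ensemble(trees, x, base_score=0.0):
    """GBT prediction: base score plus the sum of every tree's leaf value."""
    return base_score + sum(score_tree(tree, x) for tree in trees)
```

A few hundred shallow trees amount to a few thousand branch instructions over a feature vector, which is why this path fits comfortably in a 2 to 4ms envelope.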
The accuracy trade-off is real: our GBT model achieves slightly higher log-loss than our best neural architecture in offline evaluation. In live A/B testing, however, the latency improvement from faster inference produces better economic outcomes than the accuracy improvement from slower models. Winning more auctions at slightly lower bid accuracy outperforms losing auctions at perfect bid accuracy.
Feature computation adds to inference latency. We precompute all features that can be computed ahead of time (user segment membership, historical win rates by publisher, dayparting weights) and update them asynchronously rather than at request time. Only features that require the current request context (creative dimensions, floor price, competing bid count) are computed in-path.
This precomputation strategy cuts in-path feature computation from the 8 to 11ms range to under 2ms per request, at the cost of slight staleness: precomputed features update every 30 to 60 seconds rather than per-request. For the features involved, that staleness introduces minimal accuracy degradation.
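The split looks roughly like the following sketch. The class, feature names, and request fields are hypothetical stand-ins for illustration:

```python
class PrecomputedFeatures:
    """Features refreshed asynchronously, read with O(1) lookups in-path.

    refresh() runs on a 30-60 second cadence outside the request path,
    so reads tolerate slight staleness by design.
    """
    def __init__(self):
        self._by_user = {}

    def refresh(self, user_id: str, features: dict) -> None:
        self._by_user[user_id] = features  # async updater only, never in-path

    def get(self, user_id: str) -> dict:
        return self._by_user.get(user_id, {})

def feature_vector(store: PrecomputedFeatures, user_id: str, request: dict) -> dict:
    """Merge stale-tolerant precomputed features with request-context features."""
    features = dict(store.get(user_id))
    # Only context-dependent features are computed in-path:
    features["floor_price"] = request["floor"]
    features["creative_width"] = request["width"]
    return features
```

The design choice is that staleness is paid once, asynchronously, instead of latency being paid on every one of 16,000 requests per second per node.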
An auction engine that achieves 18ms p99 latency 99.9 percent of the time but spikes to 200ms during any partial failure is not a reliable system. We design for graceful degradation: at every layer of the stack, failures produce conservative fallback behaviour rather than cascading delays.
Specifically: if the ML inference service is degraded, bid requests fall back to rule-based pricing in under 1ms. If a user lookup exceeds 5ms, we substitute population-level defaults. If an edge node is experiencing elevated latency, traffic reroutes to the nearest healthy node with temporary quality degradation rather than elevated timeout rates. The priority ordering is: respond on time with lower accuracy, rather than respond accurately with higher latency.
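The first rung of that ladder can be sketched as a single in-path function. The 10 percent floor markup is an illustrative rule, not our production pricing logic:

```python
def price_bid(request: dict, ml_predict=None) -> float:
    """Graceful degradation: always return a price, never stall the response.

    Tier 1: ML inference, when the service is healthy.
    Tier 2: rule-based pricing, a conservative markup over the floor.
    """
    if ml_predict is not None:
        try:
            return ml_predict(request)
        except Exception:
            pass  # degraded inference: fall through rather than retry in-path
    return round(request["floor"] * 1.1, 4)  # illustrative fallback rule
```

The crucial property is that the failure path is cheaper than the success path: an exception costs microseconds, while a retry against a degraded service costs the auction.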
Two improvements are in active development for 2026. First, model distillation: training a smaller "student" model that approximates the outputs of our larger, more accurate "teacher" model with 40 to 50 percent lower inference latency. Second, speculative execution: pre-computing likely bid responses for high-frequency publisher/format combinations before the request arrives, reducing in-path computation for predictable traffic patterns. Both target moving our p99 latency below 12ms, which would allow us to participate in publisher timeout windows we currently cannot consistently serve.
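In-path, speculative execution reduces to a lookup keyed on the predictable parts of the request. A hypothetical sketch (class and key shape are our illustration, not a description of the system in development):

```python
class SpeculativeResponses:
    """Pre-built bid responses for high-frequency (publisher, format) pairs.

    A background job refreshes entries ahead of traffic; in-path, a
    cache hit skips most per-request computation, and a miss falls
    back to the full bidding path.
    """
    def __init__(self):
        self._responses = {}

    def precompute(self, publisher: str, fmt: str, response: dict) -> None:
        self._responses[(publisher, fmt)] = response

    def lookup(self, publisher: str, fmt: str):
        return self._responses.get((publisher, fmt))  # None => full bid path
```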