Routing Inference Requests
The previous post went down the stack into the GPU to explain why inference serving is different from traditional web services. The key takeaway: prefix caching lets the GPU skip redundant work, but only if requests with the same prefix land on the same server. The cache is local to each GPU. Routing determines whether you get the reuse or pay the full cost again.
I wanted to see the tradeoff in practice, so I built a router, pointed it at two GPU backends across AWS and GCP, and ran experiments. The short version: cache affinity improves latency, load balancing improves throughput, and a single parameter controls the balance between them.
The setup
I wanted to test cross-cloud routing with different GPU tiers and cost profiles, so I set up two backends running Qwen 1.5B with vLLM. Qwen 1.5B is small enough to fit on both a T4 and an L4 while still having meaningful prefix caching behavior.
- EKS in us-east-1: g4dn.xlarge (T4, 16 GB VRAM)
- GKE in us-east1: g2-standard-4 (L4, 24 GB VRAM)
A FastAPI router sits in front of both, listening on port 8080. A background poller hits each backend’s /metrics endpoint every 3 seconds to get queue depth and GPU cache usage. The load generator fires batches of concurrent requests with configurable system prompts and priority levels, measuring TTFT (time to first token) via streaming.
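To make the polling step concrete, here is a minimal sketch of how the router could pull queue depth and cache usage out of a /metrics response. It assumes the endpoint returns Prometheus text format, and the two metric names are assumptions that vary between vLLM versions, so check them against your own /metrics output. The function name parse_metrics is hypothetical.

```python
import re

# Assumed metric names; confirm against your vLLM version's /metrics output.
QUEUE_METRIC = "vllm:num_requests_waiting"
CACHE_METRIC = "vllm:gpu_cache_usage_perc"

# Matches sample lines such as `name{label="x"} 3.0` or `name 3.0`.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')


def parse_metrics(text: str, wanted: set[str]) -> dict[str, float]:
    """Extract selected metric values from Prometheus text exposition format.

    If a metric appears with several label sets, the last sample wins,
    which is fine for a single-model backend but not in general.
    """
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match and match.group(1) in wanted:
            try:
                values[match.group(1)] = float(match.group(2))
            except ValueError:
                continue  # ignore samples whose value is not numeric
    return values


sample = (
    "# TYPE vllm:num_requests_waiting gauge\n"
    'vllm:num_requests_waiting{model_name="qwen"} 3.0\n'
    'vllm:gpu_cache_usage_perc{model_name="qwen"} 0.42\n'
    "unrelated_metric 7\n"
)
snapshot = parse_metrics(sample, {QUEUE_METRIC, CACHE_METRIC})
```

A background task would fetch the text every 3 seconds and store the resulting snapshot per backend. The parsing is kept separate from the HTTP call so it can be tested without a running server.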
The router
The core idea is cache affinity: hash the system prompt, then route to the backend that already has those KV cache blocks computed. If that backend is overloaded, spill to the next best option. I built this on top of a scoring function that weighs queue depth and latency, where the lowest score wins:
score = (queue_depth + in_flight + 1) × latency_ewma
The in-flight counter was an early addition. The /metrics poller runs every 3 seconds, so right after the router sends a burst of requests, the polled queue depth is stale. The in-flight counter tracks what the router has sent but the poller hasn’t seen yet, bridging that gap.
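The scoring formula and the in-flight counter can be sketched as a small per-backend state object. This is an illustration under assumptions rather than the router's actual code: the class and method names, the EWMA smoothing factor, and the initial latency are all hypothetical. The sketch decrements in_flight when a request completes, so a request can briefly be counted in both the polled queue depth and the in-flight counter.

```python
from dataclasses import dataclass


@dataclass
class BackendState:
    name: str
    queue_depth: int = 0        # last value reported by the /metrics poller
    in_flight: int = 0          # requests dispatched and not yet completed
    latency_ewma: float = 0.1   # seconds; starting value is an assumption

    def score(self) -> float:
        # Lower is better: a busy or slow backend gets a higher score.
        return (self.queue_depth + self.in_flight + 1) * self.latency_ewma

    def on_dispatch(self) -> None:
        # Count the request immediately so a burst is visible before the next poll.
        self.in_flight += 1

    def on_complete(self, latency_s: float, alpha: float = 0.2) -> None:
        # alpha is an assumed smoothing factor, not a value from the post.
        self.in_flight = max(0, self.in_flight - 1)
        self.latency_ewma = alpha * latency_s + (1 - alpha) * self.latency_ewma

    def on_poll(self, queue_depth: int) -> None:
        self.queue_depth = queue_depth


def pick_backend(backends: list[BackendState]) -> BackendState:
    """Return the backend with the lowest score."""
    return min(backends, key=lambda b: b.score())


a = BackendState("a", queue_depth=2, latency_ewma=0.1)
b = BackendState("b", queue_depth=0, latency_ewma=0.2)
first_choice = pick_backend([a, b]).name
b.on_dispatch()
b.on_dispatch()
second_choice = pick_backend([a, b]).name
```

In the example, backend b wins at first because its queue is empty. After two dispatches, its in-flight count pushes its score above a's, even though no new poll has happened.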
On top of caching and scoring, the router has a queue_threshold parameter that controls backpressure. If every backend’s queue exceeds the threshold, the request gets a 429. I also added priority tiers: high-priority requests bypass backpressure entirely, and low-priority requests start getting shed at 50% of the threshold. The priority tiers turned out to matter most under saturation, where they let the system shed low-value traffic first instead of degrading everything equally.
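The backpressure rule with priority tiers fits in a few lines. The sketch below is one way to write it. The names Priority and should_reject are hypothetical, and two details are my assumptions: the comparison is strict (a queue exactly at the limit is still accepted), and the low-priority check uses the same "every backend" rule at half the threshold.

```python
from enum import Enum


class Priority(Enum):
    HIGH = "high"
    NORMAL = "normal"
    LOW = "low"


def should_reject(queue_depths: list[int], queue_threshold: int,
                  priority: Priority) -> bool:
    """Return True when the request should receive a 429."""
    if priority is Priority.HIGH:
        return False  # high priority bypasses backpressure entirely
    # Low priority is shed at 50% of the threshold; normal uses the full value.
    limit = queue_threshold * 0.5 if priority is Priority.LOW else queue_threshold
    # Reject only when every backend is over the limit. With an empty
    # list, all() is True, so a request with no backends is rejected.
    return all(depth > limit for depth in queue_depths)


depths = [10, 10]
decisions = {p.value: should_reject(depths, 16, p) for p in Priority}
```

With both queues at 10 and a threshold of 16, only the low-priority request is rejected, because 10 is over the low-priority limit of 8 but under the normal limit of 16.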
Cache affinity works, until it doesn’t
I ran 200 requests at each concurrency level with 5 system prompts of ~200 tokens, comparing cache-aware routing against plain round-robin.
Cache-aware routing cut TTFT by 42% at low concurrency (65ms vs 113ms at concurrency 1). Every request landed on the backend that already had its system prompt cached, so prefill for the shared prefix was skipped. The advantage held through concurrency 16 with a 100% cache hit rate. At concurrency 32 the gap narrowed as queuing delays on the hot backend ate into the cache savings, but cache-aware still had lower TTFT.
The cost showed up in throughput. At concurrency 32, cache-aware routing had sent all 200 requests to a single backend. That meant 32 requests queued on one GPU while the other sat idle, so one GPU worked through the whole backlog instead of two sharing it. Throughput dropped to 13.2 rps and the p95 spiked to 15 seconds. Round-robin split traffic 50/50 across both GPUs, giving up cache hits but using twice the compute. It hit 30.3 rps with a p95 of 1.4 seconds, about 2.3x the throughput.
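The two headline ratios follow directly from the reported numbers. A quick check, using only the figures quoted above (the helper name pct_reduction is hypothetical):

```python
def pct_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100


# TTFT at concurrency 1: round-robin 113 ms vs cache-aware 65 ms.
ttft_cut = pct_reduction(113, 65)

# Throughput at concurrency 32: round-robin 30.3 rps vs cache-aware 13.2 rps.
throughput_ratio = 30.3 / 13.2

print(f"TTFT reduction: {ttft_cut:.1f}%")
print(f"Throughput ratio: {throughput_ratio:.2f}x")
```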
It’s the same tradeoff that shows up in session affinity for stateless services, or cache partitioning in CDN architectures. Affinity improves hit rate. Concentration creates hotspots.
A few caveats on the data. Qwen 1.5B is a small model where prefill is already fast. With a 7B or 70B model, prefill takes proportionally longer and cache hits save proportionally more. The 42% relative improvement would likely hold or increase. These tests also ran through kubectl port-forward, which adds tunnel overhead to the absolute numbers. The relative comparison between policies is valid since both go through the same tunnels, though per-backend latency may differ slightly between AWS and GCP.
Finding the balance
Under high load, the router has to choose: keep sending to the backend with the warm cache and accept higher latency from queuing, or spill to another backend and redo the prefill work. The queue_threshold parameter sets that boundary. When the preferred backend’s queue exceeds the threshold, the router spills to the next best option. I swept it across four values at concurrency 32:
| Threshold | 429 Rejections | RPS | TTFT p50 |
|---|---|---|---|
| 4 | 194 / 200 | 6.7 | 401.9ms |
| 8 | 182 / 200 | 12.9 | 205.4ms |
| 16 | 175 / 200 | 22.3 | 185.3ms |
| 32 | 0 / 200 | 39.8 | 240.6ms |
At threshold 4, almost nothing got through: 194 out of 200 requests were rejected and throughput fell to 6.7 rps. Median TTFT for the six accepted requests was 401.9ms, the highest in the sweep, though six samples are too few to read much into. At threshold 32, every request was accepted and throughput peaked at 39.8 rps. A real-time assistant that needs tight latency would set a low threshold and shed overflow. A batch pipeline would set it high and tolerate the tail latency.
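The affinity-then-spill decision that queue_threshold controls can be sketched as follows. This is a simplified illustration, not the router's actual code. The function names are hypothetical, and mapping the prompt hash to a backend with a modulo is my assumption about how a preferred backend might be chosen.

```python
import hashlib
from typing import Optional


def preferred_index(system_prompt: str, n_backends: int) -> int:
    """Map a system prompt to a stable backend index via a content hash."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_backends


def choose_backend(system_prompt: str, names: list[str],
                   queue_depths: dict[str, int], scores: dict[str, float],
                   queue_threshold: int) -> Optional[str]:
    """Prefer the cache-warm backend, spill when it is over the threshold.

    Returns None when every backend is over the threshold, which the
    caller would turn into a 429.
    """
    preferred = names[preferred_index(system_prompt, len(names))]
    if queue_depths[preferred] <= queue_threshold:
        return preferred  # keep the cache hit
    # Spill: pick the lowest-scoring backend that still has room.
    candidates = [n for n in names
                  if n != preferred and queue_depths[n] <= queue_threshold]
    if not candidates:
        return None
    return min(candidates, key=lambda n: scores[n])


names = ["eks-t4", "gke-l4"]
scores = {"eks-t4": 0.3, "gke-l4": 0.2}
prompt = "You are a helpful assistant."
warm = choose_backend(prompt, names, {"eks-t4": 0, "gke-l4": 0}, scores, 32)
```

A low threshold makes the router give up the warm cache sooner, and a high one keeps requests on the preferred backend longer at the cost of queuing.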
Graceful degradation
When the router has to reject traffic, priority tiers control what gets rejected first. Requests over the queue threshold are immediately 429’d, but high-priority requests bypass the check entirely, and low-priority requests start getting rejected at 50% of the threshold. I ran a mix of priorities (20% high, 60% normal, 20% low) with a threshold of 16.
At concurrency 8, everything fit. At concurrency 32, the system was saturated and rejected 146 out of 200 requests. Every high and normal-priority request still got through. The rejected requests were all low-priority. Instead of all traffic degrading equally, low-value requests absorbed the impact first, preserving capacity and cache affinity for the rest.
At production scale
The router I built treats each backend as a single unit that handles the entire request. At production scale, you can do better. Prefill and decode have different hardware profiles: prefill is compute-bound, decode is memory-bandwidth-bound. Instead of running both on the same GPU, production systems can split them across separate pools specialized for each phase. A connector service transfers the KV cache between them. The routing decision also splits: which pool handles prefill, and which handles decode.
Wrapping up
This series traced an inference request from the gateway, through the GPU, and into the routing layer. The main thing I took away: routing decisions in inference serving depend on what the GPU has cached, and overload doesn’t just slow things down, it destroys work that was already done. Building something to test these tradeoffs made them concrete. The router code, Terraform modules, benchmark data, and learning notes are all in the companion repo.