Designing an Inference Gateway


I’ve been using Claude Code a lot lately and got curious about the infrastructure behind it. This post traces a request through the inference gateway, the layer between you and the model. I work in distributed systems but haven’t built or operated one of these, so this is based on research, not production experience. If I’ve gotten something wrong, I’d love to hear from you in the GitHub Discussions.

What is an inference gateway?

An inference gateway is the system that sits between clients and the GPU servers running the model.

This layer exists because GPU hardware is expensive and specialized. An H100 costs tens of thousands of dollars, so you want it doing matrix multiplications, not parsing JSON or checking rate limits. The gateway handles everything else: routing, overload protection, caching, and keeping things running when stuff breaks. Gateway instances are cheap, stateless, and horizontally scalable, so they're a natural place to put all that work.

In practice, a production inference gateway runs as a fleet of stateless instances behind a load balancer, distributing client traffic across gateway nodes that each talk to a shared pool of GPU backends. The architecture looks something like this:

Inference Gateway Architecture

Let’s trace a request through this system.

The Client

You’re working in Claude Code, typing a prompt and hitting enter. Or you’re in the claude.ai UI, sending a message in a conversation. Either way, your client makes an API request to the inference endpoint. Under the hood, the request looks something like this:

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "messages": [
    {"role": "user", "content": "Explain how an inference gateway works"}
  ]
}
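To make the round trip concrete, here's a minimal sketch of the client side: building that payload and parsing the token stream that comes back. The event shape and field names (`type`, `text`) are illustrative assumptions, not Anthropic's actual wire format.

```python
import json

def build_request(prompt: str, model: str = "claude-sonnet-4-6",
                  max_tokens: int = 1024) -> str:
    """Serialize a chat request like the one shown above."""
    return json.dumps({
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })

def parse_sse_tokens(stream_lines):
    """Yield text deltas from a server-sent-events stream, one chunk at a time."""
    for line in stream_lines:
        if line.startswith("data: "):
            event = json.loads(line[len("data: "):])
            if event.get("type") == "token":
                yield event["text"]

# Simulated SSE lines as they'd arrive over the streaming connection.
lines = [
    'data: {"type": "token", "text": "Hello"}',
    'data: {"type": "token", "text": " world"}',
    'data: {"type": "done"}',
]
print("".join(parse_sse_tokens(lines)))  # Hello world
```

The streaming shape matters later in the pipeline: the client starts rendering as soon as the first `data:` line arrives, which is why time-to-first-token is the latency number everyone watches.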

The Load Balancer

The request hits a fleet of load balancer hosts, which distribute incoming traffic across the gateway instances. They're stateless and horizontally scalable, and there's no inference-specific logic here, just standard traffic distribution.
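"Standard traffic distribution" can be as simple as round-robin. Here's a toy sketch; real load balancers layer on health checks, connection draining, and weighting, none of which is shown here.

```python
import itertools

class RoundRobinBalancer:
    """Cycle through gateway instances, one per incoming request."""

    def __init__(self, gateways):
        self._cycle = itertools.cycle(gateways)

    def pick(self):
        return next(self._cycle)

lb = RoundRobinBalancer(["gw-1", "gw-2", "gw-3"])
print([lb.pick() for _ in range(4)])  # ['gw-1', 'gw-2', 'gw-3', 'gw-1']
```

Round-robin works here precisely because the gateways are stateless: any instance can serve any request, so the balancer needs no memory of where earlier requests went.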

The Inference Gateway

The request lands on one of the gateway instances. The gateway is responsible for everything that isn’t inference itself: authentication, rate limiting, input and output safety screening, request queuing, and observability. Once the request clears those checks, the gateway decides which GPU server to send it to and forwards the request.

Gateway Pipeline
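The pipeline above can be sketched as a chain of checks, each of which can reject a request before any GPU time is spent. The individual check functions here are placeholders I made up for illustration; real implementations would call an auth service, a distributed rate limiter, and a safety classifier.

```python
class Rejected(Exception):
    """Raised by any pipeline stage that refuses the request."""

def authenticate(req):
    if req.get("api_key") != "secret":          # placeholder auth check
        raise Rejected("401 unauthenticated")

def rate_limit(req, counts, limit=2):
    n = counts.get(req["api_key"], 0) + 1
    counts[req["api_key"]] = n
    if n > limit:
        raise Rejected("429 rate limited")

def screen_input(req):
    if "disallowed" in req["prompt"]:           # stand-in for a safety classifier
        raise Rejected("400 input blocked")

def handle(req, counts):
    """Run every stage in order; only a clean request reaches a GPU."""
    authenticate(req)
    rate_limit(req, counts)
    screen_input(req)
    return "forwarded to GPU backend"

counts = {}
req = {"api_key": "secret", "prompt": "hello"}
print(handle(req, counts))          # forwarded to GPU backend
print(handle(req, counts))          # forwarded to GPU backend
try:
    handle(req, counts)             # third request exceeds limit=2
except Rejected as e:
    print(e)                        # 429 rate limited
```

The ordering is deliberate: the cheapest checks run first, so an unauthenticated request never costs a rate-limit lookup, let alone a safety-model call.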

Tokens stream back from the GPU server through the gateway, which screens the output for safety before relaying it to the client. Time-to-first-token matters for the user experience, so the gateway streams tokens as they arrive rather than buffering the full response.
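Screening a stream you're already relaying is awkward: you can't wait for the full response, but a single token rarely carries enough context to judge. One approach, sketched here with a made-up `screen()` function standing in for a real safety classifier, is to check a sliding window of recent text and forward each token as soon as it clears.

```python
def screen(window: str) -> bool:
    """Stand-in for a safety classifier; True means the text is OK."""
    return "badword" not in window

def relay(token_stream, window_size=32):
    """Forward tokens as they arrive, screening a sliding window of text."""
    window = ""
    for token in token_stream:
        window = (window + token)[-window_size:]
        if not screen(window):
            yield "[stream terminated by safety filter]"
            return
        yield token

print(list(relay(["Hi", " there", "!"])))  # ['Hi', ' there', '!']
```

The trade-off is visible in the code: a larger window gives the classifier more context but means more tokens have already been sent by the time a violation is detected. That tension comes up again in the follow-up questions at the end of this post.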

The GPU Server

The request arrives at a GPU server: a machine with one or more GPUs attached. The CPU runs a web server and an inference engine (e.g., vLLM or TGI); the GPU does the actual matrix math. VRAM is the critical resource: it holds model weights and the KV cache for active requests.

GPU Server Internals
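A back-of-envelope calculation shows why VRAM is the scarce resource. The KV cache stores two vectors (key and value) per layer per token. The dimensions below are assumptions roughly in the shape of Llama-2-70B; the real figures for Claude models aren't public.

```python
# Assumed model dimensions (Llama-2-70B-like, with grouped-query attention).
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_value = 2                     # fp16

# 2x for the key and the value vector at each layer.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token)               # 327680 bytes, i.e. 320 KiB per token

context = 8192                          # tokens held live for one request
per_request_gib = kv_bytes_per_token * context / 2**30
print(round(per_request_gib, 2))        # 2.5 GiB of KV cache for one request
```

At 2.5 GiB per long-context request, an 80 GiB H100 that has already given most of its VRAM to model weights can hold the cache for only a handful of concurrent requests, which is exactly why the inference engine, not the gateway, has to own admission into the batch.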

The inference engine owns batching and scheduling. vLLM can add new requests to an in-progress batch and retire completed ones on the fly, because it has direct visibility into VRAM state. The gateway doesn’t have that visibility, so it doesn’t try. There’s a lot more going on inside the GPU server (KV cache management, tensor parallelism, quantization), but that’s a topic for another post. For now, what matters is the interface: the gateway sends requests, the server handles inference.
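Here's a toy sketch of that continuous-batching behavior: requests join the running batch whenever a slot frees up and retire the moment they finish, instead of the whole batch starting and stopping together. The scheduling here is simplified far beyond what vLLM actually does.

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """requests: list of (name, tokens_to_generate).
    Returns the batch membership after each decode step."""
    waiting = deque(requests)
    running = {}                        # name -> tokens still to generate
    log = []
    while waiting or running:
        # Admit new requests whenever a batch slot is free.
        while waiting and len(running) < max_batch:
            name, n = waiting.popleft()
            running[name] = n
        # One decode step for every request in the batch.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]       # retire on the fly, freeing a slot
        log.append(sorted(running))
    return log

# 'a' finishes after one step, so 'c' joins 'b' immediately.
print(continuous_batching([("a", 1), ("b", 3), ("c", 2)]))
# [['b'], ['b', 'c'], []]
```

With static batching, 'c' would have had to wait until both 'a' and 'b' finished. The admit-and-retire decisions depend on how much KV cache VRAM is free at each step, which is the visibility the gateway lacks.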

What’s Next

That’s the end-to-end picture: a request travels from your client through a load balancer, into the gateway where it’s authenticated, rate limited, safety checked, and routed, then on to a GPU server for inference, and back through the gateway as a stream of tokens.

Each of those steps has interesting problems underneath it. How does the gateway decide which GPU server to route to, and why does that decision affect whether the server has to redo work it’s already done? How do you rate limit across a fleet of stateless gateway instances that don’t share memory? What does backpressure look like when a single request holds a connection open for seconds while tokens stream back one at a time? How do you screen output for safety when the response is a stream you’ve already started sending?

I’ll dig into these in a follow-up post.

Have thoughts? Let's discuss.
