Cloud Inference

Part 1: Designing an Inference Gateway

February 18, 2026 · #ai #distributed-systems

Tracing a request through the system that sits between you and the LLM.

Part 2: Inside the GPU Server

March 21, 2026 · #ai #distributed-systems

Weight streaming, KV caches, and why inference routing is a different problem.

Part 3: Routing Inference Requests

March 23, 2026 · #ai #distributed-systems

Building a cache-aware router, pointing it at real GPUs, and measuring the tradeoff between latency and throughput.