LLM Inference at Scale: Kubernetes, GPUs, and Keeping Costs Sane
Running LLMs in production on your own infrastructure is genuinely hard. This is what we've learned deploying and operating self-hosted models at scale.
Running inference against the OpenAI API is easy. Running your own LLMs in production is a different discipline entirely — one that sits at the intersection of ML engineering, platform engineering, and cost management.
Here is what we’ve learned operating self-hosted LLMs across several production deployments.
When Self-Hosting Makes Sense
Self-hosting LLMs is not always the right answer. The API providers have invested billions in inference infrastructure. You are not going to out-operate them on cost at small scale.
Self-hosting makes sense when:
- Data privacy requirements prohibit sending data to third-party APIs
- Latency requirements demand sub-100ms token generation
- Volume is high enough that per-token API costs exceed infrastructure costs (typically >$50k/month in API spend)
- Fine-tuned models are a core part of your product
If none of these apply, use the API.
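The volume threshold above comes down to simple arithmetic. The sketch below compares a flat per-token API price with always-on GPU nodes. Every number in it (token volume, price per million tokens, GPU hourly rate, fixed overhead) is a hypothetical input for illustration, not a quoted price.

```python
def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """API spend for one month at a flat blended per-token price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def monthly_self_hosted_cost(gpu_count: int, usd_per_gpu_hour: float,
                             fixed_overhead_usd: float = 0.0) -> float:
    """Always-on GPU spend plus a fixed overhead (engineering time, storage, etc.)."""
    hours_per_month = 730  # average number of hours in a month
    return gpu_count * usd_per_gpu_hour * hours_per_month + fixed_overhead_usd


# Hypothetical inputs, not measured or quoted prices.
api = monthly_api_cost(tokens_per_month=10e9, usd_per_million_tokens=6.0)
hosted = monthly_self_hosted_cost(gpu_count=8, usd_per_gpu_hour=4.0,
                                  fixed_overhead_usd=20_000)
print(f"API: ${api:,.0f}/month, self-hosted: ${hosted:,.0f}/month")
```

Plugging in your own volumes and rates shows quickly whether you are anywhere near break-even. The fixed overhead term matters: at low volume it dominates the comparison.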
The GPU Infrastructure Stack
For Kubernetes-based LLM inference, the stack we reach for:
Node provisioning — GPU nodes on AWS (p4d, p3, g5 families) or Azure (NC, ND series). Spot/preemptible instances for batch inference, on-demand for latency-sensitive serving.
GPU operator — NVIDIA GPU Operator installs and manages GPU drivers across your cluster automatically. Do not manage GPU drivers manually.
Inference server — vLLM is our default choice. It implements PagedAttention for efficient KV cache management and delivers 2-4x throughput improvement over naive inference. For smaller models, Ollama is simpler to operate.
Model serving — KServe provides a Kubernetes-native serving layer with autoscaling, canary deployments, and a standard inference API. It wraps vLLM cleanly.
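To make the stack concrete, here is a minimal sketch that builds a plain Kubernetes Deployment running vLLM's OpenAI-compatible server on a GPU node. It deliberately uses a bare Deployment rather than a KServe resource to stay with fields that are standard Kubernetes. The deployment name and model identifier are hypothetical placeholders, and the image tag should be pinned in real use.

```python
import json


def vllm_deployment(name: str, model: str, gpus: int = 1) -> dict:
    """Build a Deployment manifest that runs the vLLM OpenAI-compatible server."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": "vllm",
                        # Pin a specific tag instead of latest in practice.
                        "image": "vllm/vllm-openai:latest",
                        "args": ["--model", model],
                        "ports": [{"containerPort": 8000}],
                        # Resource advertised by the NVIDIA device plugin,
                        # which the GPU Operator installs.
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                },
            },
        },
    }


# Hypothetical names for illustration only.
manifest = vllm_deployment("llm-7b", "my-org/my-7b-model")
print(json.dumps(manifest, indent=2))  # kubectl apply -f accepts JSON as well as YAML
```

The `nvidia.com/gpu` limit is what causes the scheduler to place the pod on a GPU node; without it the container would start on any node and fail to find a device.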
Autoscaling GPU Workloads
Kubernetes HPA on its default CPU and memory metrics doesn’t work well for GPU workloads, because those metrics don’t capture GPU utilisation or queue depth.
The pattern that works:
- Expose a custom metric: pending requests in the inference queue
- Configure HPA to scale on this metric
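- Set scale-to-zero for off-hours (with a warm-up tolerance on scale-up). HPA keeps at least one replica by default, so this needs a layer that supports zero replicas, such as KEDA or KServe in serverless mode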
- Use Karpenter for node autoscaling — it provisions GPU nodes in response to pending pods faster than Cluster Autoscaler
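The scaling rule HPA applies to a custom metric can be sketched in a few lines. The function below follows the documented HPA formula (desired replicas = ceil(current replicas × current value / target value), skipped when the ratio is within a tolerance of 1). The target of four pending requests per replica and the replica bounds are hypothetical values chosen for the example.

```python
import math


def desired_replicas(current_replicas: int, pending_requests: int,
                     target_pending_per_replica: float,
                     min_replicas: int = 1, max_replicas: int = 8,
                     tolerance: float = 0.1) -> int:
    """Apply the HPA scaling rule to an average-per-replica queue-depth metric.

    Assumes current_replicas >= 1; scaling up from zero is handled by a
    separate mechanism and is not modelled here.
    """
    current_per_replica = pending_requests / current_replicas
    ratio = current_per_replica / target_pending_per_replica
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, leave as is
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 2 replicas with 24 queued requests against a target of 4 per replica.
print(desired_replicas(2, 24, 4))
```

Queue depth works well as the metric because it rises before latency does, which gives the node autoscaler time to bring up GPU capacity.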
Cold start time for a 7B parameter model is 3-5 minutes including node provisioning and model loading. Plan for this in your SLA.
Cost Management
GPU compute is expensive. The levers that matter:
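Continuous batching — vLLM schedules work at the level of individual decode steps, so new requests join the running batch as soon as a slot frees up instead of waiting for the whole batch to finish. This dramatically improves GPU utilisation. Without it, a GPU serving one request at a time leaves most of its capacity idle.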
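A toy simulation shows why refilling slots after every decode step helps. This is not vLLM's scheduler, only a simplified model in which each request needs a fixed number of decode steps and the GPU can run a fixed number of sequences per step. The request lengths and batch size are made up for the example.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes, then the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    running: list[int] = []  # remaining decode steps per in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Requests with one step left finish now; the rest carry over.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: two long generations mixed with short ones.
lengths = [8, 1, 1, 1, 8, 1, 1, 1]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

In this workload the static scheduler needs 16 decode steps and the continuous one needs 9, because short requests no longer hold slots hostage until the longest request in their batch completes.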
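Quantisation — INT8 and INT4 quantisation reduce weight memory by roughly 2x and 4x respectively compared with FP16, allowing larger models on smaller GPUs or more concurrent requests. Quality degradation is minimal for most use cases, but verify this on your own evaluation set, particularly at INT4.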
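The memory saving is straightforward arithmetic on bytes per parameter. The sketch below estimates weight memory only; it ignores the KV cache, activations, and the small overhead of quantisation scales, so treat the results as a lower bound.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]


for dtype in BYTES_PER_PARAM:
    print(f"70B @ {dtype}: {weight_memory_gb(70, dtype):.0f} GB")
```

A 70B model drops from about 140 GB at FP16 to about 35 GB at INT4, which is the difference between needing several GPUs and fitting on one large one.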
Model tiering — route simple requests to smaller, cheaper models (7B) and complex requests to larger models (70B). A classifier in front of your inference stack that routes based on complexity can cut costs by 40-60%.
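A routing layer can start out very simple. The sketch below is a toy heuristic, not a trained classifier: the cue phrases, word threshold, and model names are all hypothetical. The second function shows the arithmetic behind the saving, given the share of traffic sent to the small model and the relative cost of each tier.

```python
def route(prompt: str, max_simple_words: int = 50) -> str:
    """Toy complexity check: short prompts without reasoning cues go to the small model."""
    reasoning_cues = ("explain why", "step by step", "prove", "compare")
    lowered = prompt.lower()
    is_long = len(lowered.split()) > max_simple_words
    needs_reasoning = any(cue in lowered for cue in reasoning_cues)
    return "large-70b" if is_long or needs_reasoning else "small-7b"


def blended_cost_saving(small_fraction: float, small_cost: float,
                        large_cost: float) -> float:
    """Fractional saving versus sending every request to the large model."""
    blended = small_fraction * small_cost + (1 - small_fraction) * large_cost
    return 1 - blended / large_cost


print(route("What is the capital of France?"))
print(route("Explain why the sky is blue, step by step"))
# Hypothetical mix: 60% of traffic to a model costing a tenth as much per request.
saving = blended_cost_saving(small_fraction=0.6, small_cost=0.1, large_cost=1.0)
print(f"{saving:.0%}")
```

With those assumed inputs the saving is 54%. The real figure depends on how much traffic is genuinely simple and on how often the router sends a hard request to the small model, so measure answer quality per tier as well as cost.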
Spot instances — for batch inference workloads, spot GPU instances on AWS can be 70% cheaper than on-demand. Implement checkpointing and retry logic to handle interruptions.
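Checkpointing and retry for a batch job can be sketched as follows. This is one possible shape, assuming a hypothetical `infer` callable and a JSON checkpoint file; in a real deployment the checkpoint would need to live on storage that survives the instance, and retries would usually add a backoff delay.

```python
import json
import os
import tempfile


def run_batch(prompts, infer, checkpoint_path, max_retries=3):
    """Process prompts in order, persisting results so a restarted job can resume."""
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            results = json.load(f)  # keys are stringified prompt indices

    for i, prompt in enumerate(prompts):
        key = str(i)
        if key in results:
            continue  # already completed before the interruption
        for attempt in range(1, max_retries + 1):
            try:
                results[key] = infer(prompt)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up; completed work is already checkpointed
        # Write to a temp file and rename, so a kill mid-write
        # cannot leave a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(results, f)
        os.replace(tmp_path, checkpoint_path)
    return [results[str(i)] for i in range(len(prompts))]


# Demo: pretend an earlier run finished the first prompt before being interrupted.
checkpoint = os.path.join(tempfile.mkdtemp(), "results.json")
with open(checkpoint, "w") as f:
    json.dump({"0": "A"}, f)

seen = []

def fake_infer(prompt):  # stand-in for a real model call
    seen.append(prompt)
    return prompt.upper()

outputs = run_batch(["a", "b", "c"], fake_infer, checkpoint)
print(outputs, seen)
```

On resume only the unfinished prompts reach the model, so an interruption costs at most the one request that was in flight.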
Observability for LLM Serving
Standard application metrics are not enough. You also need:
- Token throughput — tokens generated per second, per model, per node
- Time to first token — critical for streaming responses; users perceive this as latency
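- KV cache utilisation — the share of KV cache blocks in use; sustained values near 100% mean new requests will queue or be preempted. If prefix caching is enabled, track its hit rate separately: a high hit rate means vLLM is reusing computation across requests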
- Queue depth — leading indicator of capacity problems before latency degrades
- Model error rates — OOM errors, CUDA errors, and timeout rates by model
We push these metrics to Prometheus and visualise in Grafana alongside standard infrastructure metrics.
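To pin down what two of these metrics mean, here is a small sketch that derives time to first token and token throughput from per-request timestamps. The `RequestTrace` record shape is hypothetical; in practice these values would normally come from the inference server's Prometheus endpoint rather than being computed by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical record shape)."""
    submitted_at: float
    first_token_at: float
    finished_at: float
    output_tokens: int


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between submission and the first streamed token."""
    return trace.first_token_at - trace.submitted_at


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def token_throughput(traces: list[RequestTrace]) -> float:
    """Tokens generated per second over the window the traces cover."""
    window = max(t.finished_at for t in traces) - min(t.submitted_at for t in traces)
    return sum(t.output_tokens for t in traces) / window


# Made-up traces for illustration.
traces = [
    RequestTrace(0.0, 0.2, 2.0, 100),
    RequestTrace(0.5, 0.9, 3.0, 150),
    RequestTrace(1.0, 2.5, 5.0, 250),
]
ttfts = [time_to_first_token(t) for t in traces]
p95_ttft = percentile(ttfts, 95)
throughput = token_throughput(traces)
print(f"p95 TTFT: {p95_ttft:.2f}s, throughput: {throughput:.0f} tokens/s")
```

Note that the third request has a healthy overall throughput but a slow first token, which is exactly the case an average-latency dashboard would hide.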
Lessons Learned
Don’t underestimate model loading time. A 70B model takes 3-4 minutes to load from S3 into GPU memory. This kills cold-start latency. Pre-warm instances during low-traffic periods.
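Load time is roughly model size divided by sustained read throughput. The sketch below assumes FP16 weights (2 bytes per parameter) and two hypothetical throughput figures; measure your own storage path before relying on either.

```python
def load_time_seconds(params_billion: float, bytes_per_param: float,
                      throughput_gb_per_s: float) -> float:
    """Time to stream model weights at a sustained throughput (1 GB = 1e9 bytes)."""
    size_gb = params_billion * bytes_per_param
    return size_gb / throughput_gb_per_s


# Hypothetical sustained throughputs in GB/s.
for gb_per_s in (0.75, 3.0):
    seconds = load_time_seconds(70, 2.0, gb_per_s)
    print(f"{gb_per_s} GB/s: {seconds / 60:.1f} min")
```

At an assumed 0.75 GB/s, a 70B FP16 model (about 140 GB) takes just over three minutes to read, before any time spent initialising the server.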
GPU memory is the bottleneck, not compute. Most inference is memory-bandwidth bound, because generating each token means reading the model weights from GPU memory. Optimise for fitting the model and its KV cache into GPU memory (quantisation, model sharding) before buying more compute.
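A back-of-the-envelope bound makes the memory-bandwidth point concrete. If every generated token requires reading all weights once, single-sequence decode speed cannot exceed memory bandwidth divided by weight size. The 1,500 GB/s bandwidth below is an assumed figure for illustration, not a specification for any particular GPU.

```python
def max_decode_tokens_per_s(weight_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound for single-sequence decoding: each token reads all weights once."""
    return bandwidth_gb_per_s / weight_gb


ASSUMED_BANDWIDTH_GB_PER_S = 1500.0  # hypothetical; check your GPU's datasheet

fp16_bound = max_decode_tokens_per_s(14.0, ASSUMED_BANDWIDTH_GB_PER_S)  # 7B at FP16
int8_bound = max_decode_tokens_per_s(7.0, ASSUMED_BANDWIDTH_GB_PER_S)   # 7B at INT8
print(f"FP16: {fp16_bound:.0f} tokens/s, INT8: {int8_bound:.0f} tokens/s")
```

Halving the weight size doubles the bound without touching compute, which is why quantisation pays off twice. Batching helps for the same reason: one pass over the weights serves every sequence in the batch.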
Multi-GPU serving is hard. Tensor parallelism across multiple GPUs introduces inter-GPU communication overhead and operational complexity. Stay on single-GPU deployments for as long as possible.
Running or planning to run LLMs on your own infrastructure? Get in touch — we design and operate GPU inference platforms on Kubernetes.