LLM Inference at Scale: Kubernetes, GPUs, and Keeping Costs Sane

Running LLMs in production on your own infrastructure is genuinely hard. This is what we've learned deploying and operating self-hosted models at scale.

R2SA Technologies
11 min read

Running inference against the OpenAI API is easy. Running your own LLMs in production is a different discipline entirely — one that sits at the intersection of ML engineering, platform engineering, and cost management.

Here is what we’ve learned operating self-hosted LLMs across several production deployments.

When Self-Hosting Makes Sense

Self-hosting LLMs is not always the right answer. The API providers have invested billions in inference infrastructure. You are not going to out-operate them on cost at small scale.

Self-hosting makes sense when:

  • Data privacy requirements prohibit sending data to third-party APIs
  • Latency requirements demand sub-100ms token generation
  • Volume is high enough that per-token API costs exceed infrastructure costs (typically >$50k/month in API spend)
  • Fine-tuned models are a core part of your product

If none of these apply, use the API.
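The volume criterion is a break-even calculation. Here is a minimal sketch, assuming a flat monthly infrastructure cost and a single blended API price per million tokens. Both figures in the example are hypothetical placeholders, and the function names are ours.

```python
def breakeven_tokens_per_month(infra_cost_usd: float, api_price_per_million_usd: float) -> float:
    """Monthly token volume at which API spend equals a fixed self-hosting cost."""
    return infra_cost_usd / api_price_per_million_usd * 1_000_000


def cheaper_option(tokens_per_month: float, infra_cost_usd: float,
                   api_price_per_million_usd: float) -> str:
    """Compare API spend against the fixed self-hosting cost for a given volume."""
    api_cost = tokens_per_month / 1_000_000 * api_price_per_million_usd
    return "self-host" if api_cost > infra_cost_usd else "api"


# Hypothetical inputs: $50k/month for GPUs plus operations, $10 per million tokens.
print(breakeven_tokens_per_month(50_000, 10.0))        # 5000000000.0 tokens/month
print(cheaper_option(8_000_000_000, 50_000, 10.0))     # self-host
print(cheaper_option(1_000_000_000, 50_000, 10.0))     # api
```

The infrastructure figure should include the engineering time needed to run the platform, not only the GPU bill, or the break-even point will look lower than it really is.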

The GPU Infrastructure Stack

For Kubernetes-based LLM inference, this is the stack we reach for:

Node provisioning — GPU nodes on AWS (p4d, p3, g5 families) or Azure (NC, ND series). Use spot/preemptible instances for batch inference and on-demand instances for latency-sensitive serving.

GPU operator — NVIDIA GPU Operator installs and manages GPU drivers across your cluster automatically. Do not manage GPU drivers manually.

Inference server — vLLM is our default choice. It implements PagedAttention for efficient KV cache management and delivers 2-4x throughput improvement over naive inference. For smaller models, Ollama is simpler to operate.
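To see why paged KV cache management matters, the toy comparison below contrasts reserving a full-length contiguous buffer per request with reserving only the fixed-size blocks a request actually needs. It is an illustration of the idea, not vLLM's implementation. The 16-token block size, the 2048-token maximum and the sequence lengths are assumptions for the example.

```python
import math


def reserved_contiguous(seq_lens, max_len):
    """Token slots reserved when every request preallocates max_len slots."""
    return len(seq_lens) * max_len


def reserved_paged(seq_lens, block_size=16):
    """Token slots reserved when each request holds only the whole blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [37, 512, 90, 1200]           # tokens currently held by four requests
used = sum(seq_lens)                     # 1839 slots actually in use

contiguous = reserved_contiguous(seq_lens, max_len=2048)   # 8192 slots
paged = reserved_paged(seq_lens)                           # 1856 slots

print(f"contiguous: {used / contiguous:.0%} of reserved memory in use")  # 22%
print(f"paged:      {used / paged:.0%} of reserved memory in use")       # 99%
```

The memory freed this way is what lets more requests share one GPU, which is where the throughput gain comes from.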

Model serving — KServe provides a Kubernetes-native serving layer with autoscaling, canary deployments, and a standard inference API. It wraps vLLM cleanly.

Autoscaling GPU Workloads

Standard Kubernetes HPA doesn’t work well for GPU workloads — CPU and memory metrics don’t capture GPU utilisation or queue depth.

The pattern that works:

  1. Expose a custom metric: pending requests in the inference queue
  2. Configure HPA to scale on this metric
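  3. Set scale-to-zero for off-hours (with a warm-up tolerance on scale-up). Stock HPA will not go below one replica unless the alpha HPAScaleToZero feature gate is enabled, so this step usually needs KEDA or KServe's serverless mode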
  4. Use Karpenter for node autoscaling — it provisions GPU nodes in response to pending pods faster than Cluster Autoscaler
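The scaling decision in steps 1 to 3 can be sketched as a small function. This is a simplified model of how an autoscaler targeting an average value per replica behaves: it divides the total metric by the target and rounds up. It ignores HPA's tolerance band and stabilisation windows, and the function name, the target of 10 pending requests per replica and the replica limits are assumptions for illustration.

```python
import math


def desired_replicas(total_queue_depth: int, target_per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 8) -> int:
    """Replicas needed so that each one holds at most target_per_replica pending requests."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(total_queue_depth / target_per_replica)
    # Clamp to the configured bounds, as an autoscaler would.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(45, 10))                  # 5
print(desired_replicas(0, 10))                   # 0  (scale to zero off-hours)
print(desired_replicas(0, 10, min_replicas=1))   # 1  (keep one warm replica)
print(desired_replicas(500, 10))                 # 8  (capped at max_replicas)
```

Setting `min_replicas` to 1 during business hours and 0 off-hours is one way to trade idle GPU cost against cold-start risk.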

Cold start time for a 7B parameter model is 3-5 minutes including node provisioning and model loading. Plan for this in your SLA.

Cost Management

GPU compute is expensive. The levers that matter:

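Continuous batching — vLLM’s continuous batching admits new requests into the running batch as soon as earlier ones finish, rather than waiting for the whole batch to drain. This dramatically improves GPU utilisation. Without batching, a GPU serving one request at a time spends most of each decode step waiting on memory rather than computing.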
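The effect is easiest to see in a toy simulation that counts decode steps. The sketch below is not vLLM's scheduler. It assumes every request produces one token per step, ignores prefill, and uses made-up output lengths and a batch size of 4.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """A finished request frees its slot immediately for the next queued request."""
    queue = deque(output_lens)
    running = []          # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        # Fill any free slots before the next decode step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Every running request emits one token; drop those that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps


lens = [10, 200, 15, 180, 20, 190, 12, 170]   # hypothetical output lengths in tokens
print(static_batching_steps(lens, 4))       # 390
print(continuous_batching_steps(lens, 4))   # 212
```

With mixed short and long outputs, the static scheduler leaves slots idle while the longest request in each batch finishes. The continuous scheduler reuses those slots and clears the same work in fewer steps.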

Quantisation — INT8 and INT4 quantisation reduce the weight memory footprint by roughly 2x and 4x respectively relative to FP16, allowing larger models on smaller GPUs or more concurrent requests. Quality degradation is minimal for most use cases.
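The arithmetic behind those ratios is short enough to write down. This sketch counts weight memory only, with 1 GB taken as 10^9 bytes. Real quantised formats also store scales and zero points, and the KV cache and runtime overhead come on top, so treat the results as lower bounds.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB, ignoring KV cache and runtime overhead."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB.
    return params_billion * bits_per_weight / 8


for params in (7, 70):
    for bits in (16, 8, 4):
        print(f"{params}B at {bits}-bit: {weight_memory_gb(params, bits):g} GB")
# 7B:  14 GB, 7 GB, 3.5 GB
# 70B: 140 GB, 70 GB, 35 GB
```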

Model tiering — route simple requests to smaller, cheaper models (7B) and complex requests to larger models (70B). A classifier in front of your inference stack that routes based on complexity can cut costs by 40-60%.
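Here is a rough sketch of the routing idea, with a deliberately naive heuristic standing in for the classifier. The cue phrases, the word threshold, the tier names and the cost ratio are all hypothetical. A production router would more likely be a small trained model evaluated against labelled traffic.

```python
def route(prompt: str, max_simple_words: int = 60) -> str:
    """Hypothetical heuristic: short prompts without reasoning cues go to the small tier."""
    cues = ("explain why", "step by step", "analyse", "prove", "refactor")
    text = prompt.lower()
    if len(text.split()) > max_simple_words or any(cue in text for cue in cues):
        return "large-70b"
    return "small-7b"


def blended_saving(fraction_to_small: float, small_cost_ratio: float) -> float:
    """Fractional cost saving versus sending every request to the large model."""
    return fraction_to_small * (1.0 - small_cost_ratio)


print(route("What is the capital of France?"))                   # small-7b
print(route("Explain why this query is slow, step by step"))     # large-70b

# Assumed: 60% of traffic is simple, and the small tier costs 15% of the large one.
print(f"{blended_saving(0.6, 0.15):.0%}")                        # 51%
```

The saving depends on two numbers you have to measure: the share of traffic that can safely go to the small tier, and the real cost ratio between tiers on your hardware.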

Spot instances — for batch inference workloads, spot GPU instances on AWS can be 70% cheaper than on-demand. Implement checkpointing and retry logic to handle interruptions.
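A minimal sketch of checkpoint-and-resume for a batch job follows. The `infer` callable and the checkpoint path are placeholders. The sketch writes to a local file for simplicity, whereas a real spot workload would persist to durable storage such as an object store, because the node's disk disappears with the instance.

```python
import json
import os


def run_batch(prompts, infer, checkpoint_path):
    """Process prompts in order, persisting results so a restarted job resumes."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)

    for i, prompt in enumerate(prompts):
        key = str(i)                  # JSON object keys are strings
        if key in done:
            continue                  # completed before the interruption
        done[key] = infer(prompt)
        # Write to a temp file, then rename, so a kill mid-write
        # cannot leave a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(done, f)
        os.replace(tmp_path, checkpoint_path)

    return [done[str(i)] for i in range(len(prompts))]
```

Rerunning `run_batch` with the same arguments after an interruption skips every prompt already in the checkpoint. Checkpointing after every item keeps the example short. In practice, flushing every N items is usually a better trade-off.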

Observability for LLM Serving

Standard application metrics are not enough. You also need:

  • Token throughput — tokens generated per second, per model, per node
  • Time to first token — critical for streaming responses; users perceive this as latency
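  • KV cache utilisation — share of KV cache blocks in use; sustained high values mean requests are about to queue or be preempted. Track prefix cache hit rate separately, since that is what shows vLLM reusing computation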
  • Queue depth — leading indicator of capacity problems before latency degrades
  • Model error rates — OOM errors, CUDA errors, and timeout rates by model

We push these metrics to Prometheus and visualise in Grafana alongside standard infrastructure metrics.
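The two latency-facing metrics can be derived from three timestamps per request. The sketch below shows the calculation only. The `RequestTrace` type and its field names are hypothetical, and exporting the values to Prometheus is left out.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    arrived_s: float        # when the request was received
    first_token_s: float    # when the first token was emitted
    finished_s: float       # when the last token was emitted
    output_tokens: int      # number of tokens generated


def time_to_first_token(trace: RequestTrace) -> float:
    """Seconds the user waits before anything appears (queueing plus prefill)."""
    return trace.first_token_s - trace.arrived_s


def decode_tokens_per_second(trace: RequestTrace) -> float:
    """Generation rate after the first token has been emitted."""
    decode_time = trace.finished_s - trace.first_token_s
    if decode_time <= 0:
        return 0.0
    return (trace.output_tokens - 1) / decode_time


trace = RequestTrace(arrived_s=0.0, first_token_s=0.25, finished_s=4.25, output_tokens=201)
print(time_to_first_token(trace))        # 0.25
print(decode_tokens_per_second(trace))   # 50.0
```

Keeping the two apart matters because they fail for different reasons: time to first token grows with queue depth and prompt length, while decode rate falls as the batch fills up.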

Lessons Learned

Don’t underestimate model loading time. A 70B model takes 3-4 minutes to load from S3 into GPU memory. This kills cold-start latency. Pre-warm instances during low-traffic periods.
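Load time is roughly model size divided by sustained read throughput, so it is worth estimating before committing to an SLA. The throughput figures below are illustrative assumptions rather than measurements, and the calculation ignores deserialisation and the host-to-GPU copy.

```python
def load_time_seconds(model_size_gb: float, throughput_gb_per_s: float) -> float:
    """Lower bound on load time: bytes to move divided by sustained read throughput."""
    return model_size_gb / throughput_gb_per_s


MODEL_SIZE_GB = 140.0   # 70B parameters at FP16 (2 bytes per weight)

# Hypothetical sustained throughputs for two weight sources.
sources = {"object storage": 0.7, "local NVMe cache": 3.0}
for name, gb_per_s in sources.items():
    minutes = load_time_seconds(MODEL_SIZE_GB, gb_per_s) / 60
    print(f"{name}: about {minutes:.1f} min")
```

Under these assumptions, object storage lands in the 3-4 minute range and a node-local cache in under a minute. That is the argument for pre-warming or caching weights on the node.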

GPU memory is the bottleneck, not compute. Token-by-token decoding is usually memory-bandwidth bound, and GPU memory capacity limits how many concurrent requests fit in the KV cache. Optimise for fitting more model into GPU memory (quantisation, model sharding) before buying more compute.
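A back-of-the-envelope sizing shows how quickly memory runs out. The sketch assumes a standard transformer that caches one key and one value vector per layer and per KV head, with dimensions matching a Llama-2-7B-style model (32 layers, 32 KV heads, head dimension 128) at FP16. The 24 GB GPU and the 10% reserve for activations and runtime overhead are assumptions for illustration.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV cache bytes needed for one token of context."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_cached_tokens(gpu_memory_gb: float, weights_gb: float,
                      bytes_per_token: int, reserve_fraction: float = 0.1) -> int:
    """Tokens of context that fit in what is left after weights and a safety reserve."""
    free_bytes = (gpu_memory_gb * (1 - reserve_fraction) - weights_gb) * 1e9
    return max(0, int(free_bytes // bytes_per_token))


per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
print(per_token)                                # 524288 bytes, i.e. 0.5 MiB per token

# Assumed: 24 GB GPU, 14 GB of FP16 weights, 10% held back.
print(max_cached_tokens(24, 14, per_token))     # 14495 tokens shared by all requests
```

About 14,000 tokens of total context across all in-flight requests is not much. Quantising the weights raises concurrency by freeing memory for the cache, even when compute is unchanged.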

Multi-GPU serving is hard. Tensor parallelism across multiple GPUs adds inter-GPU communication overhead on every layer, plus operational complexity. Stay on single-GPU deployments for as long as possible.


Running or planning to run LLMs on your own infrastructure? Get in touch — we design and operate GPU inference platforms on Kubernetes.
