Cost-Effective Inference
Training is expensive but bounded: you pay for it once per model, or once per retraining cycle. Inference cost is open-ended because it scales with usage, so for successful products it often dominates total ML spend.
This page covers practical levers for reducing inference cost.
The cost equation
Inference cost ≈ (compute per request) × (requests) / (compute per dollar)
You can attack any term:
- Reduce compute per request (smaller model, optimization)
- Reduce requests (caching, batching, prefiltering)
- Increase compute per dollar (better hardware, spot pricing)
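As a rough illustration of the equation above, the sketch below computes the estimate and then pulls each lever in turn. The function name and every number are hypothetical; compute is measured in GPU-seconds here purely for convenience.

```python
def inference_cost_usd(compute_per_request: float,
                       requests: float,
                       compute_per_dollar: float) -> float:
    """(compute per request) x (requests) / (compute per dollar)."""
    if compute_per_dollar <= 0:
        raise ValueError("compute_per_dollar must be positive")
    return compute_per_request * requests / compute_per_dollar


# Assumed baseline: 0.5 GPU-seconds per request, 1M requests,
# and 1800 GPU-seconds per dollar (i.e. a GPU rented at $2/hour).
baseline = inference_cost_usd(0.5, 1_000_000, 1800)

# Lever 1: halve compute per request (smaller or optimized model).
smaller_model = inference_cost_usd(0.25, 1_000_000, 1800)

# Lever 2: serve 30% of requests from a cache, so they never reach the model.
with_cache = inference_cost_usd(0.5, 700_000, 1800)

# Lever 3: double compute per dollar (cheaper hardware or spot pricing).
cheaper_hw = inference_cost_usd(0.5, 1_000_000, 3600)

for name, cost in [("baseline", baseline), ("smaller model", smaller_model),
                   ("30% cache hits", with_cache), ("cheaper hardware", cheaper_hw)]:
    print(f"{name:>16}: ${cost:,.2f}")
```

The levers multiply, so combining all three in this toy setup would cut the bill by more than any single change.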
Model selection
The cheapest optimization: use a smaller model.
For LLMs:
- GPT-4o → GPT-4o mini: roughly 10-50x cheaper per token, often acceptable
- Claude Opus → Claude Haiku: similar tradeoff
- 70B → 7B model: ~10x less compute and memory per token, so roughly 10x cheaper to serve
For traditional ML:
- Deep neural net → gradient boosting → linear model
Always test: many tasks don't need the strongest model.
Quantization
Reduce numerical precision of weights and activations.
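- FP32 → FP16: half the memory, up to ~2x speed
- FP16 → INT8: half the memory again, up to ~2x speed on hardware with INT8 kernels
- INT8 → INT4: half the memory again, marginal speed gain (depends on hardware and kernel support)
Quality impact is usually minimal for FP16 and INT8. INT4 needs care: re-run your evaluation set before shipping.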
Tools: bitsandbytes, GPTQ, AWQ, llama.cpp.
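To make the memory arithmetic and the source of quantization error concrete, here is a minimal pure-Python sketch. It assumes symmetric per-tensor INT8 quantization, which is only one of several schemes; the tools listed above use more refined variants (per-channel or per-group scales, calibration data). All function names are illustrative.

```python
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed for the weights alone (no activations or KV cache)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e9


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


# A hypothetical 7B-parameter model at each precision.
for dtype in ("fp32", "fp16", "int8", "int4"):
    print(f"{dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value differs from the original by at most scale / 2.
print(q, [round(r, 4) for r in restored])
```

The rounding error is bounded by half the scale, and the scale is set by the largest weight, which is why outlier weights hurt low-bit quantization most.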
Distillation
Train a smaller "student" model to mimic a larger "teacher."
Common approach:
1. Run teacher on lots of data
2. Train student on (input, teacher_output) pairs
3. Deploy student
Works well for many tasks. Requires the teacher and good training infrastructure.
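The three steps above can be sketched with toy models. This is an illustration of the workflow only: the "teacher" is a stand-in function, the "student" is a one-variable linear model fitted by least squares, and every name is hypothetical. Real distillation of neural networks usually trains the student by gradient descent on teacher logits or generated text.

```python
def teacher(x: float) -> float:
    """Stand-in for an expensive model (hypothetical)."""
    return 3.0 * x + 1.0 + 0.1 * x * x


def build_distillation_pairs(teacher_fn, inputs):
    """Step 1: run the teacher on lots of data and record its outputs."""
    return [(x, teacher_fn(x)) for x in inputs]


def fit_linear_student(pairs):
    """Step 2: train a cheaper student on (input, teacher_output) pairs.

    Here the student is y = slope * x + intercept, fitted in closed form.
    """
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


inputs = [i / 10 for i in range(11)]           # unlabeled inputs in [0, 1]
pairs = build_distillation_pairs(teacher, inputs)
student = fit_linear_student(pairs)            # Step 3: deploy `student`

worst_gap = max(abs(student(x) - y) for x, y in pairs)
print(f"largest student/teacher gap on the training inputs: {worst_gap:.4f}")
```

Note that the student is only trustworthy on inputs resembling the data the teacher was run on, which is why the choice of distillation data matters as much as the student architecture.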
Pruning
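Remove weights that contribute little. Unstructured pruning (zeroing individual weights) can save memory when weights are stored in a sparse format, but it rarely speeds up inference on dense hardware. Structured pruning (removing entire heads or layers) shrinks the dense computation itself, so it does speed up inference.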
Batching
Process multiple requests together. Modern GPUs are heavily underutilized at batch size 1.
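Static batching: collect N requests, then run them together. Adds latency while the batch fills.
Dynamic batching: the server groups whatever requests arrive within a short time window, up to a maximum batch size. Used in serving frameworks such as Triton Inference Server.
Continuous batching: for autoregressive models, requests join and leave the running batch between generation steps instead of waiting for the whole batch to finish. Used in vLLM and TGI. Major throughput gain.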
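A minimal sketch of the dynamic-batching policy: close a batch when it is full or when a maximum wait has elapsed since its first request. It works on recorded arrival times rather than a live queue so the behavior is easy to inspect; the function and parameter names are hypothetical and do not correspond to any particular serving framework.

```python
def form_batches(arrivals, max_batch_size, max_wait_s):
    """Group requests into batches.

    arrivals: list of (arrival_time_s, request_id), sorted by time.
    A batch closes when it reaches max_batch_size, or when a request
    arrives after the deadline set by the batch's first request.
    """
    batches = []
    current = []
    deadline = None
    for arrival_time, request_id in arrivals:
        if current and arrival_time > deadline:
            batches.append(current)        # waited long enough: flush
            current = []
        if not current:
            deadline = arrival_time + max_wait_s
        current.append(request_id)
        if len(current) == max_batch_size:
            batches.append(current)        # batch is full: flush
            current = []
    if current:
        batches.append(current)            # flush whatever is left
    return batches


arrivals = [(0.00, "a"), (0.01, "b"), (0.02, "c"), (0.50, "d"), (0.51, "e")]
print(form_batches(arrivals, max_batch_size=2, max_wait_s=0.05))
print(form_batches(arrivals, max_batch_size=4, max_wait_s=0.05))
```

The two knobs trade latency for throughput: a longer wait or a larger maximum size fills batches better but delays the first request in each batch.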
Caching
Response caching
Identical request? Return cached response.
Hash the request (or relevant parts) as cache key.
Works best for deterministic outputs.
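A minimal sketch of exact-match response caching, assuming an in-memory dict and a caller-supplied `backend` callable (both are illustrative choices; a production cache would typically use a shared store with expiry). The key hashes every field that affects the output, serialized with sorted keys so that field order does not matter.

```python
import hashlib
import json


def cache_key(model: str, prompt: str, params: dict) -> str:
    """Stable hash of everything that influences the response."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    def __init__(self, backend):
        self._backend = backend    # callable(model, prompt, params) -> str
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, model: str, prompt: str, params: dict) -> str:
        key = cache_key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._backend(model, prompt, params)
        self._store[key] = response
        return response


def fake_backend(model, prompt, params):
    """Placeholder for a real model call."""
    return f"{model} says: {prompt.upper()}"


cache = ResponseCache(fake_backend)
cache.get("small-model", "hello", {"temperature": 0, "max_tokens": 16})
cache.get("small-model", "hello", {"max_tokens": 16, "temperature": 0})  # same key
print(cache.hits, cache.misses)
```

Including sampling parameters in the key matters: the same prompt at a different temperature or token limit is a different request.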
Prompt caching
For LLMs: cache the computation for a shared prompt prefix (the attention key/value state). New requests that start with the same prefix skip recomputing it and pay only for the new suffix.
Anthropic, OpenAI, and others now offer this directly.
Major savings for long system prompts or RAG with repeated context.
See [PromptCaching](PromptCaching).
Semantic caching
Cache based on semantic similarity, not exact match. "What's the capital of France?" and "Capital of France?" share an answer.
Use embeddings + nearest neighbor lookup.
Risk: false-positive matches return wrong answers.
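A toy sketch of the lookup, and of how the false-positive risk depends on the similarity threshold. The `toy_embed` function is a bag-of-words stand-in used only so the example runs without dependencies; a real system would call an embedding model and an approximate nearest-neighbor index instead of a linear scan. All names and thresholds are hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Bag-of-words 'embedding' (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold: float):
        self.embed = embed
        self.threshold = threshold
        self.entries = []              # list of (vector, answer)

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))

    def get(self, query: str):
        """Return the nearest cached answer if it is similar enough, else None."""
        vector = self.embed(query)
        best_score, best_answer = 0.0, None
        for cached_vector, answer in self.entries:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


strict = SemanticCache(toy_embed, threshold=0.7)
strict.put("What's the capital of France?", "Paris")
print(strict.get("Capital of France?"))     # close enough: cache hit
print(strict.get("Capital of Germany?"))    # below threshold: miss

loose = SemanticCache(toy_embed, threshold=0.5)
loose.put("What's the capital of France?", "Paris")
print(loose.get("Capital of Germany?"))     # false positive: wrong answer served
```

The threshold is the main tuning knob: it should be set from labeled examples of near-duplicate and near-miss queries rather than guessed.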
Speculative decoding
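For LLMs: a small "draft" model proposes several tokens; the large model verifies them in a single parallel forward pass and keeps the longest prefix it agrees with.
Net effect: the same output distribution as the large model alone, with fewer large-model forward passes. Speedups of 2-3x are commonly reported, depending on how often the draft's tokens are accepted.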
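The propose-and-verify loop can be sketched for the greedy-decoding case, where "verify" simply means checking each drafted token against the large model's own greedy choice. This is a simplified illustration under stated assumptions: both models are toy functions mapping a token sequence to the next token, the parallel verification pass is simulated with a loop, and the bonus token a real implementation gains when every draft token is accepted is omitted. Sampling-based decoding needs a rejection-sampling rule instead of the equality check.

```python
def greedy_decode(target_next, prompt, num_tokens):
    """Baseline: one large-model call per generated token."""
    seq = list(prompt)
    for _ in range(num_tokens):
        seq.append(target_next(seq))
    return seq


def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding; returns (sequence, verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + num_tokens
    target_passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes up to k tokens, one at a time.
        proposal = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft_next(seq + proposal))
        # 2. The target checks all proposed positions. A real system does
        #    this in ONE batched forward pass; the loop only simulates it.
        target_passes += 1
        accepted = []
        for i, token in enumerate(proposal):
            expected = target_next(seq + proposal[:i])
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)   # take the target's token, stop
                break
        seq.extend(accepted)
    return seq, target_passes


def toy_target(seq):
    """Hypothetical large model."""
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq):
    """Hypothetical draft model: agrees with the target except after token 5."""
    return 0 if seq[-1] == 5 else (seq[-1] * 3 + 1) % 7


baseline = greedy_decode(toy_target, [1], 12)
output, passes = speculative_decode(toy_target, toy_draft, [1], 12, k=4)
print(output == baseline, passes)
```

Every emitted token is either confirmed or supplied by the target model, so the output matches plain greedy decoding; the saving comes from needing fewer target passes than generated tokens.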
Routing
Use multiple models tiered by capability:
- Cheap model handles 80% of queries
- Escalate to expensive model only when needed
Routing logic ranges from rules to learned classifiers.
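One simple form of routing is a confidence cascade: try the cheap model first and escalate only when it is unsure. The sketch below assumes the cheap model returns an (answer, confidence) pair; that interface, the threshold, the toy models, and the relative costs are all hypothetical. The second function estimates the blended cost of such a cascade, in which escalated queries pay for both models.

```python
def route(query, cheap_model, expensive_model, min_confidence=0.8):
    """Return (answer, tier). Escalate when the cheap model is not confident."""
    answer, confidence = cheap_model(query)
    if confidence >= min_confidence:
        return answer, "cheap"
    return expensive_model(query), "expensive"


def blended_cost(handled_by_cheap, cheap_cost, expensive_cost):
    """Average cost per query for a cascade.

    Every query pays for the cheap model; the escalated fraction
    additionally pays for the expensive one.
    """
    return cheap_cost + (1.0 - handled_by_cheap) * expensive_cost


def toy_cheap_model(query):
    """Toy stand-in: confident only on short queries."""
    confidence = 0.95 if len(query.split()) <= 5 else 0.4
    return "cheap:" + query, confidence


def toy_expensive_model(query):
    return "expensive:" + query


print(route("capital of France", toy_cheap_model, toy_expensive_model))
print(route("compare the fiscal policies of France and Germany since 1990",
            toy_cheap_model, toy_expensive_model))

# With 80% of queries handled cheaply and a 20x price gap (assumed numbers):
print(blended_cost(0.8, cheap_cost=1.0, expensive_cost=20.0))
```

With these assumed numbers the cascade costs about a quarter of always calling the expensive model, but the saving shrinks quickly if the cheap model's confidence is poorly calibrated and too many queries escalate.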
Hardware choices
GPUs
A100, H100: high throughput, expensive
A10, L4: mid-tier
T4: budget, still capable
For LLM token generation, which is usually memory-bound, memory bandwidth often matters more than peak FLOPs.
CPUs
For small models or non-latency-sensitive workloads, CPU inference is often cheaper.
Modern CPUs with AVX-512 / AMX can run quantized LLMs surprisingly well.
See [CPUInference](CPUInference).
Accelerators
TPUs, Inferentia, Groq — specialized hardware can offer better cost/performance for some workloads.
Spot/preemptible instances
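For batch inference: typically 50-90% cheaper than on-demand pricing, but instances can be reclaimed at short notice, so jobs need checkpointing or safe retries.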
Batch vs real-time
If you don't need real-time:
- Batch overnight on cheap hardware
- Use spot instances
- Larger batch sizes
Many use cases don't actually need real-time.
API vs self-hosted
API providers offer:
- Zero ops cost
- Latest models
- Pay-per-use
Self-hosted:
- Lower per-token cost at high volume
- Custom models
- Data privacy
- Operational burden
The breakeven varies. Many teams underestimate self-hosting ops cost.
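A back-of-the-envelope breakeven can be estimated by treating self-hosting as a fixed monthly cost (hardware plus the engineering time that is easy to underestimate) and a lower marginal price per token. The model and all prices below are hypothetical placeholders, not quotes from any provider.

```python
def monthly_api_cost(tokens_per_month, api_price_per_million):
    return tokens_per_month / 1e6 * api_price_per_million


def monthly_self_hosted_cost(tokens_per_month, fixed_monthly_cost,
                             marginal_price_per_million):
    """Fixed cost (GPUs, on-call, maintenance) plus per-token running cost."""
    return fixed_monthly_cost + tokens_per_month / 1e6 * marginal_price_per_million


def breakeven_tokens_per_month(api_price_per_million, fixed_monthly_cost,
                               marginal_price_per_million):
    """Monthly volume above which self-hosting is cheaper, or None if never."""
    saving_per_million = api_price_per_million - marginal_price_per_million
    if saving_per_million <= 0:
        return None
    return fixed_monthly_cost / saving_per_million * 1e6


# Assumed: API at $3 per million tokens; self-hosting at $10,000/month fixed
# (including ops time) plus $1 per million tokens.
volume = breakeven_tokens_per_month(3.0, 10_000, 1.0)
print(f"breakeven at {volume / 1e9:.1f}B tokens/month")

# Doubling the fixed cost to account for underestimated ops work
# doubles the volume needed to break even.
print(breakeven_tokens_per_month(3.0, 20_000, 1.0) / 1e9)
```

Because the breakeven volume is proportional to the fixed cost, an honest estimate of ops effort changes the answer more than small differences in per-token price.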
Measurement
Without metrics, optimization is guesswork. Track:
- Cost per request
- p50/p95/p99 latency
- Requests per second
- GPU utilization
- Cache hit rate
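Most of these metrics can be derived from a request log. The sketch below assumes you already have per-request latencies, the total spend, and a cache-hit count for a time window; the function names and field names are illustrative. Percentiles use the nearest-rank method. GPU utilization is not covered because it comes from hardware monitoring rather than the request log.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, total_cost_usd, cache_hits, window_s):
    """Summarize one measurement window of request logs."""
    n = len(latencies_ms)
    return {
        "cost_per_request_usd": total_cost_usd / n,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "requests_per_second": n / window_s,
        "cache_hit_rate": cache_hits / n,
    }


# Synthetic window: 100 requests over 50 seconds, $2.00 spent, 25 cache hits.
report = summarize(list(range(1, 101)), total_cost_usd=2.0,
                   cache_hits=25, window_s=50.0)
for name, value in report.items():
    print(f"{name}: {value}")
```

Tracking tail percentiles alongside cost matters because several of the optimizations on this page (batching in particular) lower cost per request while raising p95/p99 latency.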
Common failure patterns
Premature optimization
Hand-tuning quantization for a model you'll replace next month wastes effort.
Ignoring the cheap wins
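For workloads with repeated requests or shared prompt prefixes, caching can cut cost by 50% or more with little engineering. Check the hit rate on real traffic before building anything more elaborate.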
Over-engineering routing
Complex routing systems can cost more in maintenance than they save in inference.
Not measuring quality after optimization
Quality regressions from quantization, distillation, or routing can be subtle.
Using the wrong model
The strongest model is rarely needed. Test smaller models.
Decision order
1. Choose the smallest model that works
2. Add caching aggressively
3. Quantize as much as quality allows
4. Batch where latency permits
5. Consider distillation for large-volume tasks
6. Optimize hardware/deployment
Further Reading
- [InferenceServing](InferenceServing) — Serving infrastructure
- [CPUInference](CPUInference) — CPU-based deployment
- [PromptCaching](PromptCaching) — Caching for LLMs
- [ModelSelectionEfficiency](ModelSelectionEfficiency) — Model size tradeoffs
- [ML Hub](MLHub) — Cluster index