Model Selection for Efficiency

The biggest model that fits your data isn't always the right model. For production, efficiency often matters more than peak accuracy.

This page covers selection with efficiency as a first-class concern.

The size/quality curve

Larger models perform better, with diminishing returns. The curve has different shapes per task:

Image classification: smooth curve; good small models exist
Language modeling: relatively smooth; emergent capabilities at scale
Specialized tasks: often plateau quickly

For most production tasks, you're far from the diminishing-returns asymptote with much smaller models than the largest.

Quality vs cost tradeoffs

You can almost always trade quality for cost:

Smaller model: less compute, lower quality
Quantization: less memory, sometimes lower quality
Distillation: smaller, with less quality drop
Pruning: less compute, sometimes quality preserved

The right tradeoff depends on:

Quality threshold (how much can you afford to drop?)
Cost sensitivity (how much does latency/$ matter?)
Volume (high volume amplifies cost differences)

Efficient architectures

Vision

MobileNet, EfficientNet: designed for efficiency
ConvNeXt-T: small, modern, strong
DistilBERT, MiniLM: distilled BERTs

Language

Distil family (DistilBERT, DistilGPT): 40% smaller, 95% performance
MiniLM: efficient embeddings
Phi family (Microsoft): small but strong
Gemma 2B, Llama 3.2 1B: small LLMs

Tabular

LightGBM: usually faster than XGBoost
Linear models: fastest; surprisingly often sufficient

Distillation

Train smaller "student" model to mimic larger "teacher."

Steps:

Train (or use) a strong teacher
Generate teacher predictions on lots of data
Train student to match teacher (and labels)

Often: 10x smaller, 5% quality loss.

Best when:

You have lots of unlabeled data
Teacher is much better than from-scratch student
Inference cost dominates training

Quantization

Reduce numerical precision:

FP16: usually free quality-wise; 2x speedup
INT8: ~1% quality drop typical; 2-4x speedup
INT4: 1-3% quality drop; further speedup

Different parts of the model tolerate different precision. Mixed-precision quantization optimizes.

For LLMs, INT4 quantization (GPTQ, AWQ) is standard for efficiency.

Pruning

Remove weights with minimal impact:

Unstructured pruning

Zero out individual weights. Reduces model size, but rarely speeds up inference (sparse compute is slow on most hardware).

Structured pruning

Remove whole heads, channels, layers. Speeds up real inference.

Generally less effective per-parameter than unstructured, but actually helps in production.

Sparse models

Mixture of Experts (MoE): only some experts active per input. Effective parameter count > active parameter count.

Used in large open models (Mixtral, DBRX). High parameter count; modest active compute.

Smaller pretrained models

Many tasks don't need the largest pretrained model:

Sentence-transformers/all-MiniLM-L6-v2 vs all-mpnet-base-v2: 3x smaller, ~3% quality drop on retrieval
DistilBERT vs BERT: similar story

Test with the smaller model first.

Architecture changes for efficiency

Smaller hidden dimensions

Fewer parameters per layer.

Fewer layers

Linear cost reduction.

Shared parameters

Same weights across layers (Universal Transformers).

Linear attention

Replace quadratic attention with linear. Variable quality impact.

These tradeoffs are model- and task-dependent.

Hardware-aware selection

The "best" model depends on your hardware:

GPU memory determines max model size
Memory bandwidth determines speed for large models
Compute determines speed for small models
CPU vs GPU economics differ dramatically

Profile candidate models on actual deployment hardware.

Latency budget

Set a latency budget upfront:

Inference: X ms
Preprocessing: Y ms
Network: Z ms

Eliminate models that don't fit. Don't pick the best then try to optimize down.

Cost projection

Estimate cost at production volume:

Cost per request × requests = total

Often the answer changes the model choice:

1 QPS: any model is cheap
100 QPS: cost differences matter
10K QPS: only efficient models viable

Model variants

Many model families come in size variants:

Llama 3: 1B, 3B, 8B, 70B, 405B
Mistral: 7B, 22B, 70B
Claude: Haiku, Sonnet, Opus

Try the smallest. Step up if needed.

Two-stage approaches

Use a small model first; escalate to large model when uncertain:

Confident small-model output → use it
Uncertain → escalate

Routing can save 80%+ inference cost while maintaining quality.

Common failure patterns

Choosing the strongest available model

The best benchmark model isn't always the best production model.

Not measuring cost

Cost feels abstract until the bill arrives.

Skipping efficiency analysis

"We'll optimize later" tends to mean never.

Quality drops after quantization

Test quality after every optimization.

Ignoring latency tail

p99 latency > p50 latency in user experience impact.

Insufficient testing of small models

Often the small model is enough; teams don't try.

Practical workflow

Define latency budget and cost budget
List candidate models that meet budget at typical sizes
Pick smallest plausible candidate
Evaluate quality
If insufficient, step up size
Apply efficiency optimizations (quantization, distillation) as needed
Profile end-to-end before deployment

When efficiency doesn't matter

Internal tools, low volume
Prototypes
Tasks where quality dwarfs cost

For these, just pick the best model and ship.