LLM Fine-Tuning
Fine-tuning is the correct solution for format adherence and style injection, but it is almost always the wrong solution for teaching a model new facts. For facts, use RAG; for behavior, use fine-tuning.
The Hierarchy of Model Improvement
1. **Prompt Engineering:** Costs $0.00. Solves 80% of issues.
2. **Few-Shot Prompting:** Costs tokens. Solves 10% more.
3. **RAG:** Costs infra + tokens. Solves factual drift.
4. **Fine-Tuning:** Costs compute + engineering time. Required only for 99.9% reliability on complex schemas or deep domain vernacular.
Parameter-Efficient Fine-Tuning (PEFT)
Full fine-tuning (updating all weights) is prohibitively expensive for most teams. **LoRA (Low-Rank Adaptation)** and its quantized sibling **QLoRA** are the production standards. They freeze the base model and train small adapter matrices ($A$ and $B$) that are injected into the attention layers.
| Technique | VRAM (7B model) | Accuracy Impact | Training Time |
|---|---|---|---|
| **Full Fine-Tuning** | >160 GB | Baseline | Slow |
| **LoRA** | ~24-28 GB | Negligible | Fast |
| **QLoRA (4-bit)** | ~12-16 GB | 1-2% drop | Moderate |
QLoRA Hyperparameter Recipe
For a Llama 3.1 8B or Mistral 7B model on a single 24GB A10G or 3090/4090:
```python
Reference config for Hugging Face PEFT/TRL
config = LoraConfig(
r=16, # Rank: Higher = more capacity, but higher VRAM
lora_alpha=32, # Scaling factor: typically 2 * r
target_modules=[ # Target ALL linear layers for best results
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
```
Data: The Hardest Part
Fine-tuning is extremely sensitive to data quality. **500 high-quality, human-curated examples** will outperform 50,000 synthetic examples every time.
- **Diversity is Mandatory:** If your dataset uses the same prompt template 90% of the time, the model will overfit to the template, not the task. Mix in 10-20% general instruction data (e.g., SlimOrca or ShareGPT) to prevent "catastrophic forgetting"—where the model loses general reasoning ability.
- **Negatives are Critical:** Include examples where the correct answer is "I don't know" or a refusal. Without them, the model will learn to hallucinate an answer for any out-of-distribution input.
Evaluation and Convergence
Never trust training loss. A model can have zero training loss but fail in production because it simply memorized the dataset (overfitting).
1. **Eval Loss:** Track loss on a held-out set (10% of data). If eval loss starts rising while training loss falls, stop training immediately.
2. **Benchmark Regression:** After fine-tuning, run the model through a general benchmark (e.g., MMLU). A drop of >5% suggests you've over-fitted and destroyed the model's base intelligence.
3. **Format Validation:** If fine-tuning for JSON, run 1000 test cases through a schema validator. Aim for >99.5% validity.
Serving Fine-Tuned Models
Do not merge the LoRA weights into the base model if you have multiple tasks. Serve the base model with **vLLM or LoRAX**, which allow you to swap adapters dynamically at request time with negligible latency overhead.
Further Reading
- [ModelQuantization](ModelQuantization) — Deep dive into NF4 and 4-bit loading.
- [LlmEvaluationMetrics](LlmEvaluationMetrics) — How to build a custom eval harness.
- [RagImplementationPatterns](RagImplementationPatterns) — Why you should probably do RAG instead.