Deploying Ollama: Local Inference Infrastructure
Ollama is a widely used tool for running Large Language Models (LLMs) locally with minimal setup overhead. For production use, it requires careful hardware sizing and service-level management.
Hardware Sizing: vRAM Requirements
The most critical factor in local inference is available video RAM (vRAM). If a model does not fit entirely in vRAM, Ollama runs the layers that do not fit on the CPU from system RAM, which is significantly slower (often 10x-50x).
Llama 3 vRAM Table (Approximate)
| Model Size | Quantization | vRAM Required | Recommended Hardware |
|---|---|---|---|
| **Llama 3 8B** | 4-bit (Q4_K_M) | ~5.5 GB | RTX 3060 (12GB) / Apple M1+ |
| **Llama 3 8B** | 8-bit (Q8_0) | ~9.0 GB | RTX 3080 (10GB+) / Apple M1+ |
| **Llama 3 70B** | 4-bit (Q4_K_M) | ~40.0 GB | 2x RTX 3090/4090 (48GB) / A6000 |
| **Llama 3 70B** | 8-bit (Q8_0) | ~72.0 GB | 2x A6000 / A100 / Mac Studio (128GB) |
**Note on Unified Memory:** Apple Silicon (Mac Studio/Pro) uses unified memory, meaning system RAM can be allocated to the GPU. For 70B models, a Mac with 64GB+ RAM is often the most cost-effective local solution.
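The figures in the table can be approximated with simple arithmetic: parameter count times bits per weight, plus an allowance for the KV cache and runtime buffers. The sketch below is a back-of-the-envelope estimate, not how Ollama itself sizes models. The function names, the flat 1 GB overhead, the 10% headroom, and the 4.5 and 8 bits-per-weight inputs are all assumptions chosen for illustration. Real quantization formats store some per-block metadata, and the overhead grows with context length.

```python
def estimate_vram_gb(params_billion, bits_per_weight, overhead_gb=1.0):
    """Rough vRAM estimate in GB: quantized weights plus a flat runtime allowance."""
    # 1e9 parameters * bits / 8 bits-per-byte / 1e9 bytes-per-GB simplifies to this.
    weights_gb = params_billion * bits_per_weight / 8
    # overhead_gb is an assumed flat allowance for KV cache and buffers.
    return weights_gb + overhead_gb


def fits_in_vram(required_gb, available_gb, headroom=0.9):
    """True if the estimate fits while leaving part of the vRAM free (assumed margin)."""
    return required_gb <= available_gb * headroom


# Illustrative inputs only: 4.5 and 8 bits per weight are assumed round figures.
for name, params, bits in [
    ("8B 4-bit", 8, 4.5),
    ("8B 8-bit", 8, 8),
    ("70B 4-bit", 70, 4.5),
    ("70B 8-bit", 70, 8),
]:
    print(f"{name}: ~{estimate_vram_gb(params, bits):.1f} GB")
```

With these assumed inputs the estimates land close to the table's figures, which is a useful sanity check when sizing hardware for a model or quantization the table does not list.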
Production Deployment: systemd
On Linux, Ollama should run as a systemd service to ensure it restarts after crashes or reboots.
Example Service File
Create `/etc/systemd/system/ollama.service`:
```ini
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
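# Listen on all interfaces. The API has no built-in authentication, so restrict access with a firewall or reverse proxy.
Environment="OLLAMA_HOST=0.0.0.0"
# Accept cross-origin requests from any origin. Narrow this to known origins where possible.
Environment="OLLAMA_ORIGINS=*"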
[Install]
WantedBy=default.target
```
**Concrete Command:** After creating the file, enable and start the service:
```bash
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
```
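Once the service is started, one way to confirm it is actually answering requests is to query its HTTP API. The following is a minimal sketch that assumes the default port 11434 and the `/api/version` endpoint. The helper name `ollama_version` is hypothetical, and the timeout value is an arbitrary choice.

```python
import json
import urllib.error
import urllib.request


def ollama_version(base_url="http://127.0.0.1:11434", timeout=2.0):
    """Return the server's version string, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            # The endpoint is expected to return a JSON object with a "version" field.
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply all count as "not up".
        return None


version = ollama_version()
print("Ollama is up, version", version) if version else print("Ollama is not reachable")
```

A check like this can be wired into a deployment script or an external health probe so that a failed start is noticed immediately, not at first use.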
GPU Pass-through with Docker
To run Ollama inside Docker with GPU acceleration, you must install the **NVIDIA Container Toolkit** on the host.
Docker Compose Configuration
```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ./ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
Advanced Configuration: Model Customization
Use a `Modelfile` to bake system prompts and parameters into a custom model tag.
Concrete Example: Creative Assistant
Create a file named `CreativeAssistant.Modelfile`:
```dockerfile
FROM llama3:8b
# Set creativity parameters
PARAMETER temperature 0.8
PARAMETER top_p 0.9
# Set the persona
SYSTEM """
You are a creative writing assistant. You favor vivid imagery and metaphor.
Keep responses under 200 words unless asked otherwise.
"""
```
Then create the model:
```bash
ollama create creative-llama -f CreativeAssistant.Modelfile
ollama run creative-llama "Describe a cyberpunk city in the rain."
```
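The custom tag can also be called over Ollama's HTTP API, which is how most applications would consume it. The sketch below assumes the server is listening on the default `127.0.0.1:11434` and uses the `/api/generate` endpoint with streaming disabled. The function names `build_generate_request` and `generate` are hypothetical helpers, and the 120-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_generate_request(model, prompt, base_url="http://127.0.0.1:11434"):
    """Build a POST request for /api/generate that asks for one non-streamed reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, base_url="http://127.0.0.1:11434", timeout=120.0):
    """Send the request and return the generated text from the "response" field."""
    req = build_generate_request(model, prompt, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]


# Usage (requires a running server and the creative-llama model created above):
# print(generate("creative-llama", "Describe a cyberpunk city in the rain."))
```

Because the system prompt and sampling parameters are baked into the tag, the client only needs to send the model name and the prompt.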
Monitoring Performance
Set the `OLLAMA_DEBUG=1` environment variable for the server to get detailed logging of how many layers are offloaded to the GPU. During inference, check GPU utilization with:
```bash
nvidia-smi -l 1
```
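Watch the **GPU-Util** and **Memory-Usage** columns: memory usage close to the model's expected footprint, together with high utilization during generation, indicates the model is resident in vRAM. `ollama ps` also reports how each loaded model is split between CPU and GPU.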
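For automated monitoring, the same numbers can be read in machine-friendly form using `nvidia-smi`'s `--query-gpu` option with CSV output. The following is a rough sketch. The helper names `parse_gpu_line` and `read_gpus` are hypothetical, and the parser assumes every field is numeric, which may not hold on GPUs that report `[N/A]` for some values.

```python
import subprocess

# Fields requested from nvidia-smi, in the order the parser expects them.
QUERY = "memory.used,memory.total,utilization.gpu"


def parse_gpu_line(line):
    """Parse one CSV row such as '5632, 12288, 87' (MiB used, MiB total, percent)."""
    used, total, util = (int(field.strip()) for field in line.split(","))
    return {
        "used_mib": used,
        "total_mib": total,
        "util_pct": util,
        "used_fraction": used / total,
    }


def read_gpus():
    """Run nvidia-smi once and return one dict per GPU (requires NVIDIA drivers)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


# Made-up sample row, used here only to show the parsed shape.
print(parse_gpu_line("5632, 12288, 87"))
```

Polling `read_gpus()` on an interval and comparing `used_mib` against the model's expected footprint gives a simple signal for detecting partial CPU offload.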