Handling Embedding Service Outages

Hybrid retrieval is designed fail-closed: when the embedding service goes away, the system reverts to BM25 with no user-visible error and no service interruption. The trade-off is a quality drop that agents need to detect.

When to use this runbook

When retrieval is misbehaving and you suspect (or want to confirm) it's the embedding side.

Context

HybridSearchService.rerank() is the integration point. It calls the embedding service for each query, fuses dense scores with BM25 via Reciprocal Rank Fusion (k=60), and returns the merged result list. When the embedding call throws, the dense list is treated as empty and RRF reduces to BM25 ranks.

The local default embedding service is Ollama running nomic-embed-v1.5 on port 11434. Misbehaviour usually traces to (a) the Ollama process crashed, (b) the model wasn't loaded, or (c) GPU/memory pressure delaying inference.

Walkthrough

The frontmatter steps are the canonical diagnostic sequence: confirm Ollama, restart if needed, look at logs, broaden the query while in fallback, check metrics.

Pitfalls

The frontmatter pitfalls capture the failure modes. The "BM25 fallback is a defect" misreading is the most common — it's a feature, and chasing it as a bug wastes time.