Multimodal Embeddings
Multimodal embeddings map inputs from different modalities (images, audio, text) into a single, shared $d$-dimensional vector space. In a well-trained space, semantically related items land close together regardless of modality: the string `"firewall logs showing exfiltration"` should sit near a screenshot of a Grafana dashboard showing a spike in outbound traffic.
1. Architecture: The Projection Layer
Modern multimodal models (CLIP, SigLIP, ImageBind) consist of independent encoders for each modality, followed by a **projection layer** that aligns their outputs.
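As a rough illustration of what the projection step does, here is a minimal NumPy sketch. The encoder output widths (768 and 512), the shared size (256), and the random weight matrices are all assumptions chosen for illustration; in a real model the projections are learned jointly with the encoders and may not be plain linear maps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output widths and shared embedding size (illustrative values).
VISION_DIM, TEXT_DIM, SHARED_DIM = 768, 512, 256

# Random matrices stand in for the learned projection of each modality.
W_vision = rng.normal(scale=0.02, size=(VISION_DIM, SHARED_DIM))
W_text = rng.normal(scale=0.02, size=(TEXT_DIM, SHARED_DIM))


def project(features: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Linearly project encoder features into the shared space and L2-normalize."""
    shared = features @ weight
    return shared / np.linalg.norm(shared, axis=-1, keepdims=True)


# Stand-ins for encoder outputs: 4 images and 4 captions.
image_emb = project(rng.normal(size=(4, VISION_DIM)), W_vision)
text_emb = project(rng.normal(size=(4, TEXT_DIM)), W_text)

# Both modalities now share one dimensionality, so cosine similarity is a matrix product.
similarity = image_emb @ text_emb.T  # shape (4, 4)
```

The point of the sketch is only that encoders with different output widths become comparable once each is projected to the same $d$ and normalized.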
CLIP (Contrastive Language-Image Pretraining)
CLIP uses a dual-encoder architecture. The training objective maximizes the cosine similarity of the $N$ correct pairs in a batch while minimizing the similarity of the $N^2 - N$ incorrect pairs:

$$\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^N \left( \log \frac{\exp(\cos(\mathbf{t}_i, \mathbf{v}_i)/\tau)}{\sum_{j=1}^N \exp(\cos(\mathbf{t}_i, \mathbf{v}_j)/\tau)} + \log \frac{\exp(\cos(\mathbf{t}_i, \mathbf{v}_i)/\tau)}{\sum_{j=1}^N \exp(\cos(\mathbf{t}_j, \mathbf{v}_i)/\tau)} \right)$$

where $\mathbf{t}_i$ is the text embedding and $\mathbf{v}_i$ the image embedding of pair $i$, and $\tau$ is a learnable temperature parameter. The leading minus sign makes $\mathcal{L}$ a quantity to minimize: the first term is a softmax over images for each text, the second a softmax over texts for each image.
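The symmetric contrastive objective described above can be sketched in a few lines of NumPy. This is an illustrative forward pass only, not CLIP's training code: `clip_loss` and `log_softmax` are names introduced here, and `tau` is held at a fixed assumed value of 0.07 rather than learned.

```python
import numpy as np


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_loss(text_emb: np.ndarray, image_emb: np.ndarray, tau: float = 0.07) -> float:
    """Symmetric contrastive loss over N matched (text, image) pairs."""
    # L2-normalize so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / tau  # logits[i, j] = cos(t_i, v_j) / tau
    diag = np.arange(len(logits))
    text_to_image = log_softmax(logits, axis=1)[diag, diag]  # softmax over images
    image_to_text = log_softmax(logits, axis=0)[diag, diag]  # softmax over texts
    # Negative mean log-probability of the correct pair, averaged over both directions.
    return float(-(text_to_image + image_to_text).mean() / 2)
```

Because every row and column of the logit matrix is normalized over the whole batch, each pair's loss depends on all other pairs in that batch.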
SigLIP (Sigmoid Loss)
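SigLIP (Google, 2023) replaces the batch-wide softmax with a pairwise sigmoid loss: every image-text pair is scored as an independent binary decision (match or no match). Because no normalization over the full batch is required, the loss is more memory-efficient to scale to very large batches, and the SigLIP authors also report that it holds up better than the softmax loss at small batch sizes.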
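A minimal NumPy sketch of a pairwise sigmoid loss of this kind follows. The function name `siglip_loss` is introduced here, and `temperature` and `bias` are fixed at assumed values (10 and -10); in actual training they would be learnable parameters, and the real implementation may differ in detail.

```python
import numpy as np


def siglip_loss(text_emb, image_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss: each (text, image) pair is an independent binary decision."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = temperature * (t @ v.T) + bias
    # +1 on the diagonal (matched pairs), -1 everywhere else (mismatched pairs).
    labels = 2.0 * np.eye(len(logits)) - 1.0
    # -log(sigmoid(z)) == log(1 + exp(-z)); logaddexp keeps this numerically stable.
    per_pair = np.logaddexp(0.0, -labels * logits)
    return float(per_pair.sum() / len(logits))
```

Each entry of `per_pair` depends only on its own logit, so unlike a softmax there is no sum over the batch inside the per-pair term.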
2. Implementation: Zero-Shot Classification
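A key advantage of multimodal embeddings is classification without retraining: each candidate label is embedded as text, and the image is assigned to the label whose embedding is most similar to its own.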
```python
import torch
from PIL import Image
from open_clip import create_model_and_transforms, get_tokenizer

# Returns (model, train_transform, eval_transform); only the eval transform is used here.
model, _, preprocess = create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
model.eval()
tokenizer = get_tokenizer('ViT-B-32')

def classify_image(image, labels):
    """Return {label: probability} for a PIL image, with no retraining."""
    # 1. Preprocess and encode the image
    image_input = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        image_features = model.encode_image(image_input)
        image_features /= image_features.norm(dim=-1, keepdim=True)

    # 2. Encode the text labels
    text_inputs = tokenizer(labels)
    with torch.no_grad():
        text_features = model.encode_text(text_inputs)
        text_features /= text_features.norm(dim=-1, keepdim=True)

    # 3. Compute probabilities
    # Dot product of unit vectors is cosine similarity; 100.0 is a fixed logit scale
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    return {label: prob.item() for label, prob in zip(labels, probs[0])}

# Usage ("screenshot.png" is a placeholder path)
img = Image.open("screenshot.png")
labels = ["a network diagram", "a code snippet", "a natural landscape", "a security alert"]
results = classify_image(img, labels)
```
3. Production Retrieval Patterns
Cross-Modal RAG
For technical documentation, retrieval must span text and diagrams.
* **Ingestion:** For every image/table in a document, generate a multimodal embedding.
* **Storage:** Store in `pgvector` or `Qdrant` using HNSW.
* **Query:** The user asks `"show me the load balancer configuration"`. We embed this text query and search across *both* the text chunks and the image embeddings.
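The ingestion, storage, and query steps above can be sketched with a brute-force search over one mixed index. This is a toy stand-in, not a `pgvector` or `Qdrant` client: the 3-dimensional vectors, the item ids, and the `search` helper are all hypothetical, and a real deployment would replace the matrix product with an HNSW query.

```python
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit L2 length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Toy 3-d vectors stand in for real embeddings; ids and kinds are hypothetical.
index = [
    {"id": "doc1#chunk4", "kind": "text", "vec": [0.9, 0.1, 0.0]},
    {"id": "doc1#fig2", "kind": "image", "vec": [0.8, 0.3, 0.1]},
    {"id": "doc7#chunk1", "kind": "text", "vec": [0.0, 0.2, 0.9]},
]
matrix = normalize(np.array([item["vec"] for item in index], dtype=np.float32))


def search(query_vec, k=2):
    """Return the top-k items of any modality, ranked by cosine similarity."""
    scores = matrix @ normalize(np.asarray(query_vec, dtype=np.float32))
    top = np.argsort(-scores)[:k]
    return [(index[i]["id"], index[i]["kind"], float(scores[i])) for i in top]


# The embedded text query is compared against text chunks and images alike.
hits = search([1.0, 0.2, 0.0])
```

Because text chunks and image embeddings live in the same space, one query vector ranks both; here the top two hits are a text chunk and a figure from the same document.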
The "Late Fusion" Trap
Avoid running separate text and image searches and simply concatenating the two result lists. Instead, use **Reciprocal Rank Fusion (RRF)** to merge the dense multimodal ranking with a traditional BM25 text ranking. This matters most when the query contains specific identifiers (like UUIDs or filenames), which dense embeddings tend to match poorly.
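A minimal sketch of RRF in plain Python is shown below. Note that RRF works on ranks rather than raw scores, which sidesteps the problem of cosine similarities and BM25 scores living on different scales. The function name, the document ids, and the two result lists are hypothetical; `k=60` is a commonly used default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; each list is ordered best-first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_hits = ["fig2", "chunk4", "chunk9"]  # hypothetical multimodal embedding results
bm25_hits = ["chunk4", "chunk7", "fig2"]   # hypothetical BM25 keyword results
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Documents that appear near the top of both lists (`chunk4`, `fig2`) rise above documents that only one retriever found.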
4. Design Trade-offs
| Feature | Low Dimension (256-512) | High Dimension (1024+) |
|---|---|---|
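| **Memory** | Lower (about 1-2 KB per float32 vector) | Higher (4 KB or more per float32 vector; large corpora may need quantization or a disk-backed index) |
| **Search Speed** | Faster (fewer operations per distance computation) | Slower (cost per distance computation grows linearly with dimensionality) |
| **Nuance** | Coarse (Good for general topics) | High (Can distinguish subtle UI changes) |
| **Model Size** | Typically smaller encoders, some suitable for on-device use | Typically larger encoders, usually served from GPUs |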
5. Deployment Checklist
1. **Normalization:** Always normalize vectors to unit length ($L_2$) before indexing if using Cosine Similarity.
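2. **Quantization:** Use `int8` or Binary Quantization for the vector index to reduce the memory footprint relative to float32 by 4x (`int8`) to 32x (binary). Recall loss is often small for `int8`; binary quantization usually needs a rescoring pass against the original vectors to stay close, so measure recall on your own data before committing.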
3. **OCR Pre-processing:** Multimodal models like CLIP are notoriously bad at reading small text within images. For screenshots, append OCR-extracted text to the image metadata for hybrid search.
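The normalization and quantization items above can be illustrated with NumPy. This is a sketch under stated assumptions: the random vectors stand in for real embeddings, and the single global `scale` is a deliberately simple `int8` scheme; vector databases implement their own calibrated variants.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 512)).astype(np.float32)  # stand-in embeddings

# 1. L2-normalize so that the inner product equals cosine similarity.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# 2a. Scalar int8 quantization: one byte per dimension instead of four.
scale = 127.0 / np.abs(unit).max()
int8_codes = np.round(unit * scale).astype(np.int8)

# 2b. Binary quantization: keep only the sign, one bit per dimension.
binary_codes = np.packbits(unit > 0, axis=1)

int8_ratio = unit.nbytes // int8_codes.nbytes      # 4
binary_ratio = unit.nbytes // binary_codes.nbytes  # 32
```

The ratios follow directly from the storage width per dimension (32 bits, 8 bits, 1 bit); the recall impact is not shown here and has to be measured against the actual index and queries.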
Further Reading
* [EmbeddingsVectorDB](EmbeddingsVectorDB)
* [VectorDatabases](VectorDatabases)
* [AiPoweredSearch](AiPoweredSearch)
* [HybridRetrieval](HybridRetrieval)