Multimodal Embeddings

Multimodal embeddings map inputs from distinct modalities (images, audio, text) into a single, shared $d$-dimensional vector space. In this space, semantically related inputs land close together regardless of modality: the string "firewall logs showing exfiltration" and a screenshot of a Grafana dashboard showing a spike in outbound traffic should have a small distance (high cosine similarity).

1. Architecture: The Projection Layer

Modern multimodal models (CLIP, SigLIP, ImageBind) use an independent encoder for each modality. Each encoder is followed by a projection layer that maps its output into the shared space, where training aligns the modalities.

CLIP (Contrastive Language-Image Pretraining)

CLIP uses a dual-encoder architecture. The training objective is to maximize the cosine similarity of $N$ correct pairs in a batch while minimizing the similarity of the $N^2 - N$ incorrect pairs.

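$$\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^N \left( \log \frac{\exp(\cos(\mathbf{t}_i, \mathbf{v}_i)/\tau)}{\sum_{j=1}^N \exp(\cos(\mathbf{t}_i, \mathbf{v}_j)/\tau)} + \log \frac{\exp(\cos(\mathbf{t}_i, \mathbf{v}_i)/\tau)}{\sum_{j=1}^N \exp(\cos(\mathbf{t}_j, \mathbf{v}_i)/\tau)} \right)$$

where $\mathbf{t}_i$ and $\mathbf{v}_i$ are the text and image embeddings of the $i$-th pair, and $\tau$ is a learnable temperature parameter. The leading minus sign makes this a loss to minimize: the first term is a softmax over images for each text, the second a softmax over texts for each image.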
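To make the loss concrete, here is a minimal sketch in plain Python that evaluates it on small lists of vectors. It is illustrative only: the function names are hypothetical, the default `tau=0.07` is an assumed starting value rather than a learned one, and real training code would use batched tensor operations.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def log_softmax_at(logits, index):
    """log(softmax(logits)[index]), computed stably by subtracting the max."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[index] - log_sum

def clip_loss(text_embs, image_embs, tau=0.07):
    """Symmetric contrastive loss over N matched (text, image) pairs."""
    n = len(text_embs)
    # sim[i][j] = cos(t_i, v_j) / tau
    sim = [[cosine(t, v) / tau for v in image_embs] for t in text_embs]
    total = 0.0
    for i in range(n):
        row = sim[i]                         # text i scored against every image
        col = [sim[j][i] for j in range(n)]  # image i scored against every text
        total += log_softmax_at(row, i) + log_softmax_at(col, i)
    return -total / (2 * n)
```

With correctly matched pairs the loss approaches zero; if every embedding is identical, it equals $\log N$, the value for a uniform guess.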

SigLIP (Sigmoid Loss)

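SigLIP (Google, 2023) replaces the softmax over the whole batch with a simple pairwise sigmoid loss: every text-image pair is treated as an independent binary classification (match or no match). Because no batch-wide normalization is needed, the loss is more memory-efficient, which allows much larger batch sizes, and its authors also report better results than softmax at small batch sizes.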
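The pairwise idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the reference implementation: inputs are assumed to be unit-normalized, the function names are hypothetical, and the defaults `t=10.0` and `b=-10.0` stand in for the learnable temperature and bias.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pair_loss(text_embs, image_embs, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over an N x N grid of (text, image) pairs."""
    n = len(text_embs)
    total = 0.0
    for i, text in enumerate(text_embs):
        for j, image in enumerate(image_embs):
            dot = sum(x * y for x, y in zip(text, image))
            label = 1.0 if i == j else -1.0  # +1 for matched pairs, -1 otherwise
            # Each pair contributes on its own; no sum over the batch in a denominator.
            total += log_sigmoid(label * (t * dot + b))
    return -total / n
```

Note that each term depends on a single pair only, which is what removes the batch-wide softmax normalization.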

2. Implementation: Zero-Shot Classification

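The primary advantage of multimodal embeddings is classification without retraining: candidate labels are just text, so a new label set only needs to be embedded and compared against the image embedding.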

import torch
from open_clip import create_model_and_transforms, get_tokenizer

model, _, preprocess = create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
model.eval()  # inference mode: disables training-only behavior such as dropout
tokenizer = get_tokenizer('ViT-B-32')

def classify_image(image, labels):
    # 1. Preprocess and Encode Image
    image_input = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        image_features = model.encode_image(image_input)
        image_features /= image_features.norm(dim=-1, keepdim=True)

    # 2. Encode Text Labels
    text_inputs = tokenizer(labels)
    with torch.no_grad():
        text_features = model.encode_text(text_inputs)
        text_features /= text_features.norm(dim=-1, keepdim=True)

    # 3. Compute Probabilities
    # Both sides are unit-normalized, so the dot product equals cosine similarity.
    # The factor 100.0 is a fixed stand-in for the model's learned logit scale (1 / tau).
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    return {label: prob.item() for label, prob in zip(labels, probs[0])}

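# Usage ("screenshot.png" is a placeholder path)
from PIL import Image

img = Image.open("screenshot.png").convert("RGB")
labels = ["a network diagram", "a code snippet", "a natural landscape", "a security alert"]
results = classify_image(img, labels)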

3. Production Retrieval Patterns

For technical documentation, retrieval must span text and diagrams.

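Ingestion: For every text chunk, image, and table in a document, generate an embedding with the same multimodal model, so that all vectors live in one shared space.
Storage: Store the vectors in pgvector or Qdrant with an HNSW index.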
Query: The user asks "show me the load balancer configuration". We embed this text query and search across both the text chunks and the image embeddings.
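The three steps above can be sketched with an in-memory list standing in for the vector store. Everything here is hypothetical scaffolding: the `IndexedItem` type, the document IDs, and the hand-written 3-dimensional vectors (real vectors would come from the model's text and image encoders, and a real deployment would query pgvector or Qdrant instead of scanning a list).

```python
import math
from dataclasses import dataclass

@dataclass
class IndexedItem:
    doc_id: str
    kind: str     # "text" or "image"
    vector: list  # unit-normalized embedding

def normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def search(index, query_vec, top_k=3):
    """Brute-force cosine search over text and image items alike."""
    q = normalize(query_vec)
    scored = [(sum(a * b for a, b in zip(q, item.vector)), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Ingestion: toy vectors standing in for real encoder output.
index = [
    IndexedItem("lb-config.md#chunk-2", "text", normalize([0.9, 0.1, 0.0])),
    IndexedItem("lb-topology.png", "image", normalize([0.8, 0.3, 0.1])),
    IndexedItem("release-notes.md#chunk-7", "text", normalize([0.0, 0.2, 0.9])),
]

# Query: pretend this is the embedding of "show me the load balancer configuration".
query = [1.0, 0.2, 0.0]
hits = search(index, query, top_k=2)
for score, item in hits:
    print(f"{score:.3f}  {item.kind:<5}  {item.doc_id}")
```

Because text chunks and images share one space, a single search returns both kinds of item in one ranked list.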

The "Late Fusion" Trap

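Avoid running separate text and image searches and merely concatenating the results: scores from different retrievers are not on a comparable scale, so one list ends up ranked above the other for no good reason. Use Reciprocal Rank Fusion (RRF), which merges lists by rank position rather than raw score, to combine the dense multimodal results with traditional BM25 text results. The BM25 side matters most when the query contains specific identifiers (like UUIDs or filenames), which dense embeddings tend to match poorly.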
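A minimal sketch of RRF follows. Each document receives $1 / (k + \text{rank})$ from every list it appears in, and the contributions are summed. The constant `k=60` is a commonly used default and an assumption here; the document IDs are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from the dense multimodal index and from BM25.
dense = ["diagram-7.png", "chunk-12", "chunk-3"]
bm25 = ["chunk-12", "chunk-40", "diagram-7.png"]

fused = reciprocal_rank_fusion([dense, bm25])
print(fused)
```

Items that rank well in both lists (here `chunk-12` and `diagram-7.png`) rise to the top, and no score calibration between the two retrievers is needed.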

4. Design Trade-offs

Feature	Low Dimension (256-512)	High Dimension (1024+)
Memory	Lower (about 1-2 KB per float32 vector)	Higher (4 KB or more per float32 vector)
Search Speed	Faster (cheaper distance computations)	Slower (per-comparison cost grows linearly with dimensionality)
Nuance	Coarse (Good for general topics)	High (Can distinguish subtle UI changes)
Model Size	Often smaller encoders (may run on-device)	Often larger encoders (typically served on GPUs)
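The memory row is easy to check with arithmetic. The sketch below estimates raw float32 storage for a hypothetical corpus of 10 million vectors; it deliberately ignores index overhead (such as HNSW graph links), which adds to the total.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_component=4):
    """Raw storage for the vectors alone (float32 by default), excluding index overhead."""
    return num_vectors * dim * bytes_per_component

NUM_VECTORS = 10_000_000  # hypothetical corpus size

for dim in (512, 1024):
    gib = raw_vector_bytes(NUM_VECTORS, dim) / 2**30
    print(f"{dim:>5} dims: {gib:.1f} GiB")
```

At this corpus size, 512 dimensions need roughly 19 GiB and 1024 dimensions roughly 38 GiB, so whether the index fits in RAM depends on corpus size as much as on dimensionality.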

5. Deployment Checklist

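Normalization: Always normalize vectors to unit length ($L_2$ norm) before indexing if using cosine similarity. On unit vectors, the inner product equals the cosine similarity.
Quantization: Use int8 or binary quantization for the vector index to reduce memory footprint by about 4x (int8) to 32x (binary) relative to float32. Recall loss is usually small for int8. Binary quantization typically needs oversampling and rescoring against the original vectors to stay close to full-precision recall, so measure the drop on your own data.
OCR Pre-processing: Multimodal models like CLIP are notoriously bad at reading small text within images, partly because inputs are downscaled to a low resolution (224x224 pixels for ViT-B-32). For screenshots, append OCR-extracted text to the image metadata for hybrid search.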
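The normalization and quantization items can be illustrated with a small sketch in plain Python. The function names are hypothetical and the schemes are deliberately simple (symmetric scalar int8, sign-based binary); vector databases implement their own variants, so treat this as an explanation of the idea rather than of any product's behavior.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def quantize_int8(unit_vec):
    """Map components in [-1, 1] to integers in [-127, 127]: 1 byte each instead of 4."""
    return [max(-127, min(127, round(x * 127))) for x in unit_vec]

def quantize_binary(vec):
    """Keep only the sign of each component: 1 bit each, packed into a Python int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits; a cheap proxy for distance between binary codes."""
    return bin(a ^ b).count("1")

v = l2_normalize([3.0, -4.0, 0.0, 12.0])
print(quantize_int8(v))
print(bin(quantize_binary(v)))
```

A common pattern with binary codes is to fetch more candidates than needed by Hamming distance, then rescore that shortlist with the original float vectors.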