Document Clustering Approaches
Document clustering surfaces latent themes in unstructured text without manual labeling. In the age of LLMs, clustering is primarily used for **Topic Discovery**, **Data Pruning** (removing redundant training samples), and **Semantic Navigation**.
The Modern NLP Pipeline: BERTopic Pattern
A widely used architecture for clustering documents, popularized by BERTopic, is a four-step pipeline:
1. **Embedding:** Convert text into dense vectors (e.g., `all-MiniLM-L6-v2` or `BGE-base`).
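2. **Dimensionality Reduction (UMAP):** High-dimensional vector space is sparse and prone to the "curse of dimensionality," where distances between points become less informative. UMAP reduces the embedding (384 dimensions for `all-MiniLM-L6-v2`, 768 for `BGE-base`) to 5-10 dimensions while preserving local structure.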
3. **Clustering (HDBSCAN):** Unlike K-Means, HDBSCAN does not require you to pre-define the number of clusters. It identifies dense "islands" of points and labels outliers as noise.
4. **Representation (c-TF-IDF):** Extract keywords from each cluster to give it a human-readable name.
```python
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
# Configure for density-based clustering
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
# HDBSCAN runs on the UMAP-reduced vectors, so Euclidean distance is appropriate here
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', prediction_data=True)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    calculate_probabilities=True,
)

# my_documents is assumed to be a list of strings defined elsewhere
topics, probs = topic_model.fit_transform(my_documents)
```
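To make step 4 less of a black box, here is a minimal pure-Python sketch of class-based TF-IDF. It assumes the weighting `tf(t, c) * log(1 + A / f(t))`, where `A` is the average number of words per cluster and `f(t)` is the term's frequency across all clusters, and it uses naive whitespace tokenization. The function name `ctfidf_keywords` and the toy documents are illustrative, not part of any library.

```python
import math
from collections import Counter, defaultdict


def ctfidf_keywords(docs, labels, top_n=3):
    """Return the top_n highest-weighted terms per cluster (noise label -1 is skipped)."""
    # Treat each cluster as one long document: a single bag of words per label.
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        if label == -1:
            continue
        class_counts[label].update(doc.lower().split())

    if not class_counts:
        return {}

    # f(t): frequency of each term across all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(class_counts)

    keywords = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        weights = {
            term: (tf / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        # Sort by descending weight; break ties alphabetically for determinism.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        keywords[label] = ranked[:top_n]
    return keywords


docs = [
    "the gpu kernel launch failed",
    "the gpu memory kernel crashed",
    "the invoice payment was late",
    "the payment invoice is overdue",
    "random unrelated text",
]
labels = [0, 0, 1, 1, -1]
keywords = ctfidf_keywords(docs, labels, top_n=2)
print(keywords)
```

Terms that appear in every cluster (here, "the") are down-weighted relative to terms concentrated in one cluster, which is why the sketch surfaces topical words without an explicit stop-word list.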
Comparison of Algorithms
| Algorithm | Pros | Cons | Best for |
|---|---|---|---|
| **K-Means** | Fast, simple. | Must pick $K$; assumes spherical clusters; assigns every point, including outliers. | Well-understood datasets with balanced clusters. |
| **HDBSCAN** | Handles outliers; no $K$ needed. | Computationally intensive for >1M rows. | Noisy, real-world text. |
| **LDA** | Probabilistic; produces interpretable topic-word distributions. | Struggles with short text; needs heavy tuning. | Old-school "bag of words" corpora. |
Handling Outliers (The -1 Label)
A common mistake in clustering is forcing every document into a cluster. HDBSCAN assigns a `-1` label to documents that are too distant from any dense neighborhood.
**Practitioner Strategy:** As a rule of thumb, if more than 30% of your data is labeled `-1`, your embedding model is likely too general for the domain or your `min_cluster_size` is too high. Do not force-cluster outliers; they are often the most valuable signal, pointing either to emerging topics or to garbage data.
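The 30% rule of thumb is easy to automate as a sanity check after each run. The sketch below is one way to do it; `noise_report` is a hypothetical helper, and the labels are made-up values standing in for HDBSCAN output.

```python
from collections import Counter


def noise_report(labels, threshold=0.30):
    """Summarize how much of a clustering run ended up labeled -1."""
    labels = list(labels)
    counts = Counter(labels)
    n_noise = counts.get(-1, 0)
    ratio = n_noise / len(labels) if labels else 0.0
    return {
        "noise_ratio": ratio,
        # Number of real clusters, not counting the -1 noise bucket.
        "n_clusters": len(counts) - (1 if -1 in counts else 0),
        # True when the share of noise exceeds the chosen threshold.
        "needs_review": ratio > threshold,
    }


labels = [0, 0, 1, -1, 1, -1, -1, 2, -1, 0]  # illustrative cluster labels
report = noise_report(labels)
print(report)
```

When `needs_review` is true, the usual first experiments are lowering `min_cluster_size` or trying a more domain-specific embedding model, then re-checking the ratio.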
Evaluation: Beyond Visualizing
Don't just look at a 2D scatter plot. Use quantitative metrics:
- **Coherence Score:** Do the top words in a cluster actually make sense together?
- **Silhouette Score:** How well-separated are the clusters in vector space?
- **Stability:** If you re-run the clustering on a 90% sample, do the same clusters emerge?
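Of the three metrics, stability is the one least likely to come ready-made, so here is a rough sketch of one way to score it: for each cluster from the full run, find its best Jaccard match among the clusters from the subsample run and average the results. The helper names are hypothetical and the labels are toy values. For silhouette, scikit-learn provides `sklearn.metrics.silhouette_score`; filter out the `-1` points before calling it, otherwise noise is scored as if it were a cluster.

```python
from collections import defaultdict


def clusters_as_sets(doc_ids, labels):
    """Group document ids by cluster label, dropping noise (-1)."""
    groups = defaultdict(set)
    for doc_id, label in zip(doc_ids, labels):
        if label != -1:
            groups[label].add(doc_id)
    return list(groups.values())


def stability_score(full_clusters, sample_clusters, sampled_ids):
    """Mean best-match Jaccard of each full-run cluster against the subsample run."""
    sampled_ids = set(sampled_ids)
    scores = []
    for cluster in full_clusters:
        # Only compare on documents that were actually present in the subsample.
        restricted = cluster & sampled_ids
        if not restricted:
            continue
        best = max(
            (len(restricted & other) / len(restricted | other) for other in sample_clusters),
            default=0.0,
        )
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


# Full run over 10 documents.
full_ids = list(range(10))
full_labels = [0, 0, 0, 0, 1, 1, 1, 1, -1, -1]

# Re-run on a 90% sample (document 3 dropped). Label ids differ between runs,
# which is fine: only cluster membership is compared.
sample_ids = [0, 1, 2, 4, 5, 6, 7, 8, 9]
sample_labels = [5, 5, 5, 7, 7, 7, -1, -1, -1]

score = stability_score(
    clusters_as_sets(full_ids, full_labels),
    clusters_as_sets(sample_ids, sample_labels),
    sample_ids,
)
print(round(score, 3))
```

A score near 1.0 means the same groupings reappear; averaging over several random subsamples gives a steadier estimate than a single re-run.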
Advanced Case: Clustering for RAG
Use clustering to build a "Table of Contents" for a RAG system. Instead of searching a flat list of 1 million chunks, the agent can first identify the relevant **Cluster Centroid**, then search only the documents within that cluster. This reduces noise and improves retrieval speed.
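The two-stage lookup can be sketched in a few lines. This is a toy illustration under stated assumptions: vectors are plain Python lists, similarity is cosine, and the names `build_index`, `route_and_search`, and `n_probe` are made up for this example rather than taken from any retrieval library.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(chunks, vectors, labels):
    """Group chunks by cluster label and compute one centroid per cluster."""
    members = {}
    for chunk, vec, label in zip(chunks, vectors, labels):
        if label == -1:
            continue  # noise chunks are not routable here; they need a separate fallback
        members.setdefault(label, []).append((chunk, vec))
    centroids = {
        label: [sum(col) / len(items) for col in zip(*(v for _, v in items))]
        for label, items in members.items()
    }
    return centroids, members


def route_and_search(query_vec, centroids, members, top_k=2, n_probe=1):
    """Stage 1: pick the n_probe nearest centroids. Stage 2: rank only their chunks."""
    nearest = sorted(
        centroids, key=lambda c: cosine(query_vec, centroids[c]), reverse=True
    )[:n_probe]
    candidates = [item for c in nearest for item in members[c]]
    ranked = sorted(candidates, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


chunks = ["gpu setup", "cuda errors", "billing faq", "refund policy"]
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]  # toy 2-D embeddings
labels = [0, 0, 1, 1]

centroids, members = build_index(chunks, vectors, labels)
results = route_and_search([1.0, 0.0], centroids, members)
print(results)
```

Raising `n_probe` above 1 trades some speed for recall, which helps when a query sits near the boundary between two clusters.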
Further Reading
- [EmbeddingsVectorDB](EmbeddingsVectorDB) — Choosing the right model for step 1.
- [KnowledgeGraphCompletion](KnowledgeGraphCompletion) — Turning clusters into structured entities.
- [AnomalyDetectionTechniques](AnomalyDetectionTechniques) — Detecting the -1 "Noise" points.