Clustering Algorithms
Clustering finds groups of similar items in unlabelled data. The algorithms make different assumptions about what "groups" look like; picking the wrong one produces strange results. This page is the working set, with the assumptions and failure modes of each.
K-means
The classic. Pick K; iteratively assign points to the nearest centroid; recompute centroids.
```
init centroids randomly
repeat:
assign each point to nearest centroid
update each centroid to mean of its points
until convergence
```
Assumes:
- Spherical clusters of roughly equal size.
- Clusters defined by Euclidean distance from a centroid.
- You know K in advance.
Strengths: fast (O(NKI)); simple; understood; widely-implemented.
Weaknesses:
- Bad on non-spherical clusters (elongated, U-shaped, nested).
- Bad on clusters of very different sizes.
- Sensitive to outliers (centroids shift toward them).
- K-means++ initialisation is critical; random init produces local minima.
Use K-means when: clusters are roughly spherical and you have a reasonable K.
Don't use when: clusters have variable density, shape, or size; outliers matter; you can't choose K.
K-medoids (PAM)
Like K-means but cluster centres are actual data points (medoids), not centroids. More robust to outliers.
Slower than K-means. Useful when your distance metric isn't Euclidean (you can use any distance) or when outliers are a problem.
Hierarchical clustering (agglomerative)
Start with each point as its own cluster; iteratively merge nearest clusters. Produces a dendrogram (tree).
```
each point is a cluster
repeat:
find the two nearest clusters
merge them
until one cluster
```
The "nearest" depends on linkage:
- **Single** — minimum distance between clusters. Produces stringy clusters.
- **Complete** — maximum distance. Compact clusters.
- **Average** — mean distance. Compromise.
- **Ward** — minimises within-cluster variance. Often produces best clusters; computationally similar to K-means.
Strengths: no need to choose K up front (cut the dendrogram at any level); produces a hierarchy that's interpretable.
Weaknesses: O(N²) memory; O(N³) time naively. Doesn't scale beyond ~10k-100k points.
Use when: you want hierarchy; data isn't huge.
DBSCAN
Density-based. A cluster is a region where points have many neighbours within distance `eps`. Points in low-density regions are noise.
Two parameters: `eps` (distance threshold) and `min_samples` (minimum density).
Strengths:
- Finds non-spherical clusters.
- Handles noise / outliers explicitly.
- No need to choose K.
Weaknesses:
- Sensitive to `eps`. Wrong choice produces all-noise or one-huge-cluster.
- Can't handle clusters of very different densities.
- O(N²) without indexing; with kd-tree / ball tree, O(N log N).
Use when: clusters have similar density; you don't know K; outliers are real.
HDBSCAN
Hierarchical DBSCAN. Builds a hierarchy then extracts clusters at varying density. Solves DBSCAN's "single eps" problem.
Strengths:
- Handles variable density.
- Doesn't need `eps`.
- Robust noise detection.
- Stable across hyperparameter ranges.
Weaknesses:
- More complex than DBSCAN.
- Slower than K-means but much smarter on real data.
Use when: real-world clustering, especially on embeddings. The default for document clustering, customer segmentation, anomaly detection. See [DocumentClusteringApproaches]().
Gaussian Mixture Models (GMM)
Assume data is generated from K Gaussian distributions with different means and covariances; learn the parameters.
Strengths:
- Soft clustering — each point has probabilities for each cluster.
- Handles elliptical clusters (not just spherical).
- Probabilistic framing — useful for uncertainty quantification.
Weaknesses:
- Need to choose K.
- Slower than K-means.
- Failure on non-Gaussian data (clusters with funny shapes).
Use when: clusters are roughly elliptical; soft assignment matters; you can choose K with cross-validation or BIC.
Spectral clustering
Build a similarity graph; compute eigenvectors of its Laplacian; cluster in eigenvector space.
Strengths:
- Handles arbitrary cluster shapes (gracefully clusters concentric circles, half-moons, etc.).
- Theoretically well-founded.
Weaknesses:
- O(N³) eigendecomposition. Doesn't scale.
- Sensitive to choice of similarity graph and number of components.
Use when: data has clear graph structure; small N; classical methods fail on the geometry.
Mean-shift
Like K-means but each point shifts toward the local density mode. Number of clusters emerges from the data.
Strengths: no K; finds modes of any shape.
Weaknesses: O(N²); bandwidth parameter is sensitive; rarely the best choice.
OPTICS
Density-based like DBSCAN; produces a reachability plot from which you can extract clusters at varying density.
Less popular than HDBSCAN now; HDBSCAN is the modern preferred choice.
BIRCH
Streaming-friendly clustering. Builds a CF tree summary; clusters from that.
Use when: data is too large to hold in memory.
Choosing an algorithm
A decision flow:
1. **Embedding-based clustering, real-world corpora?** UMAP + HDBSCAN.
2. **Spherical clusters, large data, fast result?** K-means.
3. **Want hierarchy?** Agglomerative.
4. **Want soft assignment / probabilistic?** GMM.
5. **Variable density, shape, outliers?** HDBSCAN.
6. **Strange geometric shapes (rings, half-moons)?** Spectral.
7. **Streaming data?** BIRCH or online K-means.
For most production work in 2026: K-means for fast simple cases, HDBSCAN for real data complexity. The rest are specialised.
How to choose K
For algorithms requiring K:
- **Elbow method** — plot within-cluster variance vs K; pick the "elbow." Subjective.
- **Silhouette score** — measures cluster cohesion vs separation. Pick K that maximises.
- **Gap statistic** — compares to expected variance under null reference. More principled.
- **BIC / AIC** for GMM.
For clustering you can manually inspect: try a few K, look at the clusters, pick what looks right. Often the most useful approach.
Distance / similarity matters
The choice of distance metric affects cluster shapes:
- **Euclidean (L2)** — default; sensitive to scale.
- **Cosine** — angle between vectors; standard for embeddings.
- **Manhattan (L1)** — robust to outliers in individual dimensions.
- **Hamming** — for binary vectors.
- **Custom** — whatever your domain's notion of similarity is.
For high-dimensional data (especially embeddings), cosine is usually the right pick. For low-dimensional numeric features, Euclidean.
Always normalise / standardise features for distance-based methods. Otherwise the dimension with the largest scale dominates.
Failure modes
- **Curse of dimensionality.** All distances become similar in high dimensions; clustering loses signal. Reduce dimensionality first (UMAP, PCA).
- **Wrong assumptions.** Using K-means on non-spherical data; getting weird clusters; blaming the data when the algorithm is wrong.
- **Outliers eating clusters.** K-means is famously sensitive; use HDBSCAN or pre-remove outliers.
- **Ignoring noise.** Some points genuinely don't fit anywhere; HDBSCAN labels them; don't force them.
- **Not validating.** Numerical metrics (silhouette) only go so far; manual inspection of clusters is the real test.
Evaluation
Internal validation (no labels):
- Silhouette coefficient.
- Davies-Bouldin.
- Calinski-Harabasz.
External validation (with ground-truth labels):
- Adjusted Rand Index.
- Normalised Mutual Information.
- Purity.
For production: have a small labelled set; compute external metrics; trust them more than internal ones.
A pragmatic recipe
For a new clustering task:
1. **Visualise** the data first. Plot in 2D (PCA or UMAP). Eyeball the structure.
2. **Pick algorithm based on what you see**. Spherical blobs → K-means; complex shapes → HDBSCAN.
3. **Try with default parameters**. Inspect results manually.
4. **Tune incrementally**. One parameter at a time.
5. **Validate against domain experts** or labelled samples.
Most clustering tasks don't need fancy algorithms. They need sensible preprocessing, a reasonable algorithm, and human inspection of the output.
Further reading
- [DocumentClusteringApproaches]() — applied to text
- [LinearAlgebra]() — distance / projection math
- [BayesianReasoning]() — GMM and probabilistic clustering