Image: Jkatz (WMF), CC BY-SA 4.0, via Wikimedia Commons

the curse of dimensionality makes nearest neighbor search unreliable

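As dimensionality grows, a fixed number of points becomes increasingly sparse and the distances from a query to its nearest and farthest points converge, so nearest neighbors become less distinct and search less reliable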
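The effect can be seen in a small experiment. The sketch below is illustrative only: it assumes points drawn uniformly from the unit cube, and the function name `relative_contrast` and its parameters are made up for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one random
    query to n_points random points, all uniform in the unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast should shrink as the dimension grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

When the contrast approaches zero, the "nearest" neighbor is barely closer than any other point, which is why ranking by distance becomes unstable under this data model.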

Related concepts

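cosine similarity often works better than Euclidean distance in high dimensions

Cosine similarity measures orientation, not magnitude, so it ignores differences in vector length that can dominate Euclidean distance in high-dimensional spaces; for unit-normalized vectors the two produce the same neighbor ranking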
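A small sketch of the orientation-versus-magnitude point, using made-up vectors (`doc`, `longer_doc`, and `other` are hypothetical, for example term counts of documents). Scaling a vector leaves its cosine similarity unchanged but moves it far away in Euclidean terms.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


doc = [3.0, 0.0, 1.0, 2.0]
longer_doc = [10 * x for x in doc]  # same direction, 10x the magnitude
other = [3.0, 1.0, 1.0, 1.0]        # different direction, similar magnitude

cos_same = cosine_similarity(doc, longer_doc)
cos_other = cosine_similarity(doc, other)
euclid_same = math.dist(doc, longer_doc)
euclid_other = math.dist(doc, other)

if __name__ == "__main__":
    print("cosine:   ", cos_same, cos_other)
    print("euclidean:", euclid_same, euclid_other)
```

Here Euclidean distance ranks `other` as closer to `doc` than `longer_doc` is, while cosine similarity treats `longer_doc` as pointing in the same direction. Which behavior is preferable depends on whether magnitude carries meaning in the data.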

random projection to O(log n/ε²) dimensions preserves pairwise distances within 1±ε

By the Johnson-Lindenstrauss lemma, a random projection of n points into O(log n/ε²) dimensions preserves all pairwise distances within a factor of 1±ε with high probability
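A minimal sketch of such a projection, under stated assumptions: it uses a dense Gaussian matrix scaled by 1/√k (one of several valid constructions), and picks k from one commonly quoted form of the bound, k ≥ 4 ln(n) / (ε²/2 − ε³/3). The helper names and the synthetic data are invented for illustration.

```python
import math
import random


def jl_min_dim(n_points, eps):
    """Target dimension from one common form of the JL bound."""
    return math.ceil(4 * math.log(n_points) / (eps ** 2 / 2 - eps ** 3 / 3))


def random_projection(points, k, rng):
    """Project points to k dimensions with a Gaussian matrix scaled by 1/sqrt(k)."""
    d = len(points[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix] for p in points]


def max_distortion(points, projected):
    """Largest relative change in any pairwise Euclidean distance."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            before = math.dist(points[i], points[j])
            after = math.dist(projected[i], projected[j])
            worst = max(worst, abs(after / before - 1.0))
    return worst


rng = random.Random(0)
n, d, eps = 20, 300, 0.5
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
k = jl_min_dim(n, eps)  # depends on n and eps only, not on d
projected = random_projection(points, k, rng)
distortion = max_distortion(points, projected)

if __name__ == "__main__":
    print("target dimension:", k)
    print("max relative distortion:", round(distortion, 3))
```

Note that the target dimension depends on the number of points and on ε, not on the original dimension, so the savings grow as the input dimension increases.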

the Johnson-Lindenstrauss lemma says

Random projection reduces dimensionality while approximately preserving pairwise distances

Locality-sensitive hashing

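Locality-sensitive hashing (LSH) hashes similar items into the same buckets with high probability, so candidate neighbors can be found without comparing a query against every item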
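One well-known LSH family for cosine similarity uses random hyperplanes: each hash bit records which side of a random hyperplane a vector falls on. The sketch below is a simplified illustration with hypothetical names and synthetic vectors, not a production index.

```python
import random


def make_hyperplanes(n_bits, dim, rng):
    """One random Gaussian normal vector per hash bit."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side."""
    return tuple(
        int(sum(h * x for h, x in zip(plane, vec)) >= 0) for plane in hyperplanes
    )


def hamming(a, b):
    """Number of positions at which two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)
dim, n_bits = 50, 64
planes = make_hyperplanes(n_bits, dim, rng)

base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
near = [x + 0.05 * rng.gauss(0.0, 1.0) for x in base]  # small perturbation of base
far = [rng.gauss(0.0, 1.0) for _ in range(dim)]        # unrelated vector

sig_base = signature(base, planes)
sig_near = signature(near, planes)
sig_far = signature(far, planes)

if __name__ == "__main__":
    print("bits differing (near):", hamming(sig_base, sig_near))
    print("bits differing (far): ", hamming(sig_base, sig_far))
```

Vectors separated by a small angle agree on most bits, so using a short prefix of the signature as the bucket key makes them likely to collide. Practical systems typically combine several short signatures across multiple hash tables to trade recall against bucket size.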

Manifold hypothesis

Real-world high-dimensional data tends to lie on or near lower-dimensional manifolds

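batch size affects generalization: larger batches tend to find sharper minima

Larger batch sizes give more accurate gradient estimates but tend to converge to sharper minima, which are associated with worse generalization; the gradient noise of smaller batches is thought to favor flatter minima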
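The link between batch size and the accuracy of the gradient estimate can be checked on a toy problem. This sketch assumes a hypothetical one-parameter least-squares model with synthetic data, and measures how much the mini-batch gradient fluctuates across random batches; it says nothing by itself about the sharpness of the minima reached.

```python
import random
import statistics


def minibatch_gradient_std(batch_size, n_trials=500, seed=0):
    """Standard deviation of the mini-batch gradient estimate at a fixed w."""
    rng = random.Random(seed)
    # Hypothetical toy data: y = 3x + noise, loss = mean((w*x - y)^2), at w = 0.
    xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    ys = [3.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    w = 0.0
    per_example_grads = [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]
    estimates = []
    for _ in range(n_trials):
        batch = rng.sample(per_example_grads, batch_size)
        estimates.append(sum(batch) / batch_size)
    return statistics.stdev(estimates)


if __name__ == "__main__":
    for batch_size in (4, 16, 64, 256):
        print(batch_size, round(minibatch_gradient_std(batch_size), 3))
```

The spread of the estimate shrinks roughly as one over the square root of the batch size, so small batches inject considerably more noise into each update.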
