Matryoshka embeddings: Trained to be useful at multiple truncated dimensions
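A minimal sketch of how a Matryoshka-style embedding is typically used at a smaller size: keep the leading coordinates and rescale to unit length. The vector `full` and the helper `truncate_and_normalize` are hypothetical names for illustration; a real vector would come from a model trained with a Matryoshka objective.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates and rescale the prefix to unit length."""
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("prefix has zero norm")
    return [x / norm for x in prefix]

# Hypothetical 8-dimensional embedding (made-up values, not model output).
full = [0.5, -0.25, 0.75, 0.1, -0.3, 0.2, 0.05, -0.6]

for dim in (2, 4, 8):
    short = truncate_and_normalize(full, dim)
    print(dim, [round(x, 3) for x in short])
```

Truncating an ordinary embedding this way usually discards information unevenly; the point of Matryoshka training is that the leading coordinates are optimized to work on their own.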
what weight tying does in language models: shares the embedding and output projection matrices
Weight tying reduces the parameter count by using a single matrix as both the input embedding and the output projection
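A minimal sketch of the idea in plain Python, assuming a toy model with no layers between embedding and output. The class `TiedLM` and its methods are hypothetical names; a real implementation would live in a deep learning framework.

```python
import random

class TiedLM:
    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        # One vocab_size x dim matrix serves as both the input embedding
        # table and the output projection.
        self.weight = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(vocab_size)
        ]

    def embed(self, token_id):
        """Input side: look up the row for a token."""
        return self.weight[token_id]

    def logits(self, hidden):
        """Output side: score every token with the same matrix."""
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.weight]

    def num_parameters(self):
        return len(self.weight) * len(self.weight[0])

model = TiedLM(vocab_size=1000, dim=64)
tied = model.num_parameters()
untied = 2 * tied  # separate embedding and projection matrices of equal shape
print(tied, untied)  # 64000 128000
```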
random projection to O(log n/ε²) dimensions preserves pairwise distances within 1±ε
Random projection reduces dimensionality while preserving pairwise distances within a factor of 1±ε, as guaranteed by the Johnson-Lindenstrauss lemma
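A small empirical sketch, assuming a Gaussian projection matrix with entries drawn from N(0, 1/k). The sizes `d`, `k`, `n` and all function names are illustrative choices, not values prescribed by the lemma.

```python
import math
import random

def random_projection_matrix(d, k, rng):
    """k x d matrix with N(0, 1/k) entries; squared norms are preserved in expectation."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]

def project(matrix, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

rng = random.Random(0)
d, k, n = 500, 200, 10  # original dim, projected dim, number of points
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
R = random_projection_matrix(d, k, rng)
projected = [project(R, p) for p in points]

# Ratio of projected distance to original distance for every pair of points.
ratios = [
    distance(projected[i], projected[j]) / distance(points[i], points[j])
    for i in range(n)
    for j in range(i + 1, n)
]
print(min(ratios), max(ratios))  # both should be close to 1
```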
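ALiBi extrapolates to longer sequences better than learned position embeddings
ALiBi adds a linear penalty proportional to query-key distance directly to the attention scores instead of using position embeddings, so it is not tied to a fixed maximum length and handles sequences longer than those seen in training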
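A sketch of the ALiBi bias for one attention head, assuming a causal model and a power-of-two head count. ALiBi adds a penalty to each attention score that grows linearly with the distance between query and key. The function names are hypothetical, and the `-inf` entries are the usual causal mask folded in for convenience.

```python
def alibi_slopes(num_heads):
    """Geometric sequence of per-head slopes for a power-of-two head count."""
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to attention scores: slope * (key_pos - query_pos) for past keys.

    Future keys get -inf (causal mask). Nothing here depends on a maximum
    length, so the same rule applies to any seq_len.
    """
    return [
        [slope * (j - i) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
for row in bias:
    print(row)
```

Because the bias is computed from positions on the fly, a longer sequence at inference time simply produces a larger matrix with the same pattern.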
what the Johnson-Lindenstrauss lemma says
Any set of n points can be mapped by random projection into O(log n/ε²) dimensions while preserving all pairwise distances within a factor of 1±ε
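For reference, one standard way to state the lemma (written here for squared distances; the symbols X, f, u, v are notation chosen for this statement):

```latex
\text{For any } 0 < \varepsilon < 1 \text{ and any set } X \subset \mathbb{R}^d
\text{ of } n \text{ points, there is a linear map }
f : \mathbb{R}^d \to \mathbb{R}^k \text{ with } k = O(\varepsilon^{-2} \log n)
\text{ such that for all } u, v \in X:
\quad
(1 - \varepsilon)\,\lVert u - v \rVert^2
\;\le\; \lVert f(u) - f(v) \rVert^2
\;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^2 .
```

The target dimension k depends only on n and ε, not on the original dimension d.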
cosine similarity is preferred over dot product for normalized embeddings
Cosine similarity measures orientation, not magnitude; for unit-normalized embeddings it equals the dot product, so scores are not skewed by vector length
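A short sketch showing that cosine similarity ignores vector length and coincides with the dot product once both vectors are scaled to unit length. The vectors `a` and `b` are made-up examples.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [4.0, 3.0]
a_scaled = [10.0 * x for x in a]  # same direction, 10x the magnitude

print(dot(a, b), dot(a_scaled, b))        # dot product changes with magnitude
print(cosine(a, b), cosine(a_scaled, b))  # cosine does not
print(dot(normalize(a), normalize(b)))    # matches the cosine value
```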
batch size affects generalization: larger batches find sharper minima
Larger batch sizes give more accurate gradient estimates but tend to converge to sharper minima, which often generalize worse than the flatter minima reached with smaller, noisier batches
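A toy sketch of the gradient-noise part of this claim only, assuming a one-parameter loss 0.5·(w − x)² averaged over a mini-batch. It shows that mini-batch gradient variance shrinks as the batch grows; it does not measure sharpness or generalization. All names and sizes are illustrative.

```python
import random
import statistics

def minibatch_gradient(w, data, batch_size, rng):
    """Gradient of the mean of 0.5 * (w - x)^2 over a random mini-batch."""
    batch = rng.sample(data, batch_size)
    return sum(w - x for x in batch) / batch_size

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # synthetic dataset
w = 1.0

def gradient_variance(batch_size, trials=500):
    """Sample variance of the mini-batch gradient across repeated draws."""
    grads = [minibatch_gradient(w, data, batch_size, rng) for _ in range(trials)]
    return statistics.variance(grads)

var_small = gradient_variance(8)
var_large = gradient_variance(128)
print(var_small, var_large)  # variance drops roughly in proportion to 1 / batch_size
```

The noise in small-batch gradients is one commonly cited reason small batches tend to end up in flatter regions of the loss surface.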