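ALiBi (Attention with Linear Biases) encodes relative position by adding a distance-proportional penalty to attention scores instead of using fixed-size position embeddings, which helps it handle sequences longer than those seen in training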


ALiBi allows length extrapolation better than learned position embeddings
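
A minimal NumPy sketch of the linear bias may make this concrete. It assumes a causal (left-to-right) setting and the geometric slope schedule described in the ALiBi paper for power-of-two head counts; the function names `alibi_slopes` and `alibi_bias` are illustrative, not from any library.

```python
import numpy as np

def alibi_slopes(num_heads):
    # Assumed schedule: 2^(-8/n), 2^(-16/n), ... (for 8 heads: 1/2, 1/4, ..., 1/256)
    return np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])

def alibi_bias(seq_len, num_heads):
    pos = np.arange(seq_len)
    # distance[i, j] = j - i: zero on the diagonal, negative for earlier keys
    distance = pos[None, :] - pos[:, None]
    # One slope per head, broadcast over the (query, key) grid
    bias = alibi_slopes(num_heads)[:, None, None] * distance[None, :, :]
    # Causal mask: keys after the query get -inf so softmax ignores them
    causal = np.where(distance > 0, -np.inf, 0.0)
    return bias + causal[None, :, :]

bias = alibi_bias(seq_len=4, num_heads=8)  # shape (heads, queries, keys)
```

The bias is added to the raw attention scores before the softmax. Because it depends only on the distance between query and key, the same rule applies unchanged to any sequence length, which is why no position table has to be learned or resized.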


Related concepts

Why cosine similarity is preferred over the dot product for comparing embeddings

Cosine similarity measures orientation, not magnitude, so vector length does not affect the comparison; for unit-normalized embeddings it equals the dot product
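
A small NumPy sketch of the difference, using made-up example vectors (`a`, `b`, `c` and the helper `cosine_similarity` are illustrative names only):

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                    # same direction as a, ten times longer
c = np.array([3.0, 2.0, 1.0])   # different direction

cos_ab = cosine_similarity(a, b)   # unaffected by the scaling of b
dot_ab = float(np.dot(a, b))       # grows with the length of b

# After normalizing to unit length, the dot product is the cosine similarity
a_hat = a / np.linalg.norm(a)
c_hat = c / np.linalg.norm(c)
dot_unit = float(np.dot(a_hat, c_hat))
```

Scaling `b` changes the raw dot product but not the cosine, which is the sense in which cosine similarity ignores magnitude.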

What rotary position embeddings (RoPE) do

RoPE rotates query and key vectors by position-dependent angles, so their dot product depends only on the relative offset between positions
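
The rotation can be sketched in NumPy as follows. This assumes the common convention of pairing adjacent feature dimensions and a frequency base of 10000; the function name `rope` and the helper `score` are illustrative, and real implementations differ in how they pair dimensions.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # x: (seq_len, dim) with even dim; positions: (seq_len,) integer positions
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per 2-D pair: theta_k = base^(-2k/dim)
    freqs = base ** (-np.arange(half) / half)
    angles = positions[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                # the two halves of each pair
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # standard 2-D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

def score(m, n):
    # Attention score between a query at position m and a key at position n
    return (rope(q, np.array([m])) @ rope(k, np.array([n])).T)[0, 0]
```

Under these assumptions, `score(5, 2)` and `score(13, 10)` agree up to floating-point error, since both pairs are three positions apart.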

What weight tying does in language models: shares embedding and output projection matrices

Weight tying reduces the parameter count by using one matrix as both the input embedding table and the output projection
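
A toy NumPy sketch of the idea, with a hypothetical vocabulary size and model width (`E`, `embed`, and `logits` are illustrative names, not a library API):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64   # assumed sizes for illustration

# A single matrix serves as both the embedding table and the output projection
E = rng.normal(scale=0.02, size=(vocab_size, d_model))

def embed(token_ids):
    # Input side: look up one row per token
    return E[token_ids]            # (seq_len, d_model)

def logits(hidden):
    # Output side: reuse the same matrix, transposed, to score every vocabulary item
    return hidden @ E.T            # (seq_len, vocab_size)

tied_params = E.size                       # one shared matrix
untied_params = 2 * vocab_size * d_model   # separate input and output matrices
```

With these sizes the tied version stores 64,000 parameters for the two roles instead of 128,000.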

What 768-dim BERT embeddings capture: bidirectional context from masked language modeling

The 768-dimensional token embeddings of BERT-base capture bidirectional context learned through masked language modeling

List of algorithms

Cosine similarity measures the angle between vectors, not their magnitude

Proximal gradient methods for learning

Proximal gradient descent efficiently handles non-differentiable L1 regularization by alternating a gradient step on the smooth loss with a proximal operator, which for the L1 penalty is soft-thresholding
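
As a sketch of how this looks in practice, here is a basic proximal gradient loop (often called ISTA) for the lasso objective 0.5*||Ax - b||^2 + lam*||x||_1. The function names, the fixed step size 1/L, and the iteration count are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t*||.||_1: shrink each entry toward zero by t
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, num_iters=500):
    # Step size 1/L, where L = (largest singular value of A)^2 bounds the gradient's Lipschitz constant
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)                          # gradient of the smooth least-squares term
        x = soft_threshold(x - step * grad, step * lam)   # proximal step for the L1 term
    return x
```

The soft-thresholding step sets small coefficients exactly to zero, which is how the method produces sparse solutions without ever differentiating the L1 term.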
