ALiBi encodes relative position by adding a distance-proportional linear bias to attention scores instead of using fixed-size position embeddings, which helps it handle variable-length sequences
Image: Cmichel67, CC BY-SA 4.0, via Wikimedia Commons
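A minimal sketch of the ALiBi idea, assuming causal attention and a head count that is a power of two. The function names `alibi_slopes` and `alibi_bias` are illustrative, not taken from any library. The bias is meant to be added to the raw attention scores before the softmax.

```python
import math

def alibi_slopes(num_heads):
    """Per-head slopes forming a geometric sequence: 2^(-8/n), 2^(-16/n), ...
    Assumes num_heads is a power of two."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias matrix for one head: -slope * (i - j) for key j <= query i.
    Future positions (j > i) get -inf so a causal softmax ignores them."""
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

if __name__ == "__main__":
    print(alibi_slopes(8))
    for row in alibi_bias(4, 0.5):
        print(row)
```

Because the bias depends only on the distance i - j, the same rule applies to any sequence length, with no embedding table to outgrow.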
Why cosine similarity is preferred over dot product for normalized embeddings
Cosine similarity measures orientation, not magnitude, so vector length does not affect the score; for unit-normalized embeddings it equals the dot product
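A small sketch of this behaviour in plain Python. The helper names `dot`, `normalize`, and `cosine_similarity` are illustrative. It shows that rescaling a vector changes the dot product but not the cosine score, and that the two agree once vectors are scaled to unit length.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit Euclidean length (assumes v is not the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumes neither is zero)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

if __name__ == "__main__":
    a, b = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
    print(dot(a, b))                        # grows with vector length
    print(cosine_similarity(a, b))          # unaffected by vector length
    print(dot(normalize(a), normalize(b)))  # matches the cosine score
```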
What rotary position embeddings (RoPE) do
RoPE encodes relative position by rotating query and key vectors through position-dependent angles, so their dot product depends only on the distance between positions
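A rough sketch of the rotation, assuming the interleaved-pair convention (dimensions 2k and 2k+1 are rotated together) and the commonly used base of 10000. Implementations differ on pairing, so treat the layout here as one possible choice. The function name `rope` is illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[2k], vec[2k+1]) by position * base^(-2k/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector dimension must be even")
    out = []
    for k in range(d // 2):
        theta = position * base ** (-2.0 * k / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

if __name__ == "__main__":
    q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
    # Same offset (2) at different absolute positions gives the same score.
    print(dot(rope(q, 5), rope(k, 3)))
    print(dot(rope(q, 12), rope(k, 10)))
```

The two printed scores agree up to floating-point error because rotating both vectors by a shared extra angle leaves their dot product unchanged.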
What weight tying does in language models: shares embedding and output projection matrices
Weight tying reduces the number of parameters by using one matrix for both the input embedding and the output projection
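A toy sketch of the sharing, with a hypothetical `TiedLM` class and placeholder weights rather than a trained model. One matrix of shape vocab_size x dim serves both as the embedding lookup table and as the output projection, so the model stores vocab_size * dim parameters for these two roles instead of twice that.

```python
class TiedLM:
    """Toy model pieces showing one matrix used for both roles."""

    def __init__(self, vocab_size, dim):
        # Placeholder values; a real model would learn these.
        self.embedding = [
            [0.01 * (t + 1) * (k + 1) for k in range(dim)]
            for t in range(vocab_size)
        ]

    def embed(self, token_id):
        """Input side: look up the row for a token."""
        return self.embedding[token_id]

    def logits(self, hidden):
        """Output side: score every token against the hidden state
        using the same rows as the embedding table."""
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.embedding]

    def num_parameters(self):
        return len(self.embedding) * len(self.embedding[0])

if __name__ == "__main__":
    model = TiedLM(vocab_size=5, dim=3)
    print(model.num_parameters())          # one matrix, not two
    print(model.logits(model.embed(2)))
```

Any update to a row changes both how that token is read and how it is scored, which is the coupling that tying introduces.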
What 768-dim BERT embeddings capture: bidirectional context from masked language modeling
768-dim BERT embeddings capture bidirectional context from masked language modeling
List of algorithms
Cosine similarity measures the angle between vectors, not their magnitude
Proximal gradient methods for learning
Proximal gradient descent efficiently handles non-differentiable L1 regularization by combining gradient descent with a proximity operator
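A minimal sketch of this combination for the lasso objective 0.5 * ||Ax - b||^2 + lam * ||x||_1, in plain Python. The function names `soft_threshold` and `ista` are illustrative, and the fixed step size is an assumption: it should be at most 1 / L, where L is the largest eigenvalue of A^T A.

```python
def soft_threshold(v, t):
    """Proximity operator of t * |.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def ista(A, b, lam, step, iters=500):
    """Proximal gradient descent: a gradient step on the smooth
    least-squares term, then soft-thresholding for the L1 term."""
    n = len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Residual r = A x - b
        r = [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
        # Gradient of the smooth term: A^T r
        grad = [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(n)]
        x = [soft_threshold(xj - step * g, step * lam) for xj, g in zip(x, grad)]
    return x

if __name__ == "__main__":
    A = [[1.0, 0.0], [0.0, 1.0]]
    b = [3.0, 0.5]
    print(ista(A, b, lam=1.0, step=1.0))
```

With the identity matrix the solution is just the soft-thresholded `b`, so the small second coefficient is set exactly to zero. That sparsity is what the L1 term is there to produce.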