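RoPE (Rotary Position Embedding) advantage: attention scores depend only on relative position, which supports extending context beyond the training length (in practice usually combined with frequency scaling or interpolation)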
Image: Glabb, CC BY-SA 3.0, via Wikimedia Commons
What rotary position embeddings (RoPE) do
RoPE encodes relative position by applying position-dependent rotation matrices to the query and key vectors
RoPE encodes position: multiply Q and K at position m by a rotation matrix R(mθ_i), which rotates each 2D feature pair i by the angle mθ_i
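A minimal sketch of the rotation described above, in plain Python. The function names (rope_rotate, dot), the example vectors, and the frequency base of 10000 are assumptions chosen for illustration, not part of the source. The sketch rotates each consecutive pair of features by an angle proportional to the position and checks that the query-key dot product depends only on the offset between the two positions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each consecutive feature pair of `vec` by a position-dependent angle."""
    d = len(vec)
    assert d % 2 == 0, "feature dimension must be even"
    out = []
    for i in range(d // 2):
        # Pair i uses frequency base^(-2i/d); the angle grows linearly with position.
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Both pairs of positions are 3 apart, so the two scores should match.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

Because rotations preserve vector length, only the relative angle between the rotated query and key changes the score, which is why the absolute positions drop out.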
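ALiBi extrapolates to longer sequences better than learned position embeddings
ALiBi adds a penalty to each attention score that grows linearly with the query-key distance, instead of adding position embeddings to the input; with no fixed-size position table, it handles variable-length sequences more gracefully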
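The linear bias can be sketched as follows. The helper names (alibi_slopes, alibi_bias) are hypothetical, and the slope schedule assumes the geometric sequence commonly used when the number of heads is a power of two; other head counts need a different schedule. The sketch builds a causal bias matrix that would be added to the attention scores before the softmax.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: 2^(-8/n), 2^(-16/n), ... (assumes num_heads is a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias: -slope * distance for past keys, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias_short = alibi_bias(4, slopes[0])
bias_long = alibi_bias(6, slopes[0])
```

The bias for a given query-key pair does not depend on the total sequence length, so the same rule applies unchanged to sequences longer than any seen in training.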
What weight tying does in language models: shares the input embedding and output projection matrices
Tying reduces the number of parameters by sharing embedding and output projection matrices
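A toy sketch of the sharing, in plain Python without any ML framework. The class TinyTiedLM and the function param_count are hypothetical names used only for illustration. The output projection reuses the very same matrix object as the embedding table, so only one vocab_size x d_model matrix is stored.

```python
import random

class TinyTiedLM:
    """Toy model holding one matrix used for both embedding lookup and output logits."""

    def __init__(self, vocab_size, d_model, seed=0):
        rng = random.Random(seed)
        self.embedding = [
            [rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)
        ]
        # Tying: the output projection is the same object, not a copy.
        self.output_weight = self.embedding

    def embed(self, token_id):
        return self.embedding[token_id]

    def logits(self, hidden):
        # One logit per vocabulary entry: dot product of its row with the hidden state.
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.output_weight]

def param_count(vocab_size, d_model, tied):
    """Parameters in the embedding plus output projection only."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * d_model

model = TinyTiedLM(vocab_size=10, d_model=4)
scores = model.logits(model.embed(3))
saved = param_count(50000, 512, tied=False) - param_count(50000, 512, tied=True)
```

With an assumed vocabulary of 50,000 and model width of 512, tying saves 25.6 million parameters.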
What loop unrolling does: trades code size for reduced loop overhead
Loop unrolling replicates the loop body so that each pass handles several iterations, which reduces counter and branch overhead at the cost of larger code
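The transformation can be illustrated with a hand-unrolled sum. The function names are hypothetical, and this is only a sketch of the shape of the rewrite: in practice a compiler for a language such as C applies it automatically, and unrolling by hand in Python is not expected to give the same benefit.

```python
def sum_rolled(values):
    """One loop iteration per element: one counter update and branch per add."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total

def sum_unrolled_by_4(values):
    """Four adds per loop pass, followed by a cleanup loop for the leftovers."""
    total = 0
    n = len(values)
    limit = n - n % 4  # largest multiple of 4 not exceeding n
    for i in range(0, limit, 4):
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
    # Handle the remaining 0 to 3 elements.
    for i in range(limit, n):
        total += values[i]
    return total
```

The unrolled version is longer, which is the code-size cost, but it runs the loop control logic roughly a quarter as many times.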
Vanishing gradient problem
Residual connections help by giving the gradient an identity path through the skip connection, so it can reach earlier layers without being shrunk by every layer in between
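A small numeric sketch of the effect, using a chain of scalar layers of the form tanh(w * x). The function name chain_gradient, the weight 0.5, and the depth of 30 are assumptions picked to make the contrast visible. Without a skip connection the gradient is a product of factors below 0.5; with one, each factor becomes 1 plus the local derivative, so it never drops below 1 in this setup.

```python
import math

def chain_gradient(x, weight, depth, residual):
    """Derivative of the final output with respect to the input of a scalar layer chain."""
    grad = 1.0
    for _ in range(depth):
        out = math.tanh(weight * x)
        local = weight * (1.0 - out * out)  # derivative of tanh(weight * x) w.r.t. x
        if residual:
            x = x + out          # skip connection: y = x + f(x)
            grad *= 1.0 + local  # identity path contributes the 1
        else:
            x = out              # plain stack: y = f(x)
            grad *= local
    return grad

plain = chain_gradient(1.0, 0.5, 30, residual=False)
skip = chain_gradient(1.0, 0.5, 30, residual=True)
```

In this sketch the plain chain's gradient falls below 1e-8 after 30 layers, while the residual chain's gradient stays at or above 1.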