Multi-query attention (MQA) with a shared KV head: all query heads share a single key/value head, which shrinks the KV cache and the memory bandwidth needed during decoding
Image: Metalicat, CC0, via Wikimedia Commons
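What grouped-query attention (GQA) does
GQA splits the query heads into groups, and each group shares one KV head. It sits between multi-head attention (one KV head per query head) and MQA (one KV head in total), reducing KV-cache size while keeping more than one KV head.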
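The sharing pattern can be sketched in a few lines. This is a minimal illustration, assuming query heads are assigned to KV heads in contiguous groups (a common convention, not something the card itself states). The function names kv_head_index and kv_cache_elements are hypothetical.

```python
def kv_head_index(q_head, num_q_heads, num_kv_heads):
    """Return the KV head that a given query head reads from.

    Assumes contiguous grouping: the first group of query heads uses KV head 0,
    the next group uses KV head 1, and so on.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def kv_cache_elements(num_layers, num_kv_heads, head_dim, seq_len):
    """Number of cached values for one sequence (factor 2 covers K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len


# 8 query heads under the three schemes:
mha = [kv_head_index(h, 8, 8) for h in range(8)]  # one KV head per query head
gqa = [kv_head_index(h, 8, 2) for h in range(8)]  # two groups of four
mqa = [kv_head_index(h, 8, 1) for h in range(8)]  # a single shared KV head
```

With num_kv_heads equal to num_q_heads this reduces to standard multi-head attention, and with num_kv_heads equal to 1 it reduces to MQA. The cache size scales linearly with the number of KV heads.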
Write the multi-head attention formula: MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O
MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
paged attention (vLLM) improves serving throughput
Paged attention (vLLM) improves serving throughput by storing the KV cache in fixed-size blocks (pages) that need not be contiguous in memory. This reduces memory fragmentation and waste, so more requests fit in a batch.
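The paging idea can be illustrated with a toy block allocator. This is a sketch under stated assumptions, not vLLM's API: the class PagedKVCache and its methods are hypothetical, and it only tracks where each token's KV entry would live rather than storing any tensors.

```python
class PagedKVCache:
    """Toy block table: maps each sequence's tokens to physical KV blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (block id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token(s) for s in ["a", "b", "a", "a"]]
table_a = list(cache.block_tables["a"])  # blocks for "a" are not adjacent
cache.free("a")
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, and interleaved requests can share the same pool without reserving a contiguous region per request.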
self-attention: Attention(Q,K,V) = softmax(QK^T/√d_k)V
Attention(Q,K,V) = softmax(QK^T/√d_k)V
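The formula can be written out directly. The following is a minimal pure-Python sketch for small inputs, assuming Q, K, and V are lists of row vectors and that no masking is applied. The function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output row is the weighted average of the value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# With identical keys every value gets equal weight, so the output is the mean of V.
mean_out = attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])
```

Dividing by √d_k keeps the scores from growing with the key dimension, which would otherwise push the softmax toward a one-hot distribution.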
structured pruning removes: entire filters or attention heads, not individual weights
Structured pruning removes entire filters or attention heads, not individual weights
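A small sketch of head-level pruning follows. It assumes each head's weights are given as a list of rows and ranks heads by L2 norm. The norm-based score is an assumption made for illustration (other importance criteria exist), and prune_heads is a hypothetical name.

```python
import math


def prune_heads(head_weights, keep):
    """Keep the `keep` heads with the largest L2 norm and drop the rest whole.

    Returns (indices of kept heads in original order, their weight matrices).
    """
    norms = [math.sqrt(sum(w * w for row in head for w in row))
             for head in head_weights]
    ranked = sorted(range(len(head_weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])
    return kept, [head_weights[i] for i in kept]


heads = [
    [[0.1, 0.1]],  # head 0: small norm, pruned
    [[2.0, 2.0]],  # head 1
    [[1.0, 1.0]],  # head 2
]
kept_idx, kept_weights = prune_heads(heads, keep=2)
```

Whole heads disappear, so the remaining weights stay dense and smaller. Unstructured pruning would instead zero individual entries and leave the shapes unchanged.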
ring attention does: distributes long sequences across multiple devices
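Ring attention distributes long sequences across multiple devices: each device holds one block of the sequence, and key/value blocks are passed around a ring of devices while attention is computed block by block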
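The block rotation can be simulated on one machine. This is a simplified sketch, assuming non-causal attention, one query block per simulated device, and a running (online) softmax to merge partial results. The function names are hypothetical, and real implementations overlap communication with computation, which this loop does not model.

```python
import math


def full_attention(q, K, V):
    """Reference: ordinary softmax attention for a single query vector."""
    s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in K]
    m = max(s)
    e = [math.exp(x - m) for x in s]
    z = sum(e)
    return [sum(w * v[j] for w, v in zip(e, V)) / z for j in range(len(V[0]))]


def ring_attention(Q_blocks, K_blocks, V_blocks):
    """Device i owns Q_blocks[i] and starts out holding K/V block i."""
    n = len(Q_blocks)
    d_k = len(K_blocks[0][0])
    d_v = len(V_blocks[0][0])
    # Per-query running state: (max score so far, normalizer, weighted value sum)
    state = [[(-math.inf, 0.0, [0.0] * d_v) for _ in Qb] for Qb in Q_blocks]
    held = list(range(n))  # held[i] = index of the K/V block on device i
    for _ in range(n):
        for dev in range(n):
            K, V = K_blocks[held[dev]], V_blocks[held[dev]]
            for qi, q in enumerate(Q_blocks[dev]):
                m, z, acc = state[dev][qi]
                scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                          for k in K]
                new_m = max(m, max(scores))
                scale = math.exp(m - new_m)  # rescale old sums to the new max
                z *= scale
                acc = [a * scale for a in acc]
                for s, v in zip(scores, V):
                    w = math.exp(s - new_m)
                    z += w
                    acc = [a + w * vj for a, vj in zip(acc, v)]
                state[dev][qi] = (new_m, z, acc)
        # Each device passes its K/V block to the next device in the ring
        held = [held[(dev - 1) % n] for dev in range(n)]
    return [[[a / z for a in acc] for (_, z, acc) in dev_state]
            for dev_state in state]


Q_blocks = [[[1.0, 0.0]], [[0.0, 1.0]]]
K_blocks = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [1.0, 1.0]]]
V_blocks = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
ring_out = ring_attention(Q_blocks, K_blocks, V_blocks)
K_all = [k for blk in K_blocks for k in blk]
V_all = [v for blk in V_blocks for v in blk]
ref_out = [full_attention(Qb[0], K_all, V_all) for Qb in Q_blocks]
```

After n rotation steps every device has seen every K/V block, so the result matches attention over the full sequence even though no device ever holds all keys and values at once.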