Multi-query attention (MQA) with shared KV head: Q heads share a single KV head for efficient parameter usage

Image: Metalicat, CC0, via Wikimedia Commons

multi-query attention (MQA) is

Multi-query attention (MQA) is an attention variant in which all query heads share a single key/value (KV) head, which shrinks the KV cache and reduces memory traffic during decoding

Related concepts

grouped query attention (GQA) does

GQA splits the Q heads into groups, and each group shares one KV head; it sits between multi-head attention (one KV head per Q head) and MQA (one KV head in total)
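To make the sharing concrete, here is a minimal sketch of how query heads could map to KV heads, and how the KV-cache size scales with the number of KV heads. The function names `kv_head_for` and `kv_cache_bytes`, the contiguous grouping, and the 2-byte default per value are illustrative assumptions, not taken from any particular library.

```python
def kv_head_for(q_head, n_q_heads, n_kv_heads):
    """Index of the KV head that query head `q_head` reads from.

    Assumes contiguous groups of equal size (a hypothetical layout).
    """
    assert n_q_heads % n_kv_heads == 0, "Q heads must divide evenly into groups"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV-cache size for one sequence.

    The factor 2 counts keys and values; bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


n_q_heads = 8
mha = [kv_head_for(h, n_q_heads, 8) for h in range(n_q_heads)]  # one KV head per Q head
gqa = [kv_head_for(h, n_q_heads, 2) for h in range(n_q_heads)]  # two groups of four
mqa = [kv_head_for(h, n_q_heads, 1) for h in range(n_q_heads)]  # every Q head shares KV head 0
```

With the same layer count, head size, and sequence length, the cache size is proportional to the number of KV heads, so MQA with 8 query heads stores one eighth of what multi-head attention would.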

Write the multi-head attention formula: MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O

MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

paged attention (vLLM) improves serving throughput

Paged attention (vLLM) improves serving throughput by storing the KV cache in fixed-size pages that need not be contiguous in memory, which reduces fragmentation and wasted memory so that more requests fit in a batch
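The idea of non-contiguous KV-cache pages can be sketched with a toy block table. This is not vLLM's implementation or API; the class `PagedKVCache` and its methods are hypothetical names, and each stored "KV entry" is just an opaque Python object standing in for a token's key/value tensors.

```python
class PagedKVCache:
    """Toy block-table KV cache: one opaque entry per token (illustrative only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.blocks = [[None] * block_size for _ in range(num_blocks)]
        self.free = list(range(num_blocks))  # indices of unused physical blocks
        self.block_tables = {}               # seq_id -> list of physical block indices
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append(self, seq_id, kv):
        """Store one token's entry, allocating a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:         # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop(0))
        block = table[n // self.block_size]
        self.blocks[block][n % self.block_size] = kv
        self.lengths[seq_id] = n + 1

    def read(self, seq_id):
        """Gather a sequence's entries in logical order via its block table."""
        table = self.block_tables[seq_id]
        return [self.blocks[table[i // self.block_size]][i % self.block_size]
                for i in range(self.lengths[seq_id])]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free list."""
        self.free.extend(self.block_tables.pop(seq_id))
        del self.lengths[seq_id]


cache = PagedKVCache(num_blocks=4, block_size=2)
for t in range(3):                           # two sequences grow in lockstep
    cache.append("a", ("a", t))
    cache.append("b", ("b", t))
```

Because the two sequences grow at the same time, their physical blocks end up interleaved, yet each sequence reads back in order through its own block table. Memory is claimed one block at a time rather than reserved up front for the maximum length.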

self-attention: Attention(Q,K,V) = softmax(QK^T/√d_k)V

Attention(Q,K,V) = softmax(QK^T/√d_k)V
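A minimal sketch of this formula in plain Python, with matrices as lists of row vectors. It is meant for reading alongside the equation rather than for real workloads; the helper names `softmax` and `attention` are chosen here for illustration.

```python
import math


def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: (q . k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output row is the attention-weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

Each query row produces one output row. When all keys are identical, every score is equal, the weights are uniform, and the output is simply the mean of the value rows.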

structured pruning removes: entire filters or attention heads, not individual weights

Structured pruning removes entire filters or attention heads, not individual weights
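As a rough illustration, the sketch below drops whole rows of a weight matrix, where each row stands for one filter or attention head. Ranking rows by L1 norm is an assumption made for this example (the card does not name a criterion), and `prune_filters` is a hypothetical helper.

```python
def prune_filters(weights, keep):
    """Keep the `keep` rows with the largest L1 norm, preserving original order.

    Each row represents one filter or head; the L1-norm ranking is an
    assumed criterion chosen for illustration.
    """
    norms = [sum(abs(w) for w in row) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [weights[i] for i in kept], kept


W = [
    [0.9, -1.1, 0.7],     # L1 norm 2.7
    [0.01, 0.02, -0.01],  # L1 norm 0.04
    [-0.5, 0.6, 0.4],     # L1 norm 1.5
    [0.0, 0.1, 0.0],      # L1 norm 0.1
]
pruned, kept = prune_filters(W, keep=2)
```

The result is a smaller dense matrix rather than a sparse one of the original shape, which is why structured pruning tends to speed up inference on ordinary hardware without special sparse kernels.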

ring attention does: distributes long sequences across multiple devices

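Ring attention distributes long sequences across multiple devices: each device keeps one block of queries while key/value blocks are passed around a ring, so attention is computed block by block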
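The following single-process simulation sketches how splitting a sequence across devices can still give exact attention. Each simulated device owns one block of queries and sees one key/value block per step, combining partial results with a running softmax. Real systems send blocks between devices and overlap that communication with compute; here the "ring" is just an index rotation, and the names `ring_attention` and `full_attention` are illustrative assumptions. The sketch is non-causal.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(Q, K, V):
    """Reference: ordinary softmax(QK^T / sqrt(d_k)) V over the whole sequence."""
    scale = 1.0 / math.sqrt(len(K[0]))
    out = []
    for q in Q:
        s = [dot(q, k) * scale for k in K]
        m = max(s)
        e = [math.exp(x - m) for x in s]
        z = sum(e)
        out.append([sum(w * v[j] for w, v in zip(e, V)) / z
                    for j in range(len(V[0]))])
    return out


def ring_attention(Q, K, V, n_dev):
    """Simulate n_dev devices, each owning one contiguous block of Q, K and V."""
    n = len(Q)
    assert n % n_dev == 0, "sequence length must divide evenly across devices"
    blk = n // n_dev
    scale = 1.0 / math.sqrt(len(K[0]))
    # Per-query running state: max score, softmax denominator, weighted value sum.
    run_max = [-math.inf] * n
    denom = [0.0] * n
    acc = [[0.0] * len(V[0]) for _ in range(n)]
    for step in range(n_dev):
        for dev in range(n_dev):
            src = (dev - step) % n_dev        # KV block this device holds at this step
            Kb = K[src * blk:(src + 1) * blk]
            Vb = V[src * blk:(src + 1) * blk]
            for qi in range(dev * blk, (dev + 1) * blk):
                s = [dot(Q[qi], k) * scale for k in Kb]
                new_max = max(run_max[qi], max(s))
                corr = math.exp(run_max[qi] - new_max)  # rescale earlier partial sums
                p = [math.exp(x - new_max) for x in s]
                denom[qi] = denom[qi] * corr + sum(p)
                acc[qi] = [a * corr + sum(w * v[j] for w, v in zip(p, Vb))
                           for j, a in enumerate(acc[qi])]
                run_max[qi] = new_max
    return [[a / denom[qi] for a in acc[qi]] for qi in range(n)]


Q = [[0.1 * i, 0.2 - 0.05 * i] for i in range(4)]
K = [[0.3 - 0.1 * i, 0.05 * i] for i in range(4)]
V = [[float(i), 1.0 - 0.5 * i] for i in range(4)]
ring_out = ring_attention(Q, K, V, n_dev=2)
full_out = full_attention(Q, K, V)
```

After every device has seen every key/value block once, the blockwise result should agree with full attention up to floating-point rounding, while no device ever needs the whole sequence's keys and values at the same time.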
