MQA: Multi-query attention with shared key-value head for efficient cross-query processing

What multi-query attention (MQA) is: all query heads share a single key-value head

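MQA keeps a separate query projection for each head but shares one key projection and one value projection across all heads; the KV cache shrinks by a factor equal to the number of heads, which cuts memory traffic during autoregressive decoding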
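
A minimal NumPy sketch of the idea, assuming no causal mask, no batching, and no output projection. The function name `multi_query_attention` and all weight shapes are illustrative choices, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """x: (seq, d_model); w_q: (d_model, n_heads * d_head);
    w_k and w_v: (d_model, d_head), i.e. one shared KV head."""
    seq = x.shape[0]
    d_head = w_k.shape[1]
    # One query projection per head: (n_heads, seq, d_head).
    q = (x @ w_q).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    # A single key and value projection shared by every head: (seq, d_head).
    k = x @ w_k
    v = x @ w_v
    scores = q @ k.T / np.sqrt(d_head)      # (n_heads, seq, seq)
    weights = softmax(scores, axis=-1)
    out = weights @ v                       # (n_heads, seq, d_head)
    # Concatenate the heads back into (seq, n_heads * d_head).
    return out.transpose(1, 0, 2).reshape(seq, n_heads * d_head)
```

With this layout the cache holds 2 * d_head values per token, versus 2 * n_heads * d_head for standard multi-head attention.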

Related concepts

How do attention mechanisms in transformer models improve language understanding by dynamically weighting input tokens during sequence encoding?

Attention assigns each input token a weight that depends on the current query, so every token's representation is built from the context most relevant to it
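
As a rough illustration of "dynamic weights", the sketch below computes scaled dot-product weights for one query over a handful of keys. The helper name `attention_weights` is hypothetical, and the scaling by the square root of the key dimension is the usual transformer convention rather than something stated here.

```python
import numpy as np

def attention_weights(query, keys):
    """query: (d,); keys: (n, d). Returns n weights that sum to 1."""
    d = keys.shape[-1]
    scores = keys @ query / np.sqrt(d)   # one score per input token
    scores = scores - scores.max()       # stabilise the exponentials
    w = np.exp(scores)
    return w / w.sum()
```

Changing the query changes the weights, which is what makes the weighting dynamic rather than fixed.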

What operator fusion does at the compiler level: merges adjacent ops to reduce memory traffic

Operator fusion combines adjacent operations into a single kernel, so intermediate results stay in registers or cache instead of being written to and read back from memory
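
A toy sketch of the effect, with Python lists standing in for memory buffers. A real compiler performs this rewrite on its intermediate representation or on GPU kernels; the functions `unfused` and `fused` are purely illustrative.

```python
def unfused(xs, scale, bias):
    scaled = [x * scale for x in xs]        # pass 1: writes an intermediate buffer
    shifted = [s + bias for s in scaled]    # pass 2: reads it back, writes another
    return [max(t, 0.0) for t in shifted]   # pass 3: ReLU over a third buffer

def fused(xs, scale, bias):
    # One pass over the input and no intermediate buffers.
    return [max(x * scale + bias, 0.0) for x in xs]
```

Both functions return the same values; the fused version touches each element once instead of three times.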

Why attention is O(n²) in sequence length: every token attends to every other token

Each of the n tokens computes a score against all n tokens, so the score matrix has n² entries and both time and memory grow quadratically with sequence length
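
The pairwise count can be made concrete with a short sketch, assuming full (unmasked) self-attention. The helper `attention_score_count` is a hypothetical name used only for illustration.

```python
def attention_score_count(n_tokens):
    """Count the query-key scores computed by full self-attention."""
    count = 0
    for i in range(n_tokens):        # every token acts as a query ...
        for j in range(n_tokens):    # ... and is scored against every token
            count += 1
    return count
```

Doubling the sequence length quadruples the count: 4 tokens give 16 scores, 8 tokens give 64.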

What AWQ does differently: activation-aware weight quantization preserves important weights

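AWQ uses activation statistics to find the small fraction of weight channels that matter most, then scales those channels before quantization so they lose less precision; activations themselves are not quantized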
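
A simplified sketch of the scaling trick, under loud assumptions: the exponent `alpha` is fixed at 0.5 here, whereas the published method searches for it on calibration data, and quantization is plain per-column round-to-nearest rather than grouped. The names `quantize_rtn` and `awq_style_quantize` are hypothetical and do not correspond to any library API.

```python
import numpy as np

def quantize_rtn(w, n_bits=4):
    """Symmetric round-to-nearest with one step size per output column.
    Returns the dequantized weights for easy comparison."""
    q_max = 2 ** (n_bits - 1) - 1
    step = np.abs(w).max(axis=0, keepdims=True) / q_max
    step = np.where(step == 0, 1.0, step)   # avoid dividing by zero
    return np.clip(np.round(w / step), -q_max - 1, q_max) * step

def awq_style_quantize(w, calib_x, alpha=0.5, n_bits=4):
    """w: (d_in, d_out); calib_x: (n_samples, d_in) calibration activations."""
    # Input channels with large average activation get a larger scale.
    s = np.maximum(np.abs(calib_x).mean(axis=0) ** alpha, 1e-8)
    # Scale weights up per input channel, then quantize.
    w_q = quantize_rtn(w * s[:, None], n_bits)
    # At inference the activations are divided by s: (x / s) @ w_q.
    return w_q, s
```

Without quantization the rescaling is an exact identity, (x / s) @ (w * s[:, None]) equals x @ w, so the scaling only changes where rounding error lands.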

Write the attention score formula before softmax: e_ij = a(s_i, h_j)

Additive attention score: e_ij = a(s_i, h_j) = v^T * tanh(W_s * s_i + W_h * h_j + b); the exponential enters only in the softmax that follows, α_ij = exp(e_ij) / Σ_k exp(e_ik)
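
A minimal sketch of an additive (Bahdanau-style) score followed by the softmax step, assuming a learned vector `v` that reduces the tanh output to a scalar. The function names and parameter shapes are illustrative assumptions.

```python
import numpy as np

def additive_score(s_i, h_j, w_s, w_h, b, v):
    """Pre-softmax score: v^T tanh(W_s s_i + W_h h_j + b).
    s_i: (d_s,); h_j: (d_h,); w_s: (d_a, d_s); w_h: (d_a, d_h); b, v: (d_a,)."""
    return float(v @ np.tanh(w_s @ s_i + w_h @ h_j + b))

def softmax_over_scores(scores):
    """Turn the scores e_i1 .. e_in for one decoder state into weights."""
    scores = np.asarray(scores, dtype=float)
    e = np.exp(scores - scores.max())   # the exp belongs here, not in the score
    return e / e.sum()
```

Because tanh is bounded in [-1, 1], the raw score can never exceed the sum of the absolute values of `v`.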

What database sharding does: splits data across machines by a partition key

Database sharding distributes data across multiple machines using a partition key for scalability and performance
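
One common way to route rows is to hash the partition key and take the result modulo the shard count. The sketch below assumes string keys and a fixed number of shards; `shard_for` is a hypothetical helper, not part of any database client.

```python
import hashlib

def shard_for(partition_key: str, n_shards: int) -> int:
    """Map a partition key to a shard index in [0, n_shards)."""
    # A cryptographic hash gives the same result on every machine and
    # every run, unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards
```

Plain modulo routing moves most keys when the shard count changes, which is why production systems often use consistent hashing or range-based partitioning instead.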
