Write the multi-head attention formula: MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O

MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
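
A minimal NumPy sketch of the multi-head formula, assuming each head applies scaled dot-product attention to its own projections of Q, K and V. The names `multi_head`, `W_q`, `W_k`, `W_v`, `W_o` and all shapes are illustrative choices, not part of the card.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k: (h, d_model, d_k); W_v: (h, d_model, d_v); W_o: (h * d_v, d_model)
    heads = [
        attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
        for i in range(W_q.shape[0])
    ]
    # Concatenate the h head outputs along the feature axis, then project with W^O.
    return np.concatenate(heads, axis=-1) @ W_o

# Toy self-attention example: Q = K = V = X (random data, for shape checking only).
rng = np.random.default_rng(0)
n, d_model, h = 6, 8, 2
d_k = d_v = d_model // h
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_v))
W_o = rng.normal(size=(h * d_v, d_model))
out = multi_head(X, X, X, W_q, W_k, W_v, W_o)
```

The loop over heads keeps the code close to the formula; practical implementations usually batch all heads into one tensor operation.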

Related concepts

self-attention: Attention(Q,K,V) = softmax(QK^T/√d_k)V

Attention(Q,K,V) = softmax(QK^T/√d_k)V
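
The formula can be sketched in a few lines of NumPy. This is an illustrative version for single, unbatched matrices; the function names and the random toy inputs are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) scaled similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # output has shape (n_q, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
out, weights = attention(Q, K, V)
```

In self-attention, Q, K and V are all derived from the same input sequence; the formula itself does not change.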

multi-query attention (MQA) is

Multi-query attention (MQA): all query heads share a single key/value (KV) head, which shrinks the KV cache and reduces memory traffic during decoding

Write the attention score formula before softmax: e_ij = a(s_i, h_j)

Attention score formula: e_ij = a(s_i, h_j). The softmax comes afterwards and turns the scores into weights: α_ij = exp(e_ij) / ∑_k exp(e_ik)
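
A small sketch of the two steps: the raw score e_ij is computed first, and the softmax over j is applied afterwards to get the weights. The card leaves the scoring function a unspecified, so the additive form v^T tanh(W_s s_i + W_h h_j) used here is an assumption, and `W_s`, `W_h`, `v` are hypothetical parameter names.

```python
import numpy as np

def additive_scores(S, H, W_s, W_h, v):
    # S: (n_s, d_s) states s_i; H: (n_h, d_h) states h_j
    # W_s: (d_s, d_a), W_h: (d_h, d_a), v: (d_a,)
    # Broadcast so every (i, j) pair gets its own projected vector.
    proj = (S @ W_s)[:, None, :] + (H @ W_h)[None, :, :]  # (n_s, n_h, d_a)
    return np.tanh(proj) @ v                               # e_ij, shape (n_s, n_h)

def softmax_rows(e):
    # Normalise each row of scores into attention weights.
    e = e - e.max(axis=-1, keepdims=True)
    w = np.exp(e)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
S = rng.normal(size=(2, 4))
H = rng.normal(size=(3, 5))
W_s = rng.normal(size=(4, 6))
W_h = rng.normal(size=(5, 6))
v = rng.normal(size=(6,))

e = additive_scores(S, H, W_s, W_h, v)  # unnormalised scores (before softmax)
alpha = softmax_rows(e)                 # attention weights (after softmax)
```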

grouped query attention (GQA) does

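GQA splits the query heads into groups, and each group shares one KV head. It sits between multi-head attention (one KV head per query head) and MQA (one KV head in total), trading a smaller KV cache against model quality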
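
A rough NumPy sketch of the sharing pattern, assuming the query heads are already projected and the number of query heads is a multiple of the number of KV heads. The function name and shapes are illustrative. With a single KV head the same code behaves like MQA.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def grouped_query_attention(Q, K, V):
    # Q: (h_q, n, d_k); K: (h_kv, n, d_k); V: (h_kv, n, d_v)
    h_q, h_kv = Q.shape[0], K.shape[0]
    if h_q % h_kv != 0:
        raise ValueError("h_q must be a multiple of h_kv")
    group = h_q // h_kv
    # Repeat each KV head so that consecutive query heads share it.
    K_rep = np.repeat(K, group, axis=0)  # (h_q, n, d_k)
    V_rep = np.repeat(V, group, axis=0)  # (h_q, n, d_v)
    scores = Q @ np.swapaxes(K_rep, -1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V_rep  # (h_q, n, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 5, 8))  # 4 query heads
K = rng.normal(size=(2, 5, 8))  # 2 KV heads, so 2 query heads per group
V = rng.normal(size=(2, 5, 3))

out_gqa = grouped_query_attention(Q, K, V)          # grouped sharing
out_mqa = grouped_query_attention(Q, K[:1], V[:1])  # one KV head: MQA-style
```

Materialising the repeated K and V is only for readability; the point of the sharing is that only `h_kv` KV heads need to be stored.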

convolution (f * g)(t) = ∫f(τ)g(t-τ)dτ

(f * g)(t) = ∫_{-∞}^{∞} f(τ) g(t-τ) dτ
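
For sampled signals the integral becomes a sum over τ. The sketch below writes that sum out directly and compares it with NumPy's `np.convolve`; the helper name `conv1d_full` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def conv1d_full(f, g):
    # Discrete analogue of the integral: out[t] = sum over tau of f[tau] * g[t - tau]
    out = np.zeros(len(f) + len(g) - 1)
    for t in range(len(out)):
        for tau in range(len(f)):
            if 0 <= t - tau < len(g):
                out[t] += f[tau] * g[t - tau]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
result = conv1d_full(f, g)
reference = np.convolve(f, g)  # NumPy's "full" mode is the default
```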

Mutual information

Mutual information formula: I(X;Y) = ∑_{x∈X} ∑_{y∈Y} p(x,y) log( p(x,y) / (p(x) p(y)) )
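
A short sketch that evaluates the double sum from a joint probability table, assuming rows index x and columns index y and using the natural log (so the result is in nats). The function name and the two toy distributions are illustrative.

```python
import numpy as np

def mutual_information(p_xy):
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x), shape (n_x, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y), shape (1, n_y)
    mask = p_xy > 0                        # terms with p(x,y) = 0 contribute 0
    ratio = p_xy[mask] / (p_x @ p_y)[mask]
    return float(np.sum(p_xy[mask] * np.log(ratio)))

# Independent variables: p(x,y) = p(x) p(y), so I(X;Y) is 0 up to rounding.
independent = np.outer([0.5, 0.5], [0.3, 0.7])
# Y is a copy of a fair binary X, so I(X;Y) = H(X) = log 2.
identical = np.array([[0.5, 0.0], [0.0, 0.5]])

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
```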
