MultiHead(Q,K,V) = Concat(head_1, …, head_h)W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
Image: Mushki Brichta, CC BY-SA 4.0, via Wikimedia Commons
Self-attention (scaled dot-product, with Q, K and V all projected from the same sequence): Attention(Q,K,V) = softmax(QK^T/√d_k)V
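The scaled dot-product formula above can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: matrices are lists of rows, and the helper names `softmax` and `attention` are chosen here for the example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Return (outputs, weights) for softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One scaled dot-product score per key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value rows.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Two identical keys get equal weight, so the output averages the two value rows.
out, w = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])
```

Dividing by √d_k keeps the scores from growing with the key dimension, which would otherwise push the softmax toward a near one-hot distribution.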
Multi-query attention (MQA): all query heads share a single key/value head, which shrinks the KV cache and cuts memory traffic during decoding
Write the attention score formula before softmax: e_ij = a(s_i, h_j)
Attention score formula: e_ij = a(s_i, h_j); the softmax is applied afterwards to give the attention weights α_ij = exp(e_ij) / ∑_k exp(e_ik)
Grouped-query attention (GQA): query heads are split into groups, and each group shares one key/value head. With one group it reduces to MQA; with one group per query head it is standard multi-head attention
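The head sharing in MQA and GQA can be sketched as an index mapping. This is an illustrative sketch: the function names are hypothetical, and the assumption that consecutive query heads form a group is one common convention rather than something stated above.

```python
def kv_head_for(q_head, n_q_heads, n_kv_heads):
    """Index of the KV head used by a query head (contiguous groups assumed)."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_cache_elements(n_kv_heads, head_dim, seq_len):
    """Cached K and V elements per layer for one sequence."""
    return 2 * n_kv_heads * head_dim * seq_len

# 8 query heads: GQA with 2 KV heads, MQA with 1, standard multi-head with 8.
gqa = [kv_head_for(h, 8, 2) for h in range(8)]
mqa = [kv_head_for(h, 8, 1) for h in range(8)]
mha = [kv_head_for(h, 8, 8) for h in range(8)]
```

The cache size scales with the number of KV heads, not query heads, which is where the savings come from: under these assumptions MQA stores one eighth of what 8-head attention stores.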
Convolution: (f * g)(t) = ∫_{-∞}^{∞} f(τ) g(t − τ) dτ
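For sampled signals the integral becomes a sum, (f * g)[n] = ∑_k f[k] g[n − k]. A minimal sketch of that discrete analogue, assuming finite sequences that are zero outside their listed values (the function name `convolve` is chosen for this example):

```python
def convolve(f, g):
    """Full discrete convolution of two finite sequences."""
    out = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            # The term f[i] * g[j] contributes to output index n = i + j.
            out[i + j] += fi * gj
    return out

print(convolve([1, 2, 3], [1, 1]))  # [1, 3, 5, 3]
```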
Mutual information
Mutual information formula: I(X;Y) = ∑_{x∈X} ∑_{y∈Y} p(x,y) log( p(x,y) / (p(x) p(y)) )
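The double sum can be evaluated directly from a joint probability table. This sketch assumes the joint distribution is given as a list of rows that sums to 1, uses the natural logarithm (so the result is in nats), and treats zero-probability cells as contributing nothing. The function name is chosen for this example.

```python
import math

def mutual_information(joint):
    """I(X;Y) in nats from a joint table with joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]        # marginal p(x)
    py = [sum(col) for col in zip(*joint)]  # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:  # 0 * log 0 is taken as 0
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])  # X, Y independent
dependent = mutual_information([[0.5, 0.0], [0.0, 0.5]])        # Y determined by X
```

Independent variables give zero, since p(x,y) = p(x)p(y) makes every log term vanish; in the second table Y is fully determined by X, and the result equals log 2.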