
Attention(Q,K,V) = softmax(QKᵀ/√d_k)V
Image: GruenerBogen, CC BY-SA 4.0, via Wikimedia Commons
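The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name, the single-sequence (unbatched) shapes, and the random inputs are assumptions made for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # QK^T / sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilise exp()
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Hypothetical toy inputs: 2 queries, 3 keys/values.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the keys, and each row of `out` is the corresponding weighted average of the rows of `V`.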
Write the attention score formula before softmax: e_ij = a(s_i, h_j)
Attention score formula: e_ij = a(s_i, h_j). The softmax is applied afterwards to turn scores into weights: α_ij = exp(e_ij) / Σ_k exp(e_ik)
Write the multi-head attention formula: MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O
MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
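A rough NumPy sketch of multi-head attention, using the standard definition head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) from "Attention Is All You Need". The helper names, the per-head lists of projection matrices, and the random weights are assumptions for illustration only; real implementations usually fuse the heads into one batched matrix multiply.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # stabilise exp()
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: lists with one projection matrix per head.
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    # Concatenate heads along the feature axis, then apply W^O.
    return np.concatenate(heads, axis=-1) @ W_o

# Hypothetical sizes: 3 tokens, model width 8, 2 heads of width 4.
rng = np.random.default_rng(0)
n, d_model, h = 3, 8, 2
d_head = d_model // h
X = rng.normal(size=(n, d_model))
W_q = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_o = rng.normal(size=(h * d_head, d_model))
Y = multi_head(X, X, X, W_q, W_k, W_v, W_o)  # self-attention: Q = K = V = X
```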
Write the Bellman equation for reinforcement learning
Bellman equation: V(s) = max_a [R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')]
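One common way to solve this equation is value iteration: apply the right-hand side repeatedly until V stops changing. The sketch below assumes a tiny, made-up two-state MDP; the state names, actions, rewards, and the dictionary layout of P and R are hypothetical.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """P[s][a] maps next state -> probability; R[s][a] is the reward."""
    V = {s: 0.0 for s in states}
    while True:
        # One Bellman backup for every state.
        new_V = {
            s: max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in actions
            )
            for s in states
        }
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V

# Hypothetical deterministic MDP with states A and B.
states = ["A", "B"]
actions = ["stay", "go"]
P = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}},
}
R = {
    "A": {"stay": 0.0, "go": 1.0},
    "B": {"stay": 2.0, "go": 0.0},
}
V = value_iteration(states, actions, P, R, gamma=0.9)
```

In this toy problem, staying in B forever earns 2 per step, so V(B) = 2 / (1 − 0.9) = 20, and the best move from A is to go to B, giving V(A) = 1 + 0.9 · 20 = 19.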
Softmax function
Softmax converts a vector of real numbers into a probability distribution: softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
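A minimal pure-Python sketch of softmax. Subtracting the maximum before exponentiating is a common stability trick that does not change the result; the function name and sample inputs are assumptions for the example.

```python
import math

def softmax(xs):
    m = max(xs)                           # shift so the largest exponent is 0
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
big = softmax([1000.0, 1001.0])           # would overflow without the shift
```

The outputs are non-negative, sum to 1, and preserve the ordering of the inputs.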
Attention Is All You Need
O(n) complexity for long sequences
Entropy (information theory)
H(X) = −Σ_{x∈X} p(x) log p(x)
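The entropy formula can be checked numerically with a short sketch. The formula above leaves the logarithm base unspecified; base 2 (entropy in bits) is assumed here, and terms with p(x) = 0 are skipped under the usual convention that 0 · log 0 = 0.

```python
import math

def entropy_bits(probs):
    """Shannon entropy of a discrete distribution, in bits (log base 2)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy_bits([0.5, 0.5])
certain = entropy_bits([1.0, 0.0])
biased = entropy_bits([0.9, 0.1])
```

A fair coin gives exactly 1 bit, a certain outcome gives 0, and a biased coin falls in between.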