Self-attention: Attention(Q,K,V) = softmax(QKᵀ/√d_k)V

Attention(Q,K,V) = softmax(QKᵀ/√d_k)V
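The formula can be traced step by step in a few lines of plain Python. This is a minimal sketch, assuming Q, K and V are lists of row vectors; the helper names (softmax, attention) and the toy inputs are illustrative and not part of the card.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: (q . k) / sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output row is the weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Toy inputs (assumed for illustration): one query, two keys, two values.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

Because the query matches the first key more closely, the output row lands nearer to the first value row than to the second.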

Related concepts

Write the attention score formula before softmax: e_ij = a(s_i, h_j)

Attention score formula: e_ij = a(s_i, h_j). Applying softmax over j then gives the attention weights α_ij = exp(e_ij) / Σ_k exp(e_ik)

Write the multi-head attention formula: MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O

MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
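A rough sketch of multi-head attention in plain Python, assuming the standard Transformer formulation in which each head applies scaled dot-product attention to linearly projected inputs, head_i = Attention(QW_i^Q, KW_i^K, VW_i^V). The function names, the `heads` argument layout and the tiny projection matrices are all hypothetical choices made for illustration.

```python
import math

def matmul(A, B):
    # Plain list-of-lists matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(Q, K, V, heads, W_O):
    # heads: one (W_Q, W_K, W_V) projection triple per head (assumed layout).
    outputs = [
        attention(matmul(Q, W_Q), matmul(K, W_K), matmul(V, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concatenate the head outputs along the feature axis, row by row.
    concat = [sum((o[i] for o in outputs), []) for i in range(len(Q))]
    return matmul(concat, W_O)

# Toy setup (assumed): model width 2, two heads of width 1 each.
pick_first = [[1.0], [0.0]]
pick_second = [[0.0], [1.0]]
heads = [(pick_first,) * 3, (pick_second,) * 3]
W_O = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 2.0], [3.0, 4.0]]
out = multi_head(X, X, X, heads, W_O)
```

Each head here attends over a single feature of the input, and the identity W^O leaves the concatenated result unchanged, which keeps the data flow easy to follow.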

Write the Bellman equation for reinforcement learning

Bellman optimality equation: V(s) = max_a [R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')]
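One common way to use this equation is value iteration, which applies the right-hand side repeatedly until V stops changing. The sketch below assumes a tiny, made-up two-state MDP; the state names, rewards, transition table and the value_iteration helper are hypothetical.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Repeat the Bellman optimality update until the largest change is below tol."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # max over actions of R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
            best = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Hypothetical MDP: staying in B pays 1 per step, everything else pays 0.
states = ["A", "B"]
actions = ["stay", "move"]
P = {
    ("A", "stay"): {"A": 1.0},
    ("A", "move"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0},
    ("B", "move"): {"A": 1.0},
}
R = {("A", "stay"): 0.0, ("A", "move"): 0.0, ("B", "stay"): 1.0, ("B", "move"): 0.0}
V = value_iteration(states, actions, P, R)
```

With γ = 0.9 the fixed point can be checked by hand: V(B) = 1 + 0.9 V(B) gives 10, and V(A) = 0.9 V(B) gives 9.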

Softmax function

Softmax converts a vector of real numbers into a probability distribution: softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
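A minimal sketch of softmax in plain Python. Subtracting the maximum before exponentiating is a common stability trick that leaves the result unchanged; the function name and sample inputs are illustrative.

```python
import math

def softmax(xs):
    # Shifting by the max avoids overflow in exp() without changing the result.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
large = softmax([1000.0, 1000.0])  # would overflow without the max shift
```

The outputs are non-negative, sum to 1, and keep the ordering of the inputs.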

Attention Is All You Need

O(n²·d) per-layer complexity for self-attention over a sequence of length n, which becomes costly for long sequences

Entropy (information theory)

H(X) = −∑_{x∈X} p(x) log p(x)
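A small sketch of the entropy formula in plain Python, assuming a discrete distribution given as a list of probabilities and using log base 2 so the result is in bits. The function name and the example distributions are illustrative.

```python
import math

def entropy(probs, base=2.0):
    # Outcomes with p = 0 contribute nothing, by the usual 0 * log 0 = 0 convention.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])
biased_coin = entropy([0.9, 0.1])
certain = entropy([1.0, 0.0])
```

A fair coin carries 1 bit of entropy, a biased coin carries less, and a certain outcome carries none.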
