Attention score formula (additive attention): e_ij = a(s_i, h_j) = v^T tanh(W_s * s_i + W_h * h_j + b)
The score e_ij is unnormalized; exp enters only in the softmax that turns scores into weights: α_ij = exp(e_ij) / Σ_k exp(e_ik)
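The additive score and its softmax normalization can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the projection vector `v`, the helper names (`matvec`, `additive_score`, `attention_weights`) and all numeric values are assumptions made up for the example.

```python
import math

def matvec(M, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def additive_score(s_i, h_j, W_s, W_h, b, v):
    """Unnormalized score e_ij = v^T tanh(W_s s_i + W_h h_j + b)."""
    pre = [a + c + d for a, c, d in zip(matvec(W_s, s_i), matvec(W_h, h_j), b)]
    return sum(vk * math.tanh(p) for vk, p in zip(v, pre))

def attention_weights(s_i, hs, W_s, W_h, b, v):
    """Softmax over the scores of one state s_i against every h_j."""
    scores = [additive_score(s_i, h_j, W_s, W_h, b, v) for h_j in hs]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]

# Toy parameters and states, chosen only for illustration.
W_s = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.4, 0.0], [-0.3, 0.2]]
b = [0.0, 0.1]
v = [1.0, -1.0]
s = [1.0, 0.5]
hs = [[0.2, 0.1], [1.0, -1.0], [-0.5, 0.3]]

alphas = attention_weights(s, hs, W_s, W_h, b, v)
print(alphas)  # three positive weights that sum to 1
```

Because tanh is bounded in (-1, 1), each score is bounded by the sum of |v_k|, which keeps the softmax inputs in a modest range.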
How to write a fused softmax kernel in Triton: load row, compute max, subtract, exp, sum, divide
`fused_softmax_kernel(input, output): row = load(input[row_idx, :]); row_max = max(row); exp_diff = exp(row - row_max); softmax_sum = sum(exp_diff); store(output[row_idx, :], exp_diff / softmax_sum)`
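The same steps can be checked on the CPU with a plain-Python reference. This is not Triton code. It is a hedged sketch of the per-row arithmetic that a fused kernel would perform after loading a row once, and the function names are hypothetical.

```python
import math

def softmax_row(row):
    """Numerically stable softmax of one row, in the order the steps are listed."""
    row_max = max(row)                                # reduce: row maximum
    exp_diff = [math.exp(x - row_max) for x in row]   # subtract, then exponentiate
    softmax_sum = sum(exp_diff)                       # reduce: sum of exponentials
    return [e / softmax_sum for e in exp_diff]        # normalize

def fused_softmax_reference(matrix):
    """Apply softmax_row to every row; a GPU kernel would typically handle one row per program instance."""
    return [softmax_row(row) for row in matrix]

out = fused_softmax_reference([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
print(out)
```

Subtracting the row maximum is what keeps the second row from overflowing: exp(1000.0) is out of range for a float, while exp(0.0) is not.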
Why temperature T in softmax(x/T) controls entropy: T→0 is argmax, T→∞ is uniform
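As T approaches zero, softmax approaches a one-hot argmax distribution, which minimizes entropy (H → 0, assuming a unique maximum); as T→∞ it approaches the uniform distribution, which maximizes entropy (H → log n for n classes)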
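A small numeric check makes the two limits concrete. The sketch below is illustrative only: the logits and the three temperatures are arbitrary values, and `softmax_t` and `entropy` are hypothetical helper names.

```python
import math

def softmax_t(logits, T):
    """softmax(x / T), computed with the usual max-subtraction for stability."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

logits = [2.0, 1.0, 0.5]  # arbitrary example logits
for T in (0.05, 1.0, 100.0):
    p = softmax_t(logits, T)
    print(f"T={T}: p={[round(q, 3) for q in p]}, H={entropy(p):.4f}")
```

At the low temperature nearly all mass sits on the largest logit and the entropy is close to 0; at the high temperature the distribution is close to uniform and the entropy is close to log 3.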
What multi-query attention (MQA) is — all Q heads share a single KV head
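MQA: every query head keeps its own projection, but all heads share a single key head and a single value head. This shrinks the KV cache by a factor equal to the number of heads and cuts memory traffic during autoregressive decoding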
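The effect of sharing one KV head shows up in a back-of-the-envelope cache size calculation. The model shape below is hypothetical and the formula is a simplified estimate (keys plus values, per layer, 2 bytes per element assumed for 16-bit storage).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values (the leading factor 2) across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical model shape, chosen only for illustration.
n_layers, n_heads, head_dim, seq_len = 32, 32, 128, 4096

mha = kv_cache_bytes(n_layers, n_kv_heads=n_heads, head_dim=head_dim, seq_len=seq_len)
mqa = kv_cache_bytes(n_layers, n_kv_heads=1, head_dim=head_dim, seq_len=seq_len)

print(f"MHA: {mha / 2**20:.0f} MiB, MQA: {mqa / 2**20:.0f} MiB, ratio: {mha // mqa}x")
```

Only the number of KV heads changes between the two calls, so the ratio equals the number of query heads.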
Write the equation for cross-entropy loss
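H(y, p) = -Σ_i y_i * log(p_i), summed over all classes i, where y is the target distribution and p the predicted probabilities; for a one-hot y this reduces to -log(p_c) for the true class c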
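The formula translates directly into a few lines of Python. This is a minimal sketch: the `eps` clamp is an added safeguard against log(0), and the probability vectors are made-up examples.

```python
import math

def cross_entropy(y, p, eps=1e-12):
    """H(y, p) = -sum_i y_i * log(p_i); eps guards against log(0)."""
    return -sum(yi * math.log(max(pi, eps)) for yi, pi in zip(y, p))

y = [0.0, 1.0, 0.0]        # one-hot target: the true class is index 1
p_good = [0.1, 0.8, 0.1]   # confident and correct prediction
p_bad = [0.6, 0.2, 0.2]    # most mass on the wrong class

loss_good = cross_entropy(y, p_good)  # equals -log(0.8)
loss_bad = cross_entropy(y, p_bad)    # equals -log(0.2)
print(loss_good, loss_bad)
```

With a one-hot target only the true-class term survives, so the loss depends solely on the probability assigned to the correct class.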
Why attention is O(n²) in sequence length: every token attends to every other token
Each of the n tokens computes a score against each of the n tokens, so the score matrix has n² entries; time and memory for the scores therefore grow quadratically with sequence length
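Counting the pairwise scores shows the quadratic growth directly. The sketch below is a rough estimate for a single head; the head dimension d=64 and the sequence lengths are arbitrary, and the operation count ignores the softmax itself.

```python
def attention_score_count(n):
    """Number of query-key scores in one head for a sequence of n tokens."""
    return n * n

def attention_flops(n, d):
    """Rough multiply-add count: n*n*d for Q K^T plus n*n*d for weights times V."""
    return 2 * n * n * d

for n in (1024, 2048, 4096):
    print(n, attention_score_count(n), attention_flops(n, d=64))
```

Doubling the sequence length quadruples both numbers, which is the practical meaning of O(n²).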
How does the attention mechanism in transformer models improve language understanding by dynamically weighting input tokens during sequence encoding?
Each token's output is a weighted sum over all tokens in the sequence, with weights computed from the input itself, so a token's representation depends on its context rather than on a fixed window
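A tiny self-attention pass illustrates the dynamic weighting. This sketch uses scaled dot-product attention and, as a simplifying assumption, feeds the raw token vectors in as queries, keys and values with no learned projections; the token vectors are made up.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(Q, K, V):
    """Scaled dot-product attention; returns the outputs and the weight rows."""
    d = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One weight per key, computed from this query's similarity to each key.
        w = softmax([dot(q, k) / math.sqrt(d) for k in K])
        weights.append(w)
        # Output is the weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
out, w = attention(X, X, X)
print(w[0])  # weights the first token places on tokens 0, 1 and 2
```

The first token puts equal weight on tokens 0 and 2, which overlap with it, and less weight on token 1, which is orthogonal to it. A different input sequence would produce different weights, which is what "dynamic" means here.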