`fused_softmax_kernel(input, output): row_max = max_pool2d(input, row_length); exp_diff = exp(input - row_max); softmax_sum = sum(exp_diff, axis=1); output = exp_diff / softmax_sum`
`fused_softmax_kernel(input, output): row_max = max_pool2d(input, row_length); exp_diff = exp(input - row_max); softmax_sum = sum(exp_diff, axis=1); output = exp_diff / softmax_sum`
Write the attention score formula before softmax: e_ij = a(s_i, h_j)
Attention score formula: e_ij = a(s_i, h_j) = exp(tanh(W_s * s_i + W_h * h_j + b))
What a CUDA kernel is — a function that runs on thousands of GPU threads in parallel
CUDA kernel: Parallel function executed on GPU's thousands of threads simultaneously
What operator fusion does at the compiler level: merges adjacent ops to reduce memory traffic
Operator fusion optimizes code by combining adjacent operations into a single instruction, minimizing memory access
What LoRA does — adds trainable low-rank matrices A and B where ΔW = BA
LoRA: Augments model weights with low-rank matrices A, B, ΔW = BA
What tl.arange(0, BLOCK_SIZE) creates: a range of indices within the current block
`np.arange(0, BLOCK_SIZE)` generates an array of indices from 0 to BLOCK_SIZE-1
Write the equation for cross-entropy loss
H(y, p) = -Σ(y_i * log(p_i)) for all i
One email a day: 5 concepts + the 5 stories that matter →
Swipe through 100 ML concepts daily
Open TickerNews