`fused_softmax_kernel(input, output): row_max = max_pool2d(input, row_length); exp_diff = exp(input - row_max); softmax_sum = sum(exp_diff, axis=1); output = exp_diff / softmax_sum`

How to write a fused softmax kernel in Triton: load row, compute max, subtract, exp, sum, divide

`fused_softmax_kernel(input, output): row_max = max_pool2d(input, row_length); exp_diff = exp(input - row_max); softmax_sum = sum(exp_diff, axis=1); output = exp_diff / softmax_sum`

Ask Claude to explain

Related concepts

Write the attention score formula before softmax: e_ij = a(s_i, h_j)

Attention score formula: e_ij = a(s_i, h_j) = exp(tanh(W_s * s_i + W_h * h_j + b))

What a CUDA kernel is — a function that runs on thousands of GPU threads in parallel

CUDA kernel: Parallel function executed on GPU's thousands of threads simultaneously

What operator fusion does at the compiler level: merges adjacent ops to reduce memory traffic

Operator fusion optimizes code by combining adjacent operations into a single instruction, minimizing memory access

What LoRA does — adds trainable low-rank matrices A and B where ΔW = BA

LoRA: Augments model weights with low-rank matrices A, B, ΔW = BA

What tl.arange(0, BLOCK_SIZE) creates: a range of indices within the current block

`np.arange(0, BLOCK_SIZE)` generates an array of indices from 0 to BLOCK_SIZE-1

Write the equation for cross-entropy loss

H(y, p) = -Σ(y_i * log(p_i)) for all i

One email a day: 5 concepts + the 5 stories that matter →

Swipe through 100 ML concepts daily

Open TickerNews