to write a fused softmax kernel in Triton: load row, compute max, subtract, exp, sum, divide

`output = exp(row - max_val) / sum(exp(row - max_val))`

Related concepts

to write a vector addition kernel in Triton: load blocks, add, store

```

Softmax function

Softmax converts real numbers into a probability distribution

a Triton kernel is

Triton kernel: Python-based GPU programming that compiles to PTX

fused kernels do

Fused kernels combine multiple operations into one kernel to avoid memory round-trips

tl.dot does in Triton: block-level matrix multiply using tensor cores

tl.dot performs block-level matrix multiplication using tensor cores in Triton

kernel fusion reduces memory bandwidth bottleneck

Kernel fusion reduces memory bandwidth bottleneck by combining multiple operations into a single kernel, minimizing data transfers

Swipe through 100 ML concepts daily