`tl.arange(0, BLOCK_SIZE)` creates a range of indices from 0 to BLOCK_SIZE-1
Image: U.S. Department of the Interior National Park Service Natural Resource Stewardship and Science Fort Collins, Colorado, Public domain, via Wikimedia Commons
`tl.arange(0, BLOCK_SIZE)` creates a range of indices from 0 to BLOCK_SIZE-1
tl.program_id(0) returns: the index of the current parallel block
tl.program_id(0) returns: the index of the current parallel block
BLOCK_SIZE means in Triton: the tile size each program instance processes
BLOCK_SIZE in Triton refers to the size of the data chunk processed by each program instance
tl.where(mask, x, 0) does: conditional select to handle boundary conditions
`tl.where(mask, x, 0) = x if mask else 0`
tl.dot does in Triton: block-level matrix multiply using tensor cores
tl.dot performs block-level matrix multiplication using tensor cores in Triton
tl.load and tl.store do in Triton: read/write tensors from/to GPU global memory
`tl.load` reads tensors from GPU memory; `tl.store` writes tensors to GPU memory
Effect size
Cohen's D benchmarks: 0.2 = small, 0.5 = medium, 0.8 = large effect
One email a day: 5 concepts + the 5 stories that matter →
Swipe through 100 ML concepts daily
Open TickerNews