
`BLOCK_SIZE` in Triton is the number of elements each program instance processes. By convention it is a compile-time constant, passed to the kernel as a `tl.constexpr` parameter.
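How Triton differs from CUDA
Triton uses block-level programming: each program instance operates on a whole block of elements, and the compiler maps that work onto threads. CUDA uses thread-level programming, where the author writes the work of a single thread.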
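How BLOCK_SIZE is tuned in Triton
Different block sizes are optimal on different hardware, because the size affects memory access patterns and computational throughput. Triton does not pick the size on its own: the `@triton.autotune` decorator benchmarks a list of candidate configurations supplied by the author and reuses the fastest one.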
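The idea behind tuning a block size can be sketched on the CPU: time each candidate and keep the fastest. The sketch below is plain Python with NumPy, not Triton's own autotuner; `blocked_sum` and `pick_block_size` are hypothetical helper names introduced only for illustration.

```python
import time
import numpy as np

def blocked_sum(x, block_size):
    """Sum an array one block at a time (a stand-in for a blocked kernel)."""
    total = 0.0
    for start in range(0, x.size, block_size):
        total += float(x[start:start + block_size].sum())
    return total

def pick_block_size(x, candidates, repeats=3):
    """Time every candidate block size and return the fastest with all timings."""
    timings = {}
    for block_size in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            blocked_sum(x, block_size)
            best = min(best, time.perf_counter() - t0)
        timings[block_size] = best  # keep the best of `repeats` runs
    return min(timings, key=timings.get), timings

x = np.ones(1 << 14, dtype=np.float32)
best, timings = pick_block_size(x, candidates=[64, 256, 1024])
print("fastest block size on this machine:", best)
```

The winner depends on the machine that runs the benchmark, which is why the choice is measured rather than hard-coded.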
What `tl.arange(0, BLOCK_SIZE)` creates
`tl.arange(0, BLOCK_SIZE)` creates a block of consecutive indices from 0 to BLOCK_SIZE-1. Adding the block's starting offset turns them into the indices covered by the current block.
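The index arithmetic can be emulated in NumPy to see what one program instance works with. This is a sketch, not Triton code: `np.arange` plays the role of `tl.arange`, the `pid` argument stands in for the value a kernel would get from `tl.program_id`, and `block_offsets` is a hypothetical helper name.

```python
import numpy as np

def block_offsets(pid, block_size, n_elements):
    """Emulate the index math of one program instance."""
    local = np.arange(0, block_size)    # 0 .. block_size-1, like tl.arange(0, BLOCK_SIZE)
    offsets = pid * block_size + local  # global indices covered by block `pid`
    mask = offsets < n_elements         # False for lanes past the end of the data
    return offsets, mask

offsets, mask = block_offsets(pid=2, block_size=4, n_elements=10)
print(offsets.tolist())  # [8, 9, 10, 11]
print(mask.tolist())     # [True, True, False, False]
```

With 10 elements and a block size of 4, the third block covers indices 8 to 11, and the mask marks the last two lanes as out of bounds.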
What `tl.dot` does in Triton
`tl.dot` performs block-level matrix multiplication: it multiplies two blocks and returns their product as a block. On GPUs that have tensor cores, it uses them when the data types allow.
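A block-level matrix multiply can be sketched in NumPy as a loop over output tiles, where each tile accumulates products of input blocks. This is a CPU emulation under stated assumptions, not a Triton kernel: `blocked_matmul` is a hypothetical name, the `@` operator stands in for `tl.dot`, and the dimensions are assumed to be divisible by the block sizes.

```python
import numpy as np

def blocked_matmul(a, b, block_m=2, block_n=2, block_k=2):
    """Multiply a (m x k) by b (k x n) one output tile at a time."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    # Assumption of this sketch: every dimension is a multiple of its block size.
    assert m % block_m == 0 and n % block_n == 0 and k % block_k == 0
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, block_m):        # each (i, j) tile plays the role
        for j in range(0, n, block_n):    # of one program instance
            acc = np.zeros((block_m, block_n), dtype=np.float32)
            for p in range(0, k, block_k):
                a_blk = a[i:i + block_m, p:p + block_k]
                b_blk = b[p:p + block_k, j:j + block_n]
                acc += a_blk @ b_blk      # stands in for acc += tl.dot(a_blk, b_blk)
            c[i:i + block_m, j:j + block_n] = acc
    return c

a = np.arange(16, dtype=np.float32).reshape(4, 4)
b = np.ones((4, 4), dtype=np.float32)
print(np.allclose(blocked_matmul(a, b), a @ b))  # True
```

Accumulating over the inner dimension in blocks keeps each step small enough to map onto a fixed-size hardware matrix unit, which is the role `tl.dot` fills on the GPU.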
What a Triton kernel is
A Triton kernel is a GPU function written in Python and decorated with `@triton.jit`. Triton compiles it for the target GPU, producing PTX on NVIDIA hardware.
What `tl.load` and `tl.store` do in Triton
`tl.load` reads a block of values from GPU global memory at the given pointers; `tl.store` writes a block of values back to GPU global memory. Both accept an optional mask so that out-of-bounds elements are skipped.
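Masked loads and stores can be emulated in NumPy to show how a blocked element-wise add handles a length that is not a multiple of the block size. This is a sketch, not Triton code: `masked_load`, `masked_store`, and `add_blocks` are hypothetical helpers, and the `other` fallback value is modeled on the argument of the same name in `tl.load`.

```python
import numpy as np

def masked_load(buf, offsets, mask, other=0.0):
    """Read buf[offsets] where mask is True; use `other` elsewhere."""
    safe = np.where(mask, offsets, 0)  # never index past the end of buf
    return np.where(mask, buf[safe], other)

def masked_store(buf, offsets, values, mask):
    """Write values into buf[offsets] only where mask is True."""
    buf[offsets[mask]] = values[mask]

def add_blocks(x, y, block_size=4):
    """Element-wise add, processed one block per emulated program instance."""
    n = x.size
    out = np.empty_like(x)
    num_programs = -(-n // block_size)  # ceiling division
    for pid in range(num_programs):
        offsets = pid * block_size + np.arange(block_size)
        mask = offsets < n              # guards the partial last block
        xv = masked_load(x, offsets, mask)
        yv = masked_load(y, offsets, mask)
        masked_store(out, offsets, xv + yv, mask)
    return out

x = np.arange(10, dtype=np.float32)
y = np.ones(10, dtype=np.float32)
print(add_blocks(x, y).tolist())
# [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
```

Ten elements with a block size of 4 need three blocks, and the mask keeps the last block from touching indices 10 and 11.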