
Triton uses block-level programming, while CUDA uses thread-level programming
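
The contrast can be sketched in plain Python. This is only an emulation, not Triton or CUDA code: the function names are hypothetical, and the loops stand in for work that a GPU would run in parallel. In real Triton code, `pid`, `offsets`, and `mask` would roughly correspond to `tl.program_id`, `tl.arange`, and the `mask` argument of `tl.load` and `tl.store`.

```python
def add_thread_level(x, y):
    """CUDA-style view: one logical thread handles one element."""
    n = len(x)
    out = [0.0] * n
    for tid in range(n):            # each iteration stands in for one thread
        out[tid] = x[tid] + y[tid]  # scalar work, indexed by thread id
    return out


def add_block_level(x, y, block_size=4):
    """Triton-style view: one program instance handles a block of elements."""
    n = len(x)
    out = [0.0] * n
    num_programs = (n + block_size - 1) // block_size  # ceiling division
    for pid in range(num_programs):  # each iteration stands in for one program
        offsets = [pid * block_size + i for i in range(block_size)]
        mask = [off < n for off in offsets]  # guards the ragged last block
        for off, in_bounds in zip(offsets, mask):
            if in_bounds:
                out[off] = x[off] + y[off]
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
# Both views compute the same result; only the unit of work differs.
assert add_thread_level(x, y) == add_block_level(x, y)
```

In the block-level version the programmer reasons about whole blocks and leaves the mapping of block elements to hardware threads to the compiler.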
What a Triton kernel is
A Triton kernel is a GPU function written in Python; on NVIDIA GPUs, Triton compiles it to PTX.
What tl.dot does in Triton
tl.dot performs a block-level matrix multiplication in Triton, using tensor cores where the hardware and data types support them.
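
To make "block-level matrix multiplication" concrete, here is a minimal pure-Python sketch of a tiled matmul. It is an emulation under stated assumptions, not Triton code: `matmul_tiled` and the `block_m`, `block_n`, `block_k` parameters are hypothetical names, and the innermost tile product is the step that a Triton kernel would express with a single `tl.dot` call on two blocks.

```python
def matmul_tiled(a, b, block_m=2, block_n=2, block_k=2):
    """Multiply a (m x k) by b (k x n) one output tile at a time."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, block_m):      # each (m0, n0) pair stands in for
        for n0 in range(0, n, block_n):  # one program instance
            # Accumulator tile, reused across the whole K loop.
            acc = [[0.0] * block_n for _ in range(block_m)]
            for k0 in range(0, k, block_k):
                # Tile product: the part a block-level dot would perform.
                for i in range(block_m):
                    for j in range(block_n):
                        for p in range(block_k):
                            # Bounds checks play the role of load masks.
                            if m0 + i < m and n0 + j < n and k0 + p < k:
                                acc[i][j] += a[m0 + i][k0 + p] * b[k0 + p][n0 + j]
            # Write the finished tile back, skipping out-of-range entries.
            for i in range(block_m):
                for j in range(block_n):
                    if m0 + i < m and n0 + j < n:
                        c[m0 + i][n0 + j] = acc[i][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(a, b))
```

Keeping the accumulator tile alive across the K loop is the design point: each output tile is written once, after all partial products have been summed.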
Thread block (CUDA programming)
A thread block can contain up to 1024 threads (the limit since March 2010, on devices of compute capability 2.x and later).
What BLOCK_SIZE means in Triton
BLOCK_SIZE is the tile size: the number of elements in the data chunk that each program instance processes.
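
The arithmetic behind this can be sketched as follows. The helper names `cdiv` and `launch_plan` are hypothetical (Triton ships its own `triton.cdiv`); the sketch only shows how a chosen BLOCK_SIZE determines the number of program instances and how many lanes of the last block fall outside the data and must be masked off.

```python
def cdiv(n, d):
    """Ceiling division: smallest integer >= n / d."""
    return (n + d - 1) // d


def launch_plan(n_elements, block_size):
    """Return (number of program instances, masked-off lanes in total)."""
    num_programs = cdiv(n_elements, block_size)
    masked_lanes = num_programs * block_size - n_elements
    return num_programs, masked_lanes


# Larger blocks mean fewer program instances for the same data.
for block_size in (64, 128, 256):
    programs, masked = launch_plan(1000, block_size)
    print(f"BLOCK_SIZE={block_size}: {programs} programs, {masked} masked lanes")
```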
What the @triton.jit decorator does
The @triton.jit decorator marks a Python function for just-in-time compilation into a GPU kernel.
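Triton auto-tunes BLOCK_SIZE: different sizes perform best on different hardware
Triton can auto-tune BLOCK_SIZE for the target hardware: with @triton.autotune, it benchmarks a list of candidate configurations and keeps the fastest, which improves memory access patterns and computational throughput.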
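
The idea of auto-tuning can be illustrated with a small CPU-only sketch: try each candidate block size, measure it, and keep the cheapest. This is not how Triton implements it internally; `pick_best`, `time_once`, and `blocked_sum` are hypothetical helpers, and the workload is a plain Python sum standing in for a GPU kernel. In Triton itself the candidates would be supplied as `triton.Config` objects to `@triton.autotune`.

```python
import time


def time_once(fn, *args, **kwargs):
    """Wall-clock time of a single call, in seconds."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start


def pick_best(candidates, measure):
    """Return the candidate with the lowest measured cost."""
    best, best_cost = None, float("inf")
    for block_size in candidates:
        cost = measure(block_size)
        if cost < best_cost:
            best, best_cost = block_size, cost
    return best


def blocked_sum(data, block_size):
    """Stand-in workload: sum the data one block at a time."""
    total = 0.0
    for start in range(0, len(data), block_size):
        total += sum(data[start:start + block_size])
    return total


data = [1.0] * 100_000
candidates = [64, 256, 1024, 4096]
best = pick_best(candidates, lambda bs: time_once(blocked_sum, data, block_size=bs))
print("fastest BLOCK_SIZE on this machine:", best)
```

The winner depends on the machine it runs on, which is the point of tuning per hardware rather than hard-coding one value.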