How Triton differs from CUDA

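Triton uses block-level programming: each program instance operates on a whole block (tile) of data. CUDA uses thread-level programming: each thread runs scalar code, typically on one element.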
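The contrast can be sketched in plain Python. This is only an emulation that runs on the CPU: the function names (`add_thread_level`, `add_block_level`) and their parameters are made up for illustration and are not Triton or CUDA APIs. The first function is written from the point of view of one thread and one scalar index; the second is written from the point of view of one program instance that handles a whole block of indices with a mask.

```python
# Pure-Python emulation of both models; no GPU, Triton, or CUDA calls are used.
# All names here are hypothetical and chosen only for illustration.

def add_thread_level(x, y, threads_per_block=4):
    """Thread-level style: the body handles ONE element per thread."""
    n = len(x)
    out = [0.0] * n
    num_blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(num_blocks):              # a GPU would run these in parallel
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx  # one scalar index
            if i < n:                                # per-thread bounds check
                out[i] = x[i] + y[i]
    return out


def add_block_level(x, y, block_size=4):
    """Block-level style: the body handles a whole BLOCK per program instance."""
    n = len(x)
    out = [0.0] * n
    num_programs = (n + block_size - 1) // block_size
    for pid in range(num_programs):                  # a GPU would run these in parallel
        offsets = [pid * block_size + k for k in range(block_size)]  # vector of indices
        mask = [o < n for o in offsets]              # guards the ragged last block
        xs = [x[o] if m else 0.0 for o, m in zip(offsets, mask)]     # masked load
        ys = [y[o] if m else 0.0 for o, m in zip(offsets, mask)]     # masked load
        zs = [a + b for a, b in zip(xs, ys)]         # one operation on the whole block
        for o, m, z in zip(offsets, mask, zs):       # masked store
            if m:
                out[o] = z
    return out
```

Both functions return the same result; what differs is the unit the programmer reasons about (a single index versus a block of offsets plus a mask).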

Related concepts

What a Triton kernel is

A Triton kernel is a GPU function written in Python using the Triton language; on NVIDIA GPUs it compiles to PTX

What tl.dot does in Triton: block-level matrix multiply using tensor cores

In Triton, tl.dot multiplies two blocks (tiles) as matrices, using tensor cores where the hardware and data types support them
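To make "block-level matrix multiplication" concrete, here is a minimal CPU-only sketch of a tiled matmul in plain Python. It is an emulation, not Triton code: `matmul_tiled` and the `block_m`, `block_n`, `block_k` parameters are hypothetical names. The innermost accumulation over one K-slice is the step that a block-level dot product such as tl.dot would perform on two tiles at once.

```python
# Pure-Python emulation of a tiled matrix multiply; no Triton calls are used.
# matmul_tiled and its block_* parameters are hypothetical, for illustration only.

def matmul_tiled(a, b, block_m=2, block_n=2, block_k=2):
    """Compute a @ b one output tile at a time, accumulating over K in slices."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, block_m):                  # one "program" per output tile
        for n0 in range(0, n, block_n):
            rows = range(m0, min(m0 + block_m, m))   # clipped at the matrix edge
            cols = range(n0, min(n0 + block_n, n))
            acc = {(i, j): 0.0 for i in rows for j in cols}  # accumulator tile
            for k0 in range(0, k, block_k):
                ks = range(k0, min(k0 + block_k, k))
                # Tile product: this is the step a block-level dot stands for.
                for i in rows:
                    for j in cols:
                        acc[i, j] += sum(a[i][p] * b[p][j] for p in ks)
            for (i, j), value in acc.items():        # write the finished tile
                c[i][j] = value
    return c
```

On a GPU the two outer loops would be separate program instances running in parallel, and only the loop over K-slices would remain inside the kernel.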

Thread block (CUDA programming)

Since March 2010 (compute capability 2.x and higher), a thread block can contain up to 1024 threads

What BLOCK_SIZE means in Triton: the tile size each program instance processes

In Triton, BLOCK_SIZE is the number of elements in the chunk of data that each program instance processes
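As a small worked example of how the block size determines the launch, the sketch below computes how many program instances are needed for a given input length and which index range each one covers. It is plain Python, not Triton code; `cdiv` and `program_ranges` are defined here for illustration (Triton ships a similar ceiling-division helper, `triton.cdiv`).

```python
# Plain-Python arithmetic only; program_ranges is a hypothetical helper.

def cdiv(a, b):
    """Ceiling division: the number of size-b blocks needed to cover a items."""
    return (a + b - 1) // b


def program_ranges(n_elements, block_size):
    """Return (pid, start, stop) for each program instance; stop is exclusive."""
    return [
        (pid, pid * block_size, min((pid + 1) * block_size, n_elements))
        for pid in range(cdiv(n_elements, block_size))
    ]
```

For 10 elements and a block size of 4, three instances are needed and the last one covers only indices 8 and 9, which is why kernels usually mask out-of-range offsets in the final block.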

What the @triton.jit decorator does: compiles a Python function into a GPU kernel

The @triton.jit decorator marks a Python function for just-in-time compilation into a GPU kernel

Why Triton auto-tunes BLOCK_SIZE: different sizes are optimal on different hardware

Triton can auto-tune BLOCK_SIZE because the best value depends on the hardware: the block size affects memory access patterns and computational throughput
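The idea behind auto-tuning can be sketched as "try each candidate, keep the fastest". The snippet below is a simplified illustration in plain Python, not Triton's implementation: `pick_block_size` is a hypothetical helper, and `toy_cost` is a made-up stand-in for a real timing run, so it runs without a GPU. In Triton itself, candidates are supplied to the `@triton.autotune` decorator as a list of `triton.Config` objects.

```python
# Simplified illustration of auto-tuning; not Triton's actual implementation.
# pick_block_size and toy_cost are hypothetical names.

def pick_block_size(candidates, benchmark):
    """Run benchmark(block_size) for each candidate and return the cheapest one."""
    timings = {block_size: benchmark(block_size) for block_size in candidates}
    best = min(timings, key=timings.get)
    return best, timings


def toy_cost(block_size):
    """Made-up cost curve standing in for a measured kernel time (assumption)."""
    return abs(block_size - 256)
```

With a real kernel, `benchmark` would launch the kernel with the given block size and return the measured time, and the winner could differ from one GPU to another.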
