The @triton.jit decorator compiles a Python function into a GPU kernel
Image: Lucasbosch, CC BY-SA 3.0, via Wikimedia Commons
What a Triton kernel is
Triton kernel: a GPU function written in Python, marked with `@triton.jit`, and compiled to PTX on NVIDIA GPUs
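How Triton differs from CUDA
Triton uses block-level programming: each program instance operates on a whole block of elements at once. CUDA uses thread-level programming: each thread typically handles individual elements, and the programmer manages how threads cooperate.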
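To make the block-level model concrete, here is a minimal sketch in plain Python (not Triton, and not GPU code) of how each program instance derives the offsets of its block and a mask for out-of-range elements. The names `BLOCK_SIZE`, `block_offsets`, and `add_blockwise` are hypothetical, chosen only for illustration; on a GPU the program instances would run in parallel rather than in a loop.

```python
BLOCK_SIZE = 4  # hypothetical block size, kept tiny for illustration


def block_offsets(pid, n_elements, block_size=BLOCK_SIZE):
    """Offsets and bounds mask for program instance `pid`.

    Mirrors the common Triton idiom pid * BLOCK_SIZE + arange(0, BLOCK_SIZE).
    """
    offsets = [pid * block_size + i for i in range(block_size)]
    mask = [off < n_elements for off in offsets]
    return offsets, mask


def add_blockwise(x, y):
    """Add two equal-length lists one block at a time."""
    n = len(x)
    out = [0] * n
    num_programs = (n + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    for pid in range(num_programs):  # sequential stand-in for the launch grid
        offsets, mask = block_offsets(pid, n)
        for off, in_bounds in zip(offsets, mask):
            if in_bounds:  # the mask protects the partial last block
                out[off] = x[off] + y[off]
    return out
```

With 6 elements and a block size of 4, two program instances are needed, and the second one has its last two positions masked off.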
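What tl.load and tl.store do in Triton: read/write blocks of values from/to GPU global memory
`tl.load` reads a block of values from GPU global memory at the given pointers; `tl.store` writes a block of values back to GPU global memory. Both accept an optional `mask` so that out-of-bounds positions are skipped.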
What tl.dot does in Triton: block-level matrix multiply using tensor cores
`tl.dot` multiplies two 2D blocks (tiles) inside a kernel and can map to tensor cores when the hardware and data types support them
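As an illustration of `tl.dot`, the following is a minimal, untuned sketch of a tiled matrix multiply. It assumes a CUDA-capable GPU with PyTorch and Triton installed; the names `matmul_kernel` and `matmul` and the 32x32x32 tile sizes are hypothetical choices for the example, not a recommended configuration.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        k_idx = k + offs_k
        a_mask = (offs_m[:, None] < M) & (k_idx[None, :] < K)
        b_mask = (k_idx[:, None] < K) & (offs_n[None, :] < N)
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + k_idx[None, :] * stride_ak,
                    mask=a_mask, other=0.0)
        b = tl.load(b_ptr + k_idx[:, None] * stride_bk + offs_n[None, :] * stride_bn,
                    mask=b_mask, other=0.0)
        acc += tl.dot(a, b)  # block-level multiply-accumulate

    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn,
             acc, mask=c_mask)


def matmul(a, b):
    """Host-side wrapper (hypothetical): returns a @ b as a float32 tensor."""
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, "inner dimensions must match"
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 32), triton.cdiv(N, 32))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=32, BLOCK_N=32, BLOCK_K=32)
    return c
```

The masks with `other=0.0` pad partial tiles with zeros, so shapes that are not multiples of the tile size still produce correct sums.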
How to write a vector addition kernel in Triton: load blocks, add, store
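The load, add, store pattern can be sketched as follows. This is a minimal example that assumes a CUDA-capable GPU with PyTorch and Triton installed; the names `add_kernel` and `add` and the block size of 1024 are illustrative choices, not the only way to write it.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the partial last block

    x = tl.load(x_ptr + offsets, mask=mask)        # load blocks
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)  # add, then store


def add(x, y):
    """Host-side wrapper (hypothetical): launches one program per block."""
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = (triton.cdiv(n_elements, 1024),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

Calling `add(x, y)` on two contiguous CUDA tensors of the same shape should match `x + y`; the mask is what makes lengths that are not a multiple of the block size safe.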