In Triton, `tl.dot` multiplies two blocks (tiles) as matrices; the compiler lowers it to tensor core instructions where the hardware and data types support them
Image: Daniel L. Lu (user:dllu), CC BY-SA 4.0, via Wikimedia Commons
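To make the role of `tl.dot` concrete, here is a minimal sketch of a tiled matrix multiplication kernel. The names `matmul_kernel` and `matmul`, the 32x32x32 block sizes, and the float32 output are illustrative assumptions, not a tuned implementation. It assumes a CUDA-capable GPU with PyTorch and Triton installed.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # One program instance computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # Accumulate in float32 regardless of the input precision.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        k_idx = k + offs_k
        # Masked loads return 0.0 outside the matrix, which leaves the sum unchanged.
        a = tl.load(
            a_ptr + offs_m[:, None] * stride_am + k_idx[None, :] * stride_ak,
            mask=(offs_m[:, None] < M) & (k_idx[None, :] < K), other=0.0)
        b = tl.load(
            b_ptr + k_idx[:, None] * stride_bk + offs_n[None, :] * stride_bn,
            mask=(k_idx[:, None] < K) & (offs_n[None, :] < N), other=0.0)
        # Block-level matrix product of the two tiles.
        acc += tl.dot(a, b)

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))


def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Hypothetical host wrapper: launches one program per output tile."""
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, "inner dimensions must match"
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    BLOCK_M, BLOCK_N, BLOCK_K = 32, 32, 32  # assumed sizes, not tuned
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K)
    return c
```

Calling `matmul(a, b)` on two float16 CUDA tensors should agree with `a.float() @ b.float()` up to rounding. Whether tensor cores are actually used depends on the GPU and the input data types.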
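How tensor cores do 4x4 matrix multiply in one clock cycle
On NVIDIA's Volta architecture, each tensor core performs one 4x4x4 matrix multiply-accumulate (D = A×B + C, 64 fused multiply-adds) per clock cycle. Larger GEMM (General Matrix Multiply) operations are built by tiling over these small multiply-accumulates; later architectures process larger tiles per instruction.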
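The arithmetic of a single tensor core step can be emulated on the CPU. The sketch below is plain Python, not GPU code, and the helper names `mma_4x4x4`, `tile`, and `tiled_matmul` are made up for illustration. It shows how a larger matrix product decomposes into a chain of 4x4 multiply-accumulates, assuming all dimensions are multiples of 4.

```python
def mma_4x4x4(a, b, c):
    """Return d = a @ b + c for 4x4 tiles stored as lists of rows (64 multiply-adds)."""
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def tile(m, row, col):
    """Extract the 4x4 tile of m whose top-left corner is (row, col)."""
    return [r[col:col + 4] for r in m[row:row + 4]]


def tiled_matmul(a, b):
    """Multiply a (M x K) by b (K x N) by chaining 4x4x4 multiply-accumulates."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(0, m, 4):
        for j in range(0, n, 4):
            acc = [[0.0] * 4 for _ in range(4)]
            for p in range(0, k, 4):
                # The accumulator from the previous step is the C operand of the next.
                acc = mma_4x4x4(tile(a, i, p), tile(b, p, j), acc)
            for di in range(4):
                out[i + di][j:j + 4] = acc[di]
    return out
```

For an 8x8 by 8x8 product, `tiled_matmul` issues 8 emulated multiply-accumulates (2 per output tile, 4 output tiles), which is the same tiling idea a GEMM kernel applies at much larger scale.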
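What tl.load and tl.store do in Triton: read/write tensors from/to GPU global memory
`tl.load` reads a block of values from GPU global memory at the given pointers; `tl.store` writes a block of values back to global memory. Both accept an optional `mask` so that out-of-bounds elements are skipped.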
What a Triton kernel is
A Triton kernel is a Python function decorated with `@triton.jit`; the Triton compiler lowers it to GPU code (PTX on NVIDIA GPUs).
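How Triton differs from CUDA
Triton uses block-level programming: each program instance operates on a whole block of data, and the compiler handles the mapping to threads. CUDA uses thread-level programming: the kernel is written from the point of view of a single thread.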
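The difference between the two models can be sketched with a CPU emulation. This is plain Python, not GPU code, and the function names are hypothetical. The first function mirrors the thread-level view, where the kernel body computes one scalar index. The second mirrors the block-level view, where the body works on a whole vector of offsets with a mask.

```python
def thread_level_scale(x, factor, threads_per_block):
    """Thread-level view: the innermost body handles exactly one element."""
    n = len(x)
    out = [0.0] * n
    num_blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(num_blocks):              # blocks in the grid
        for thread_idx in range(threads_per_block):  # threads within a block
            i = block_idx * threads_per_block + thread_idx  # scalar global index
            if i < n:                                # per-thread bounds check
                out[i] = x[i] * factor
    return out


def block_level_scale(x, factor, block_size):
    """Block-level view: the body handles a whole block of offsets at once."""
    n = len(x)
    out = [0.0] * n
    num_programs = (n + block_size - 1) // block_size
    for pid in range(num_programs):                  # one program per block
        offsets = [pid * block_size + j for j in range(block_size)]
        mask = [o < n for o in offsets]              # guards the tail block
        block = [x[o] if m else 0.0 for o, m in zip(offsets, mask)]  # masked load
        result = [v * factor for v in block]         # whole-block arithmetic
        for o, m, v in zip(offsets, mask, result):   # masked store
            if m:
                out[o] = v
    return out
```

Both functions compute the same result. What changes is the unit the programmer reasons about: a single index in the first case, a block of offsets plus a mask in the second.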
How to write a vector addition kernel in Triton: load blocks, add, store
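The load, add, store pattern can be sketched as follows. The names `add_kernel` and `add` and the block size of 1024 are illustrative assumptions. The snippet assumes a CUDA-capable GPU with PyTorch and Triton installed.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # The mask guards the last block when n_elements is not a multiple of BLOCK_SIZE.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)         # load a block of x
    y = tl.load(y_ptr + offsets, mask=mask)         # load a block of y
    tl.store(out_ptr + offsets, x + y, mask=mask)   # add and store the result


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Hypothetical host wrapper: launches one program per block of elements."""
    out = torch.empty_like(x)
    n_elements = out.numel()
    block_size = 1024  # assumed value, not tuned
    grid = (triton.cdiv(n_elements, block_size),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=block_size)
    return out
```

With two CUDA tensors of equal length, `add(x, y)` should match `x + y`. The wrapper assumes contiguous 1-D inputs on the same device.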
What tensor cores are
Tensor cores are specialized hardware units on NVIDIA GPUs that perform matrix multiply-accumulate operations.