What tl.load and tl.store do in Triton: read/write tensors from/to GPU global memory

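`tl.load` reads a block of values from GPU global memory into a tensor; `tl.store` writes a tensor of values back to GPU global memory. Both take a tensor of pointers and an optional mask that guards out-of-bounds elements.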
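As a minimal sketch of how the two calls pair up, the kernel below copies the first `n_elements` values from one buffer to another. The names `copy_kernel`, `masked_copy`, and `BLOCK_SIZE` are made up for illustration, and running it assumes a CUDA-capable GPU with PyTorch and Triton installed.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of offsets.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # The mask disables lanes that fall past the end of the valid range.
    mask = offsets < n_elements
    # Masked-off lanes are not read; they receive the `other` value instead.
    values = tl.load(src_ptr + offsets, mask=mask, other=0.0)
    # Masked-off lanes are not written, so dst keeps its old contents there.
    tl.store(dst_ptr + offsets, values, mask=mask)


def masked_copy(src, dst, n_elements, block_size=256):
    # Hypothetical host-side wrapper: one program per block of elements.
    grid = (triton.cdiv(n_elements, block_size),)
    copy_kernel[grid](src, dst, n_elements, BLOCK_SIZE=block_size)
    return dst
```

Because the store is masked, elements of `dst` at or beyond `n_elements` are left untouched, which is the usual way to handle sizes that are not a multiple of the block size.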

Related concepts

What tl.dot does in Triton: block-level matrix multiply using tensor cores

`tl.dot` performs a block-level matrix multiplication of two 2D blocks, which the Triton compiler can map onto tensor cores on supported hardware.
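A small sketch of `tl.dot` on a single square block, assuming contiguous row-major inputs and a CUDA-capable GPU. The names `block_matmul_kernel`, `block_matmul`, and `BLOCK` are illustrative, not part of Triton. Whether tensor cores are actually used depends on the GPU and the input dtype.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def block_matmul_kernel(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    # Build a 2D grid of offsets for one BLOCK x BLOCK row-major tile.
    rows = tl.arange(0, BLOCK)[:, None]
    cols = tl.arange(0, BLOCK)[None, :]
    offsets = rows * BLOCK + cols
    a = tl.load(a_ptr + offsets)
    b = tl.load(b_ptr + offsets)
    # Multiply the two tiles; fp16 inputs accumulate in fp32 by default.
    c = tl.dot(a, b)
    tl.store(c_ptr + offsets, c)


def block_matmul(a, b):
    # Hypothetical wrapper: assumes square, contiguous fp16 inputs whose
    # side length is a power of two and at least 16.
    block = a.shape[0]
    c = torch.empty((block, block), dtype=torch.float32, device=a.device)
    block_matmul_kernel[(1,)](a, b, c, BLOCK=block)
    return c
```

A full matrix multiply would loop over tiles along the shared dimension and accumulate the partial products; this sketch keeps to one tile to isolate the `tl.dot` call.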

How to write a vector addition kernel in Triton: load blocks, add, store
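The load, add, store pattern can be sketched as follows. This follows the shape of the standard Triton vector-addition tutorial. The block size of 1024 and the wrapper name `add` are choices made here for illustration, and the code assumes both inputs are contiguous CUDA tensors of the same length.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # Guard the final block when n_elements is not a multiple of BLOCK_SIZE.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x, y):
    out = torch.empty_like(x)
    n_elements = out.numel()
    # Launch enough program instances to cover every element.
    grid = (triton.cdiv(n_elements, 1024),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

Calling `add(x, y)` on two CUDA tensors should match `x + y`. The `kernel[grid](...)` syntax is how a `@triton.jit` function is launched.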

What a Triton kernel is

A Triton kernel is a GPU function written in Python that the Triton compiler lowers to PTX for NVIDIA GPUs.

What the Triton @triton.jit decorator does: compiles a Python function into a GPU kernel

The `@triton.jit` decorator compiles a Python function into a GPU kernel.

How Triton differs from CUDA

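Triton uses block-level programming, where each program instance operates on a whole block of data, while CUDA uses thread-level programming, where each thread handles individual elements.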

CPU cache

The L1/L2 cache hierarchy reduces the effective latency of global memory accesses.
