XLA (Accelerated Linear Algebra) is a compiler that optimizes computation graphs for execution on TPUs and GPUs
A Triton kernel is a GPU function written in Triton's Python-based language; on NVIDIA GPUs it compiles to PTX
Tensor cores are specialized GPU hardware units that perform matrix multiply-accumulate operations (D = A × B + C)
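
To make "matrix multiply-accumulate" concrete, here is a minimal pure-Python sketch of the arithmetic on one small tile. The helper name `mma_tile` and the 4×4 tile size are illustrative assumptions; a real tensor core does this in hardware on fixed-size tiles, and this loop only shows what is computed, not how fast.

```python
def mma_tile(a, b, c):
    """Return d = a @ b + c for small square tiles given as lists of lists."""
    n = len(a)
    d = [row[:] for row in c]  # start from a copy of the accumulator tile C
    for i in range(n):
        for j in range(n):
            for k in range(n):
                d[i][j] += a[i][k] * b[k][j]  # multiply, then accumulate
    return d


# Illustrative 4x4 inputs: A holds 0..15, B is the identity, C is all ones.
a = [[float(4 * i + j) for j in range(4)] for i in range(4)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]

d = mma_tile(a, identity, c)  # equals A + 1 element-wise, since B is the identity
```

Because B is the identity here, the result is just A with the accumulator added, which makes the "accumulate" half of the operation easy to see.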
`tl.load` reads a block of values from GPU global memory into a Triton kernel; `tl.store` writes a block of values back to GPU global memory
`torch.compile`, introduced in PyTorch 2.0, traces a model's computation graph and compiles it into optimized code
The `@triton.jit` decorator compiles a Python function into a GPU kernel
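
As a rough illustration of how `@triton.jit`, `tl.load`, and `tl.store` fit together, the sketch below adds two vectors element-wise. The names `add_kernel`, `add`, and `CAN_RUN`, and the block size of 1024, are assumptions made for this example. It only launches the kernel when PyTorch, Triton, and a CUDA GPU are all available.

```python
import importlib.util

# Launching the kernel needs PyTorch, Triton, and a CUDA GPU; check before importing.
CAN_RUN = all(importlib.util.find_spec(m) is not None for m in ("torch", "triton"))

if CAN_RUN:
    import torch
    import triton
    import triton.language as tl

    CAN_RUN = torch.cuda.is_available()

    # The decorator turns this Python function into a JIT-compiled GPU kernel.
    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                    # which block this instance handles
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                    # guard the final partial block
        x = tl.load(x_ptr + offsets, mask=mask)        # read from GPU global memory
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)  # write back to global memory

    def add(x, y):
        """Hypothetical wrapper: launch one kernel instance per 1024-element block."""
        out = torch.empty_like(x)
        n = out.numel()
        grid = (triton.cdiv(n, 1024),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

if CAN_RUN:
    x = torch.rand(5000, device="cuda")
    y = torch.rand(5000, device="cuda")
    print(torch.allclose(add(x, y), x + y))
```

The mask matters because 5000 is not a multiple of the block size: without it, the last kernel instance would read and write past the end of the tensors.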
Arm architecture family
Arm is a family of RISC instruction set architectures, and the most widely used such family