In NVIDIA's original Volta design, each tensor core performs a 4x4 matrix multiply-accumulate (D = A x B + C) per clock cycle, the building block of optimized GEMM (General Matrix Multiply) routines
Image: FMNLab, CC BY 4.0, via Wikimedia Commons
Tensor cores are specialized GPU hardware units for matrix multiply-accumulate operations
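To make "matrix multiply-accumulate" concrete, here is a minimal pure-Python sketch of the operation D = A x B + C on 4x4 matrices. The function name `mma_4x4` and the list-of-rows layout are illustrative assumptions; real tensor cores do this in hardware on low-precision operands, not in a scalar loop.

```python
def mma_4x4(a, b, c):
    """Return D = A x B + C for 4x4 matrices given as lists of rows.

    Hypothetical helper for illustration only; it mimics the arithmetic
    of a multiply-accumulate, not the hardware behavior.
    """
    n = 4
    # Start from a copy of the accumulator C.
    d = [[c[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # 4 * 4 * 4 = 64 fused multiply-adds in total.
                d[i][j] += a[i][k] * b[k][j]
    return d


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
a = [[i * 4 + j for j in range(4)] for i in range(4)]
ones = [[1] * 4 for _ in range(4)]

# A x I + ones adds 1 to every element of A.
d = mma_4x4(a, identity, ones)
```

Multiplying by the identity and accumulating a matrix of ones is a convenient sanity check, because every output element should equal the matching element of `a` plus one.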
In Triton, tl.dot performs block-level matrix multiplication and maps onto tensor cores when the hardware and data types support it
Quantizing to INT8 can roughly double peak matrix throughput relative to FP16, because tensor cores on many NVIDIA GPUs process INT8 operands at about twice the FP16 rate
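As a rough illustration of what INT8 quantization involves, the sketch below uses symmetric per-tensor scaling, which is one common scheme and an assumption here rather than something the card specifies. The helper names `quantize_int8` and `int8_dot` are hypothetical. It shows the arithmetic only and says nothing about speed.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Returns the quantized integers and the scale needed to map them
    back to floats (value ~= q * scale).
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def int8_dot(qa, scale_a, qb, scale_b):
    """Dot product on quantized inputs with integer accumulation."""
    # Products are summed as plain integers, mirroring the wide
    # integer accumulator used for INT8 matrix math.
    acc = sum(x * y for x, y in zip(qa, qb))
    # Rescale once at the end to recover an approximate float result.
    return acc * scale_a * scale_b


a = [0.5, -1.0, 0.25, 0.75]
b = [1.0, 0.5, -0.5, 0.25]

qa, scale_a = quantize_int8(a)
qb, scale_b = quantize_int8(b)

approx = int8_dot(qa, scale_a, qb, scale_b)
exact = sum(x * y for x, y in zip(a, b))
```

The quantized result `approx` should land close to `exact`, with a small error introduced by rounding each value to one of 255 integer levels.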
Matrix multiplication algorithm
Tiling divides matrices into smaller blocks that are loaded into shared memory and reused for many multiply-adds, cutting traffic to slower global memory
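The loop structure behind tiling can be sketched in plain Python. This is a CPU-side illustration under stated assumptions: the function names and the `tile` parameter are hypothetical, and the comment marks where a GPU kernel would stage blocks in shared memory.

```python
def matmul_naive(a, b):
    """Reference triple-loop multiply of an n x k by a k x m matrix."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Same product, computed one tile x tile block at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # block row of C
        for j0 in range(0, m, tile):      # block column of C
            for k0 in range(0, k, tile):  # step along the shared dimension
                # On a GPU, the A block (rows i0.., cols k0..) and the
                # B block (rows k0.., cols j0..) would be copied into
                # shared memory here and reused by the inner loops.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


# Shapes that are not multiples of the tile size exercise the edge handling.
a = [[i + j for j in range(5)] for i in range(3)]   # 3 x 5
b = [[i * j for j in range(4)] for i in range(5)]   # 5 x 4

c_tiled = matmul_tiled(a, b, tile=2)
c_ref = matmul_naive(a, b)
```

The tiled version does the same arithmetic as the naive one. The benefit on real hardware comes from each loaded block being reused across a whole tile of outputs.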
Computational complexity of matrix multiplication
Naive multiplication of two n x n matrices takes O(n³) scalar multiply-add operations
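A quick way to see the cubic growth is to count the multiply-adds in the textbook triple loop. The helper `matmul_count` below is a hypothetical illustration, not a tuned implementation.

```python
def matmul_count(a, b):
    """Multiply two n x n matrices and count scalar multiply-adds."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    ops = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
                ops += 1  # one multiply-add per innermost iteration
    return c, ops


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


# Doubling n multiplies the operation count by 8.
counts = {n: matmul_count(identity(n), identity(n))[1] for n in (2, 4, 8)}
# counts == {2: 8, 4: 64, 8: 512}
```

Each of the n² output elements needs n multiply-adds, which is where the n³ total comes from.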
Instruction-level parallelism (ILP) lets a processor execute multiple independent instructions per clock cycle