Tensor cores perform a 4x4 matrix multiply-accumulate, the building block of GEMM (General Matrix Multiply), in one clock cycle

Image: FMNLab, CC BY 4.0, via Wikimedia Commons


Related concepts

Tensor cores are specialized GPU hardware units for matrix multiply-accumulate operations
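To make "matrix multiply-accumulate" concrete, here is a minimal pure-Python sketch of the operation D = A×B + C on 4x4 tiles. It only emulates the arithmetic. The function name `mma_4x4` and the constant `TILE` are hypothetical names chosen for illustration, not part of any GPU API.

```python
# Software emulation of a 4x4 multiply-accumulate: D = A*B + C.
# Illustrative only; real tensor cores do this in hardware.
TILE = 4  # tile edge length (assumed, matching the 4x4 size in the text)


def mma_4x4(a, b, c):
    """Return D = A*B + C for 4x4 matrices given as lists of rows."""
    d = [[0.0] * TILE for _ in range(TILE)]
    for i in range(TILE):
        for j in range(TILE):
            acc = c[i][j]  # start from the accumulator input C
            for k in range(TILE):
                acc += a[i][k] * b[k][j]  # one multiply-add per step
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(TILE)] for i in range(TILE)]
ones = [[1.0] * TILE for _ in range(TILE)]
result = mma_4x4(identity, ones, ones)  # identity*ones + ones
```

The three nested loops perform 4 × 4 × 4 = 64 multiply-adds, which is the work a single 4x4 multiply-accumulate represents.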

In Triton, tl.dot performs block-level matrix multiplication, using tensor cores where the hardware and data types support them

Quantization to INT8 can double throughput relative to FP16 because tensor cores that support INT8 process it at up to 2x the FP16 rate
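As a rough illustration of what INT8 quantization involves, the sketch below quantizes two small vectors with a symmetric per-tensor scale, accumulates their dot product in integers, and rescales the result. The scheme and the helper names `quantize_int8` and `int8_dot` are assumptions made for this example. It shows only the numerics, not the speedup.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: return (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric INT8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def int8_dot(x, y):
    """Approximate dot(x, y) using INT8 codes and an integer accumulator."""
    qx, sx = quantize_int8(x)
    qy, sy = quantize_int8(y)
    acc = sum(a * b for a, b in zip(qx, qy))  # integer accumulation
    return acc * sx * sy  # rescale back to real-valued units


x = [0.5, -1.0, 0.25, 2.0]
y = [1.0, 0.5, -2.0, 0.75]
exact = sum(a * b for a, b in zip(x, y))
approx = int8_dot(x, y)
```

The quantized result is close to the exact one but not identical. That small error is the accuracy cost paid for the higher throughput.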

Matrix multiplication algorithm

Tiling divides matrices into smaller blocks, loading them into shared memory for efficient matrix multiplication
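The tiling idea can be sketched on the CPU in plain Python. The loop structure below walks over blocks of the output and of the shared dimension. On a GPU, the two input tiles for each step would be staged in shared memory, which the comments mark but the code does not model. The names `matmul_tiled` and `matmul_naive` and the default tile size of 2 are hypothetical choices for this example.

```python
def matmul_naive(a, b):
    """Reference result: plain row-by-column multiplication."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matmul_tiled(a, b, tile=2):
    """Multiply a (n x k) by b (k x m) one tile-sized block at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # block row of the output
        for j0 in range(0, m, tile):      # block column of the output
            for k0 in range(0, k, tile):  # block along the shared dimension
                # A GPU kernel would load the A tile (rows i0.., cols k0..)
                # and the B tile (rows k0.., cols j0..) into shared memory here.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
tiled = matmul_tiled(a, b, tile=2)
```

The tiled version produces the same result as the naive one. The benefit on real hardware comes from reusing each loaded tile many times before fetching the next.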

Computational complexity of matrix multiplication

Naive multiplication of two n×n matrices takes O(n³) operations
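The cubic cost follows directly from the three nested loops of the textbook algorithm. This minimal sketch counts scalar multiplications for square inputs. The function name `matmul_naive_counted` is hypothetical and used only for illustration.

```python
def matmul_naive_counted(a, b):
    """Multiply two n x n matrices and count scalar multiplications."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    multiplies = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
                multiplies += 1
    return c, multiplies


product, count = matmul_naive_counted([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

For n = 2 the loop body runs 2³ = 8 times. Doubling n multiplies the count by 8.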

Instruction-level parallelism (ILP) achieves multiple operations per clock cycle by executing independent instructions simultaneously
