
Tiling divides matrices into smaller blocks, loading them into shared memory for efficient matrix multiplication
Image: Mnbayazit, Public domain, via Wikimedia Commons
In tiled matrix multiplication, each block is loaded into fast shared memory once and reused for many multiply-adds, which cuts traffic to slower global memory
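The blocking idea can be sketched in plain Python. This is an illustration only: the function name `matmul_tiled` and the `tile` parameter are made up for this example, and on a real GPU the two tiles used at each step would be staged in shared memory rather than read from ordinary lists.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply two square matrices (nested lists) one tile at a time."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):              # tile row of the output
        for j0 in range(0, n, tile):          # tile column of the output
            for k0 in range(0, n, tile):      # step along the shared dimension
                # A GPU kernel would copy the A and B tiles for this step
                # into shared memory here, then reuse them for the loops below.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

The `min(...)` bounds let the sketch handle sizes that are not a multiple of the tile size, which a real kernel would typically handle with masking.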
Computational complexity of matrix multiplication
Naive multiplication of two n×n matrices takes O(n³) operations: n multiply-adds for each of the n² output entries
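A minimal sketch of where the cubic cost comes from, assuming square matrices stored as nested lists. The helper name `matmul_naive` and its multiplication counter are illustrative additions, not part of any library.

```python
def matmul_naive(a, b):
    """Textbook triple loop; also counts scalar multiplications."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    mults = 0
    for i in range(n):            # n rows of the output
        for j in range(n):        # n columns of the output
            for k in range(n):    # n multiply-adds per output entry
                c[i][j] += a[i][k] * b[k][j]
                mults += 1
    return c, mults
```

Doubling n multiplies the counter by eight, which is the n³ growth in practice.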
Attention (machine learning)
FlashAttention speeds up attention by tiling the computation across the input sequence, so the full N×N attention matrix is never materialized
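The tiling trick rests on a running ("online") softmax. The sketch below is a simplified illustration for a single query vector in pure Python, not the actual GPU kernel; the function names and the `block` parameter are assumptions made for this example. It visits keys and values one block at a time and rescales its running statistics, so only one block of scores exists at any moment.

```python
import math

def attention_reference(q, keys, values):
    """Standard attention for one query: computes every score up front."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def attention_tiled(q, keys, values, block=2):
    """Same result, visiting keys/values one block at a time."""
    dim = len(values[0])
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * dim             # unnormalized weighted sum of values
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier statistics to the new maximum (factor is 0.0 on the first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [x * scale for x in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [x / running_sum for x in acc]
```

Both functions agree up to floating-point rounding, but the tiled version never holds more than `block` scores at once.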
Triton auto-tunes BLOCK_SIZE: different sizes optimize for different hardware
Triton's autotuner benchmarks candidate BLOCK_SIZE values and keeps the fastest for the target hardware, since the best size depends on memory access patterns and computational throughput
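The benchmark-and-pick idea can be sketched without a GPU. The code below does not use Triton; `blocked_sum` is a toy stand-in for a kernel and `autotune` is a hypothetical helper written for this example. In Triton itself, the candidates are supplied to the `triton.autotune` decorator as a list of `triton.Config` objects.

```python
import time

def blocked_sum(data, block_size):
    """Toy 'kernel': sum a list in chunks of block_size."""
    total = 0
    for start in range(0, len(data), block_size):
        total += sum(data[start:start + block_size])
    return total

def autotune(kernel, data, candidates, repeats=3):
    """Time each candidate block size and return the fastest one."""
    best_size, best_time = None, float("inf")
    for size in candidates:
        elapsed = float("inf")
        for _ in range(repeats):              # keep the best of a few runs
            t0 = time.perf_counter()
            kernel(data, size)
            elapsed = min(elapsed, time.perf_counter() - t0)
        if elapsed < best_time:
            best_size, best_time = size, elapsed
    return best_size

data = list(range(100_000))
best = autotune(blocked_sum, data, candidates=[16, 256, 4096])
```

Which candidate wins depends on the machine, which is the reason for measuring rather than hard-coding a size.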
Overlapping subproblems
Dynamic programming handles overlapping subproblems by storing each subproblem's result, so every subproblem is computed only once
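Fibonacci numbers are a common illustration of overlapping subproblems. The sketch below compares a plain recursion with a memoized one; the call counters are added only to make the saved work visible.

```python
def fib_naive(n, counter):
    """Plain recursion: recomputes the same subproblems many times."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)

def fib_memo(n, memo, counter):
    """Memoized recursion: each subproblem is solved once, then looked up."""
    counter[0] += 1
    if n in memo:
        return memo[n]
    if n < 2:
        memo[n] = n
    else:
        memo[n] = fib_memo(n - 1, memo, counter) + fib_memo(n - 2, memo, counter)
    return memo[n]

naive_calls, memo_calls = [0], [0]
a = fib_naive(20, naive_calls)
b = fib_memo(20, {}, memo_calls)
```

Both versions return the same value, but the memoized one makes a number of calls that grows linearly with n instead of exponentially.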
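Tensor cores do a 4x4 matrix multiply in one clock cycle
Tensor cores, as introduced in NVIDIA's Volta architecture, perform a 4x4 matrix multiply-accumulate (D = A×B + C) in one clock cycle; larger GEMM (General Matrix Multiply) operations are built from these small blocks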
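In software terms, the operation a tensor core carries out corresponds to the small function below. This is a plain-Python sketch of the arithmetic only; the name `mma_4x4` is made up for this example, and the hardware performs the 64 multiply-adds as a fused step rather than in loops.

```python
def mma_4x4(a, b, c):
    """Return D = A x B + C for 4x4 matrices given as nested lists."""
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]
```

A large GEMM can be expressed as many such block operations, with C carrying the partial sums from one step to the next.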
Dynamic random-access memory
DRAM requires periodic refreshing to maintain data integrity