GQA reduces KV-cache memory by dividing storage by the group size (the number of query heads that share each KV head)
CPU cache
The L1/L2 cache hierarchy hides main-memory latency by keeping recently used data close to the cores
grouped query attention (GQA)
GQA shares each KV head across a group of query heads, so fewer key and value tensors have to be stored and read per token
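The saving can be checked with a little arithmetic. The sketch below is illustrative only: the model shape (32 layers, 32 query heads, head dimension 128, fp16) and the helper names `kv_cache_bytes` and `kv_head_for` are assumptions, not taken from any particular model or library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading factor 2 covers the separate key and value tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def kv_head_for(q_head, num_q_heads, num_kv_heads):
    """Index of the KV head that a given query head reads from."""
    group_size = num_q_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


# Hypothetical shape: 32 layers, 32 query heads, head_dim 128, fp16 (2 bytes).
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
reduction = mha // gqa  # equals 32 // 8, the group size
mapping = [kv_head_for(q, 32, 8) for q in range(32)]
```

With 8 KV heads serving 32 query heads, the cache shrinks by a factor of 4, which is the group size. Taken to the extreme of a single KV head, this becomes multi-query attention.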
paged attention (vLLM) improves serving throughput
Paged attention (vLLM) improves serving throughput by storing the KV cache in fixed-size, non-contiguous pages; this cuts memory fragmentation and over-reservation, so more requests fit in each batch
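The bookkeeping behind non-contiguous pages can be sketched with a toy block table. This is a minimal illustration, not vLLM's actual API: the class `PagedKVCache`, its methods, and the tiny block size are all hypothetical, and the sketch tracks only which physical block holds each token, not the tensors themselves.

```python
class PagedKVCache:
    """Toy block-table bookkeeping (hypothetical; not vLLM's API)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (block id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(4):
    cache.append_token("a")     # fills physical blocks 0 and 1
cache.append_token("b")         # sequence "b" takes block 2
slot = cache.append_token("a")  # "a" continues in block 3, not adjacent to 1
```

Sequence "a" ends up with the block table [0, 1, 3]. Memory is claimed one block at a time as the sequence grows, so nothing is reserved up front for a maximum length that may never be reached.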
KV-cache reduces redundant computation in autoregressive generation
The KV cache stores the keys and values of previously processed tokens, so each autoregressive step computes them only for the newest token instead of recomputing the whole prefix
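A small single-head example makes the saved work concrete. It is a sketch under simplifying assumptions: `project` is a stand-in for learned key/value projections (a plain scaling), the token vector doubles as its own query, and all function names are hypothetical.

```python
import math


def project(x, w):
    """Stand-in for a learned key or value projection: scale each feature."""
    return [xi * w for xi in x]


def attend(q, keys, values):
    """Single-head softmax attention for one query over the given keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def decode_with_cache(tokens, wk=0.5, wv=2.0):
    keys, values, outputs, projections = [], [], [], 0
    for x in tokens:
        keys.append(project(x, wk))    # only the new token is projected
        values.append(project(x, wv))
        projections += 1
        outputs.append(attend(x, keys, values))
    return outputs, projections


def decode_without_cache(tokens, wk=0.5, wv=2.0):
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [project(x, wk) for x in prefix]    # whole prefix recomputed
        values = [project(x, wv) for x in prefix]
        projections += t
        outputs.append(attend(prefix[-1], keys, values))
    return outputs, projections


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cached_out, cached_work = decode_with_cache(tokens)
naive_out, naive_work = decode_without_cache(tokens)
```

Both paths produce the same attention outputs. For these 4 tokens the cached path does 4 key/value projections and the uncached path does 1 + 2 + 3 + 4 = 10, a gap that grows quadratically with sequence length.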
kernel fusion reduces memory bandwidth bottleneck
Kernel fusion relieves the memory-bandwidth bottleneck by combining multiple operations into a single kernel, so intermediate results stay in registers or shared memory instead of making a round trip through global memory
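The effect on memory traffic can be modeled in plain Python. This is a conceptual sketch rather than GPU code: the scale, bias, and ReLU chain is an arbitrary example, and the traffic figures simply count element loads and stores under the assumption that every intermediate list lives in global memory.

```python
def unfused(x, scale, bias):
    """Three separate passes; each one materializes a full intermediate."""
    n = len(x)
    scaled = [v * scale for v in x]        # pass 1: read x, write scaled
    shifted = [v + bias for v in scaled]   # pass 2: read scaled, write shifted
    out = [max(v, 0.0) for v in shifted]   # pass 3: read shifted, write out
    return out, 3 * (n + n)                # modeled element loads + stores


def fused(x, scale, bias):
    """One pass; intermediates never leave the 'registers' of the loop body."""
    n = len(x)
    out = [max(v * scale + bias, 0.0) for v in x]  # read x once, write out once
    return out, n + n


x = [-2.0, -0.5, 0.0, 1.5]
out_a, traffic_a = unfused(x, scale=2.0, bias=1.0)
out_b, traffic_b = fused(x, scale=2.0, bias=1.0)
```

The results are identical, but the fused version moves a third of the data in this model (8 element transfers instead of 24). For cheap elementwise operations like these, that traffic is usually what limits speed, not the arithmetic.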
Triton auto-tunes BLOCK_SIZE: different sizes optimize for different hardware
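Triton auto-tunes BLOCK_SIZE by benchmarking a list of candidate configurations and keeping the fastest; the best size depends on the hardware, because it determines memory access patterns and how well the compute units are kept busy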
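In Triton itself this is done with the `triton.autotune` decorator, which takes a list of `triton.Config` candidates and a `key` of argument names that trigger re-tuning. The selection logic can be sketched in plain Python without a GPU. Everything below is a hypothetical stand-in: `autotune`, `blocked_sum`, and the made-up cost table are illustrations, not Triton's implementation.

```python
import time


def autotune(kernel, candidates, args, measure=None, repeats=3):
    """Return the candidate block size with the lowest measured cost."""
    def wall_clock(block_size):
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(*args, block_size=block_size)
            best = min(best, time.perf_counter() - start)
        return best

    measure = measure or wall_clock  # default: time the kernel for real
    timings = {bs: measure(bs) for bs in candidates}
    return min(timings, key=timings.get), timings


def blocked_sum(data, block_size):
    """Toy 'kernel': sum a list one block at a time."""
    total = 0.0
    for start in range(0, len(data), block_size):
        total += sum(data[start:start + block_size])
    return total


# Deterministic, made-up cost table standing in for real hardware timings.
fake_cost = {64: 3.0, 128: 1.5, 256: 2.0}
best, timings = autotune(blocked_sum, [64, 128, 256],
                         args=([1.0] * 1024,), measure=fake_cost.get)
```

With the made-up costs the tuner picks 128. Omitting `measure` makes it time the kernel with `time.perf_counter`, which mirrors the real approach more closely but makes the winner machine-dependent.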