GQA reduces KV-cache memory by the group factor

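GQA stores one set of keys and values per group of query heads, so the KV-cache shrinks by the group size (query heads per KV head) relative to standard multi-head attention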
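
The arithmetic behind this can be sketched as follows. The model dimensions (32 layers, 32 query heads, 8 KV heads, head size 128, fp16 cache) are hypothetical, and `kv_cache_bytes` is an illustrative helper, not a library function.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading factor 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical model: 32 layers, 32 query heads, head_dim 128, fp16 cache.
n_q_heads = 32
n_kv_heads = 8
group_size = n_q_heads // n_kv_heads   # query heads sharing each KV head

# Multi-head attention keeps one KV head per query head.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=n_q_heads, head_dim=128, seq_len=4096)
# GQA keeps one KV head per group.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=n_kv_heads, head_dim=128, seq_len=4096)

print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB, ratio: {mha // gqa}")
```

Under these assumptions the saving equals the group size, not the number of groups: fewer KV heads means a smaller cache.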

Related concepts

CPU cache

The L1/L2 cache hierarchy hides main-memory latency by keeping recently used data close to the cores

grouped query attention (GQA)

GQA shares each KV head across a group of query heads, cutting K/V projection parameters and KV-cache size
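
A minimal pure-Python sketch of the head sharing, using toy numbers. The function names (`attend`, `gqa`) and the mapping `h // group_size` are illustrative assumptions about how query heads are assigned to KV heads, not any particular library's API.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q_vec, keys, values):
    """Single-query scaled dot-product attention over cached key/value vectors."""
    d = len(q_vec)
    scores = [sum(a * b for a, b in zip(q_vec, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def gqa(queries, kv_keys, kv_values):
    """queries: one vector per query head. kv_keys / kv_values: one list of
    per-token vectors per KV head. Query head h reads KV head h // group_size."""
    group_size = len(queries) // len(kv_keys)
    return [attend(q, kv_keys[h // group_size], kv_values[h // group_size])
            for h, q in enumerate(queries)]

# 4 query heads, 2 KV heads, 2 cached tokens, head_dim 2 (toy numbers).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
kv_keys = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
kv_values = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
outputs = gqa(queries, kv_keys, kv_values)
```

Here query heads 0 and 1 read KV head 0, and query heads 2 and 3 read KV head 1, so only two sets of keys and values need to be cached for four query heads.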

paged attention (vLLM) improves serving throughput

Paged attention (vLLM) improves serving throughput by storing the KV-cache in fixed-size, non-contiguous blocks, which cuts memory fragmentation and lets more requests be batched together
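
The block-table idea can be sketched with a toy allocator. `PagedKVCache` and its methods are hypothetical names for illustration only; this is not vLLM's implementation, and it tracks slot bookkeeping rather than real tensors.

```python
class PagedKVCache:
    """Toy allocator: each sequence maps logical token positions to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                       # seq_id -> list of block ids
        self.lengths = {}                            # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:                 # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
cache.append_token("a")
cache.append_token("a")          # "a" fills block 0
cache.append_token("b")          # "b" starts block 1
slot = cache.append_token("a")   # "a" continues in block 2, not next to block 0
```

Because a sequence only ever wastes the unused tail of its last block, memory does not have to be reserved up front for the maximum possible length.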

KV-cache reduces redundant computation in autoregressive generation

KV-cache stores the keys and values of previous tokens, so each autoregressive decoding step computes them only for the newest token
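
A rough way to see the saving is to count K/V projections with and without a cache. In this sketch `project` is a hypothetical stand-in for a learned projection (it just scales the input), and attention itself is omitted.

```python
def project(x, scale):
    """Stand-in for a learned K or V projection (hypothetical: scales the input)."""
    return [scale * xi for xi in x]

def decode_without_cache(token_embeddings):
    """Recompute K and V for the whole prefix at every decoding step."""
    projections = 0
    for t in range(1, len(token_embeddings) + 1):
        keys = [project(x, 2.0) for x in token_embeddings[:t]]
        values = [project(x, 3.0) for x in token_embeddings[:t]]
        projections += len(keys) + len(values)
    return keys, values, projections

def decode_with_cache(token_embeddings):
    """Project only the newest token and append the result to the cache."""
    k_cache, v_cache, projections = [], [], 0
    for x in token_embeddings:
        k_cache.append(project(x, 2.0))
        v_cache.append(project(x, 3.0))
        projections += 2
    return k_cache, v_cache, projections

tokens = [[float(i), 1.0] for i in range(6)]
k_slow, v_slow, n_slow = decode_without_cache(tokens)
k_fast, v_fast, n_fast = decode_with_cache(tokens)
print(n_slow, n_fast)
```

Both paths end with the same keys and values, but the uncached count grows quadratically with sequence length while the cached count grows linearly.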

kernel fusion reduces memory bandwidth bottleneck

Kernel fusion eases the memory-bandwidth bottleneck by combining several operations into one kernel, so intermediate results stay in registers or shared memory instead of round-tripping through global memory
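
The effect on memory traffic can be sketched in plain Python by counting elements read and written. The scale, bias, ReLU chain is an assumed example, and each list comprehension stands in for one GPU kernel; real kernels would be written in CUDA or Triton.

```python
def unfused_scale_bias_relu(x, scale, bias):
    """Three separate 'kernels'; each reads its input and writes a full intermediate."""
    traffic = 0                               # elements moved to or from main memory
    scaled = [xi * scale for xi in x]         # kernel 1: read x, write scaled
    traffic += 2 * len(x)
    shifted = [s + bias for s in scaled]      # kernel 2: read scaled, write shifted
    traffic += 2 * len(x)
    out = [max(s, 0.0) for s in shifted]      # kernel 3: read shifted, write out
    traffic += 2 * len(x)
    return out, traffic

def fused_scale_bias_relu(x, scale, bias):
    """One 'kernel': intermediates live in a local value (a register on a GPU)."""
    out = [max(xi * scale + bias, 0.0) for xi in x]
    traffic = 2 * len(x)                      # read x once, write out once
    return out, traffic

x = [-2.0, -0.5, 0.0, 1.5]
out_a, traffic_a = unfused_scale_bias_relu(x, 2.0, 1.0)
out_b, traffic_b = fused_scale_bias_relu(x, 2.0, 1.0)
print(traffic_a, traffic_b)
```

The results are identical, but in this model the fused version moves a third of the data, which is what matters when an operation is bandwidth-bound rather than compute-bound.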

Triton auto-tunes BLOCK_SIZE: different sizes optimize for different hardware

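Triton's autotuner benchmarks candidate configurations (for example, several BLOCK_SIZE values) and keeps the fastest for the current hardware and input shape, since the best block size depends on memory-access patterns and occupancy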
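
The benchmark-and-pick idea can be illustrated without a GPU. This sketch is a stand-in, not Triton's API: `blocked_sum` and `autotune` are hypothetical helpers, and the candidate block sizes are arbitrary. In Triton itself the same role is played by the `@triton.autotune` decorator with a list of `triton.Config` objects.

```python
import time

def blocked_sum(data, block_size):
    """Sum `data` in chunks of `block_size` (a stand-in for a tiled kernel)."""
    total = 0.0
    for start in range(0, len(data), block_size):
        total += sum(data[start:start + block_size])
    return total

def autotune(fn, data, candidates, repeats=3):
    """Time each candidate block size; return the fastest one and all timings."""
    timings = {}
    for block_size in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(data, block_size)
            best = min(best, time.perf_counter() - start)
        timings[block_size] = best            # keep the best of several runs
    return min(timings, key=timings.get), timings

data = [1.0] * 100_000
best_block, timings = autotune(blocked_sum, data, candidates=[64, 256, 1024, 4096])
print(best_block)
```

Which block size wins depends on the machine it runs on, which is the point: the choice is measured rather than hard-coded.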
