A KV-cache stores the key and value tensors already computed for earlier tokens, so each autoregressive decoding step only processes the new token instead of recomputing the whole sequence

Image: BruceBlaus, CC BY 3.0, via Wikimedia Commons

KV-cache reduces redundant computation in autoregressive generation
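A minimal sketch of the idea, assuming a single attention head and a toy `project` function standing in for the learned key/value/query projections (both the helper and the `KVCache` class are hypothetical names for illustration, not any library's API). It checks that decoding with a cache gives the same output as recomputing every key and value at each step, while doing fewer projections.

```python
import math

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def project(x, scale):
    # Hypothetical stand-in for a learned linear projection (W @ x).
    return [scale * xi for xi in x]

class KVCache:
    """Keeps the keys and values of all tokens seen so far."""
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # number of tokens whose K/V were computed

    def step(self, x):
        # Only the new token is projected; older keys/values are reused.
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        self.projections += 1
        return attend(project(x, 1.0), self.keys, self.values)

def step_without_cache(tokens):
    # Recomputes keys and values for the whole prefix at every step.
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(project(tokens[-1], 1.0), keys, values), len(tokens)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
recomputed = 0
for t in range(len(tokens)):
    cached_out = cache.step(tokens[t])
    plain_out, n = step_without_cache(tokens[: t + 1])
    recomputed += n
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached_out, plain_out))

print(cache.projections, recomputed)  # 3 6
```

Without the cache, the number of key/value projections grows quadratically with sequence length (1 + 2 + ... + n); with it, the growth is linear, at the price of keeping the cached tensors in memory.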

Related concepts

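GQA reduces KV-cache memory by the group size

Grouped-query attention (GQA) shares one key/value head among each group of query heads, so the KV-cache shrinks by the ratio of query heads to key/value heads (the number of query heads per group)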
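The saving can be worked out with simple arithmetic. The sketch below assumes a hypothetical model configuration (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2 bytes per element); the numbers are illustrative and not taken from any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

n_query_heads = 32  # assumed for illustration

# Multi-head attention: one KV head per query head.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
# GQA with 8 KV heads: each KV head serves a group of 4 query heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(mha // 2**20, "MiB")  # 2048 MiB
print(gqa // 2**20, "MiB")  # 512 MiB
print(mha // gqa)           # 4, i.e. n_query_heads / n_kv_heads
```

Note that the reduction factor is the number of query heads per KV head: using fewer KV heads gives a smaller cache.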

Tesla Model Y

The Tesla Model Y was the world's best-selling electric vehicle in 2023

MoE models have more parameters but similar compute cost

MoE models spread their parameters across many experts but route each token to only the top-k of them, so per-token compute stays close to that of a much smaller dense model
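A minimal sketch of top-k routing, assuming scalar inputs, toy experts that just scale their input, and gate scores passed in directly (in a real MoE layer the scores would come from a learned router applied to the token; `moe_forward` is a hypothetical name).

```python
import math

def moe_forward(x, experts, gate_scores, k):
    """Run only the k highest-scoring experts and mix their outputs."""
    # Indices of the k experts with the largest gate scores.
    top = sorted(range(len(experts)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    # Softmax over the selected experts only.
    m = max(gate_scores[i] for i in top)
    w = [math.exp(gate_scores[i] - m) for i in top]
    z = sum(w)
    out = sum(wi / z * experts[i](x) for wi, i in zip(w, top))
    return out, top

# Four toy experts; only two of them run for this token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_scores = [0.1, 2.0, 0.3, 2.0]

out, used = moe_forward(1.0, experts, gate_scores, k=2)
print(used)  # [1, 3]
print(out)   # 3.0
```

All four experts count toward the parameter total, but the token only pays for the two that were selected.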

Overlapping subproblems

Dynamic programming solves problems with overlapping subproblems by storing each subproblem's result and reusing it instead of recomputing it
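As a small illustration, here is a memoized Fibonacci function. Fibonacci is chosen only because its subproblems overlap heavily; the `memo` dictionary and the call counter are illustrative additions.

```python
calls = 0

def fib(n, memo=None):
    """Fibonacci with memoization: each subproblem is solved once."""
    global calls
    calls += 1
    if memo is None:
        memo = {}
    if n < 2:
        return n
    if n not in memo:
        # Store the result so later requests for fib(n) are a lookup.
        memo[n] = fib(n - 1, memo) + fib(n - 2, memo)
    return memo[n]

result = fib(30)
print(result)  # 832040
```

Without the `memo` dictionary the same call would recurse over a million times; with it, the number of calls grows linearly in `n`.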

Gradient accumulation simulates larger batch sizes without more memory

Gradient accumulation splits a large batch into smaller micro-batches and sums their gradients before a single weight update, so activation memory scales with the micro-batch size rather than the effective batch size
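A minimal sketch with a one-parameter linear model and mean squared error, assuming made-up data and a hypothetical learning rate. It shows that gradients accumulated over two micro-batches equal the gradient of the full batch, provided each micro-batch contribution is scaled by the full batch size.

```python
def grad_sum(w, xs, ys):
    # Sum over examples of d/dw (w*x - y)^2 = 2 * (w*x - y) * x.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.0
lr = 0.01          # hypothetical learning rate
micro_size = 2     # only this many examples are processed at once

# Reference: gradient of the mean loss over the whole batch.
full = grad_sum(w, xs, ys) / len(xs)

# Accumulation: process micro-batches one at a time, add up the gradients.
accum = 0.0
for i in range(0, len(xs), micro_size):
    accum += grad_sum(w, xs[i:i + micro_size], ys[i:i + micro_size]) / len(xs)

# One weight update after all micro-batches have been seen.
w -= lr * accum
print(full, accum)  # -30.0 -30.0
```

The equivalence holds for losses that average over examples; layers whose behavior depends on batch statistics, such as batch normalization, would still see the smaller micro-batch.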

CPU cache

The L1/L2 cache hierarchy reduces average main-memory access latency by keeping recently used data close to the CPU core
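The effect can be estimated with the standard average memory access time (AMAT) formula. The latencies and miss rates below are made-up round numbers chosen for illustration, not measurements of any real CPU.

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency):
    """Average memory access time for a two-level cache, in cycles."""
    # An L1 miss pays the L2 lookup; an L2 miss additionally pays main memory.
    l2_penalty = l2_hit + l2_miss_rate * mem_latency
    return l1_hit + l1_miss_rate * l2_penalty

# Assumed values: 1-cycle L1, 10-cycle L2, 100-cycle main memory.
with_caches = amat(l1_hit=1, l1_miss_rate=0.1,
                   l2_hit=10, l2_miss_rate=0.2, mem_latency=100)
no_caches = 100  # every access goes straight to main memory

print(with_caches, no_caches)  # about 4 cycles versus 100
```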
