KV-cache stores the key and value tensors computed for earlier tokens so that autoregressive models do not recompute them at every decoding step
Image: BruceBlaus, CC BY 3.0, via Wikimedia Commons
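A minimal sketch of the idea for a single attention head, in plain Python. The functions `project_k` and `project_v` are hypothetical stand-ins for the learned key and value projections, and the token itself is used as the query for simplicity. Both decoders return the same outputs; the cached one projects each token only once.

```python
import math

def project_k(x):
    # Hypothetical stand-in for multiplying by a learned key matrix.
    return [2.0 * xi for xi in x]

def project_v(x):
    # Hypothetical stand-in for multiplying by a learned value matrix.
    return [xi + 1.0 for xi in x]

def attend(q, keys, values):
    # Scaled dot-product attention of one query over all keys/values so far.
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def decode_no_cache(tokens):
    # Re-projects every earlier token at every step: O(n^2) projections.
    outputs = []
    for t in range(1, len(tokens) + 1):
        keys = [project_k(x) for x in tokens[:t]]
        values = [project_v(x) for x in tokens[:t]]
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs

def decode_with_cache(tokens):
    # Projects each token once and appends to the cache: O(n) projections.
    keys, values, outputs = [], [], []
    for x in tokens:
        keys.append(project_k(x))
        values.append(project_v(x))
        outputs.append(attend(x, keys, values))
    return outputs

tokens = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
same = decode_no_cache(tokens) == decode_with_cache(tokens)
```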
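GQA reduces KV-cache memory by the group size
GQA shares each key/value head across a group of query heads, so KV-cache storage shrinks by the number of query heads per group (H/G for H query heads and G key/value heads)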
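A rough sizing sketch makes the saving concrete. The configuration below (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2 bytes per element) is an illustrative assumption, not a specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # Leading factor 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Multi-head attention: every one of the 32 query heads has its own KV head.
mha_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32,
                           head_dim=128, seq_len=4096)

# GQA with 8 KV heads: 4 query heads share each KV head.
gqa_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8,
                           head_dim=128, seq_len=4096)

reduction = mha_bytes // gqa_bytes  # equals query heads per group, 32 / 8
```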
Tesla Model Y
Tesla Model Y was the world's best-selling electric vehicle in 2023
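MoE models have more parameters than a dense model of similar per-token compute cost
MoE models spread their parameters across many experts but route each token to only k of them, so per-token compute scales with the k active experts rather than the total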
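A back-of-the-envelope sketch of total versus active parameters in one MoE feed-forward layer. It assumes a plain two-matrix feed-forward expert with no biases and a linear router; the sizes are illustrative only.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    # Assumed expert shape: up projection plus down projection, no biases.
    per_expert = 2 * d_model * d_ff
    # Assumed router: one linear layer scoring every expert.
    router = d_model * num_experts
    total = num_experts * per_expert + router   # stored in memory
    active = top_k * per_expert + router        # used per token
    return total, active

total, active = moe_ffn_params(d_model=4096, d_ff=14336,
                               num_experts=8, top_k=2)
active_fraction = active / total  # roughly top_k / num_experts
```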
Overlapping subproblems
Dynamic programming solves problems with overlapping subproblems by storing each subproblem's result so that it is computed only once
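Fibonacci numbers are a common illustration. The sketch below counts function calls with and without memoization; the `calls` counter is added here only for demonstration.

```python
from functools import lru_cache

calls = {"naive": 0, "memo": 0}

def fib_naive(n):
    # Recomputes the same subproblems many times.
    calls["naive"] += 1
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    # Each distinct n is computed once; repeats are served from the cache.
    calls["memo"] += 1
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

naive_result = fib_naive(20)
memo_result = fib_memo(20)
```

Both calls return the same value, but `calls["naive"]` grows exponentially with `n` while `calls["memo"]` grows linearly.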
Gradient accumulation simulates larger batch sizes without more activation memory
Gradient accumulation reduces memory usage by splitting a large batch into smaller micro-batches and summing their gradients before a single weight update
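A small framework-free sketch, assuming a one-parameter model `y ≈ w * x` with mean squared error and a batch that divides evenly into micro-batches. Scaling each micro-batch gradient by the number of micro-batches reproduces the full-batch gradient.

```python
def grad_mse(w, xs, ys):
    # Gradient of mean((w*x - y)^2) with respect to w.
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch_size):
    # Assumes len(xs) is a multiple of micro_batch_size.
    num_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(num_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Only one micro-batch is processed at a time.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / num_micro
    return total  # a single weight update would use this value

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
```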
CPU cache
The L1/L2 cache hierarchy reduces average memory access latency by keeping recently used data close to the core