KV-cache minimizes redundant computations by storing intermediate results in autoregressive models

How KV-cache reduces redundant computation in autoregressive generation

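A KV-cache stores the attention keys and values already computed for earlier tokens, so each decoding step computes keys and values only for the newest token instead of recomputing them for the whole prefix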
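A minimal sketch of the saving, in pure Python. The `project` function is a hypothetical stand-in for the learned key and value projections (here just a scaling), and the query is the raw token vector. Both decoders return the same outputs, but they differ in how many projections they compute.

```python
import math

def project(x, scale):
    """Hypothetical stand-in for a learned K or V projection."""
    return [scale * xi for xi in x]

def attend(q, keys, values):
    """Softmax-weighted sum of values for a single query vector."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

def decode_without_cache(tokens):
    outputs, calls = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        # K and V are recomputed for the whole prefix at every step.
        keys = [project(x, 0.5) for x in prefix]
        values = [project(x, 2.0) for x in prefix]
        calls += 2 * len(prefix)
        outputs.append(attend(tokens[t], keys, values))
    return outputs, calls

def decode_with_cache(tokens):
    outputs, calls, k_cache, v_cache = [], 0, [], []
    for x in tokens:
        # Only the new token is projected; earlier K/V come from the cache.
        k_cache.append(project(x, 0.5))
        v_cache.append(project(x, 2.0))
        calls += 2
        outputs.append(attend(x, k_cache, v_cache))
    return outputs, calls
```

For n tokens the uncached loop performs n(n+1) projections and the cached loop performs 2n, at the cost of keeping every past key and value in memory.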

Related concepts

Why memory coalescing matters — adjacent threads reading adjacent memory addresses

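On a GPU, when adjacent threads in a warp read adjacent addresses, the hardware combines their accesses into a few wide memory transactions; strided or scattered accesses need many more transactions and waste bandwidth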
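The effect can be estimated with a small model that counts how many aligned memory segments one warp-wide load touches. The warp size of 32 and the 128-byte segment are assumptions typical of NVIDIA GPUs, and `transactions_per_warp` is an illustrative name, not a real API.

```python
def transactions_per_warp(stride_elems, elem_bytes=4, warp_size=32, segment_bytes=128):
    """Count the distinct aligned segments touched when thread t reads element t * stride."""
    addresses = [tid * stride_elems * elem_bytes for tid in range(warp_size)]
    return len({addr // segment_bytes for addr in addresses})

coalesced = transactions_per_warp(stride_elems=1)   # adjacent threads, adjacent floats
strided = transactions_per_warp(stride_elems=32)    # e.g. walking down a column of a row-major matrix
```

Under these assumptions the unit-stride load fits in a single segment, while the stride-32 load touches a separate segment for every thread and uses only 4 of the 128 bytes fetched each time.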

How tiling works in matrix multiplication — loading blocks into shared memory

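Tiling partitions the matrices into blocks small enough to fit in fast on-chip memory (shared memory on a GPU), so each block is loaded from slow global memory once and reused for many multiply-adds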
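The loop structure can be sketched on the CPU in pure Python. This only illustrates the blocking pattern: on a GPU the two tiles would be staged in shared memory by a thread block, which this sketch does not model, and `tile=2` is an arbitrary choice.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), one tile x tile block at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Load one block of each input; every element is reused
                # across the whole block of outputs below.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[p] * b_tile[p][j] for p in range(len(b_tile))
                        )
    return c
```

Slicing handles edge tiles, so the dimensions do not need to be multiples of the tile size.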

What causal masking does — prevents attention to future tokens in the decoder

Causal masking sets the attention scores for future positions to negative infinity before the softmax, so each token attends only to itself and earlier tokens, preserving the autoregressive property of the decoder
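A minimal sketch of the masking step, assuming a square matrix of finite raw scores where `scores[i][j]` is how strongly position i attends to position j. The function name is illustrative.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores, with every position j > i masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions get -inf, which exp() maps to exactly 0.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        top = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

Each row still sums to 1, but all of its weight falls on the current and earlier positions, so the result is lower triangular.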

Why most transformer operations are memory-bound, not compute-bound

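Many transformer operations (elementwise activations, normalization, softmax, and the matrix-vector products of token-by-token decoding) perform few arithmetic operations per byte they read, so their runtime is limited by memory bandwidth rather than compute throughput; large batched matrix multiplications are the main exception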
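The distinction can be made concrete with arithmetic intensity, the number of FLOPs per byte moved. The sketch below uses a simplified model that assumes each matrix crosses the memory bus exactly once and uses 2-byte elements; the 4096 dimension is an arbitrary example.

```python
def arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) product under the simple model above."""
    flops = 2 * m * k * n  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

decode_step = arithmetic_intensity(1, 4096, 4096)     # one token against a weight matrix
prefill = arithmetic_intensity(4096, 4096, 4096)      # many tokens at once
```

In this model the single-token product does about 1 FLOP per byte while the batched product does over a thousand. An operation is memory-bound when its intensity falls below the hardware's ratio of peak FLOP/s to memory bandwidth, which on current accelerators is often in the low hundreds.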

What consistent hashing does: minimizes remapping when nodes join/leave

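Consistent hashing places nodes and keys on the same hash ring, so when a node joins or leaves, only the keys next to it on the ring move (about K/N of K keys across N nodes) instead of nearly all of them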
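A small ring can be sketched with the standard library. The `HashRing` class, the use of MD5 as the ring hash, and the 100 virtual nodes per physical node are illustrative choices, not a reference implementation.

```python
import bisect
import hashlib

def ring_hash(key):
    """Map a string to a point on the ring (MD5 is just a convenient stable hash)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node owns several points on the ring to even out the load.
        self._ring = sorted(
            (ring_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key):
        # The owner is the first ring point clockwise from the key's hash.
        idx = bisect.bisect_right(self._points, ring_hash(key)) % len(self._ring)
        return self._ring[idx][1]

keys = [f"key{i}" for i in range(2000)]
before = HashRing(["a", "b", "c"])
after = HashRing(["a", "b", "c", "d"])
moved = [k for k in keys if before.lookup(k) != after.lookup(k)]
```

Every key in `moved` now belongs to the new node `d`, and no key moves between the existing nodes. A naive `hash(key) % N` scheme would instead remap most keys when N changes from 3 to 4.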

What LSM trees optimize: write-heavy workloads by buffering writes in memory

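LSM trees optimize write-heavy workloads by buffering writes in an in-memory table and flushing it to disk as sorted, immutable files, which turns random writes into sequential ones; background compaction later merges those files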
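The write path can be sketched as a toy in-memory model. `TinyLSM` is a hypothetical class: its sorted runs are Python lists standing in for on-disk files, and it omits the write-ahead log, deletes, and compaction.

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}        # absorbs writes in memory
        self.runs = []            # flushed sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # One sequential write of a sorted, immutable run.
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:     # newest run wins over older ones
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Writes never modify an existing run, which keeps them cheap. Reads pay for it by checking the memtable and then each run from newest to oldest, which is why real systems add compaction to bound the number of runs.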
