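What the KV-cache does: stores the keys and values of past tokens during autoregressive decoding
The KV-cache stores the attention keys and values already computed for earlier tokens, so each decoding step in an autoregressive model computes them only for the newest token instead of recomputing the whole prefix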
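The caching idea can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not a production implementation: the names `attend` and `KVCache` are hypothetical, and the query, key, and value vectors are assumed to be already projected.

```python
import math


def attend(q, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(len(values[0]))
    ]


class KVCache:
    """Keeps the keys and values of all tokens seen so far (hypothetical helper)."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Only the newest token's key and value are appended; earlier
        # entries are reused as-is instead of being recomputed.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)
```

Each call to `step` does work proportional to the current sequence length, whereas recomputing keys and values for the whole prefix at every step would repeat all earlier projections.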
Why memory coalescing matters — adjacent threads reading adjacent memory addresses
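Memory coalescing lets the GPU combine the loads of adjacent threads in a warp into a few wide memory transactions when those threads access adjacent addresses, so less memory bandwidth is wasted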
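A rough way to see the effect is to count how many memory segments one warp touches for a given access stride. The sketch below is a simplified model, not a description of any specific GPU: the 32-thread warp, 4-byte elements, and 128-byte segments are assumptions chosen as typical figures, and `transactions` is a hypothetical helper.

```python
def transactions(stride, warp_size=32, elem_bytes=4, segment_bytes=128):
    """Count distinct memory segments touched when thread i reads element i * stride.

    Simplified model: every distinct segment costs one memory transaction.
    """
    segments = {
        (thread * stride * elem_bytes) // segment_bytes
        for thread in range(warp_size)
    }
    return len(segments)


if __name__ == "__main__":
    for stride in (1, 2, 32):
        print(f"stride {stride}: {transactions(stride)} transaction(s)")
```

Under these assumptions a stride of 1 fits the whole warp's reads into a single segment, while a stride of 32 elements puts every thread in its own segment.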
How tiling works in matrix multiplication — loading blocks into shared memory
Tiling splits the matrices into blocks small enough to fit in shared memory (or cache), so each loaded element is reused for many multiply-adds instead of being re-read from slower global memory
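The loop structure of a tiled multiply can be sketched in plain Python. This only illustrates the blocking pattern: Python lists stand in for shared memory, and the function name `tiled_matmul` and the default tile size are assumptions for the example.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (n x k) by b (k x m) one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # On a GPU these two blocks would be copied into shared
                # memory once and then reused by every thread in the block.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[p] * b_tile[p][j] for p in range(len(b_tile))
                        )
    return c
```

The slices handle matrices whose dimensions are not multiples of the tile size, since the last tile in each direction is simply smaller.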
What causal masking does — prevents attention to future tokens in the decoder
Causal masking prevents each position in a transformer decoder from attending to future tokens, preserving the autoregressive property
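One common way to apply the mask is to set the scores of future positions to negative infinity before the softmax, so their weights become exactly zero. The sketch below assumes a square matrix of raw attention scores; the name `causal_softmax` is hypothetical.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions (j > i) get -inf, which exp() maps to 0.0.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        top = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With all scores equal, the first token attends only to itself, the second splits its attention evenly over two positions, and so on.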
Why most transformer operations are memory-bound, not compute-bound
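Many transformer operations perform few arithmetic operations per byte they read or write, so memory bandwidth, not compute throughput, limits their speed; examples include elementwise operations, normalization, softmax, and the matrix-vector products of single-token decoding, whereas large batched matrix multiplications are usually compute-bound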
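The distinction can be made concrete with arithmetic intensity, the number of floating-point operations per byte moved. The sketch below uses a deliberately simple cost model (each matrix is read or written exactly once, 2-byte elements) and a hypothetical machine balance of 200 FLOPs per byte; real hardware figures vary, so treat the numbers as illustrative only.

```python
def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul, each matrix touched once."""
    flops = 2 * m * k * n  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def is_memory_bound(intensity, machine_balance=200.0):
    """machine_balance is a hypothetical peak-FLOPs / peak-bandwidth ratio."""
    return intensity < machine_balance


if __name__ == "__main__":
    decode = matmul_intensity(1, 4096, 4096)      # one token times a weight matrix
    prefill = matmul_intensity(4096, 4096, 4096)  # many tokens at once
    print(f"decode:  {decode:.2f} FLOPs/byte, memory-bound: {is_memory_bound(decode)}")
    print(f"prefill: {prefill:.2f} FLOPs/byte, memory-bound: {is_memory_bound(prefill)}")
```

Under this model a single-token decode step performs about one FLOP per byte, far below the assumed balance point, while a large batched multiply reuses each weight many times and lands well above it.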
What consistent hashing does: minimizes remapping when nodes join/leave
Consistent hashing minimizes data redistribution when nodes are added or removed: only the keys owned by the affected node are remapped, while all other keys stay where they are
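A minimal hash ring with virtual nodes shows the behavior. This is a sketch under stated assumptions: the class name `HashRing`, the use of SHA-256 as the hash, and the default of 100 virtual nodes per server are illustrative choices, not taken from any particular system.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual nodes per physical node
        self._ring = []           # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for r in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{r}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key):
        # First ring entry at or after the key's position, wrapping around.
        idx = bisect.bisect(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]
```

When a node is added, the only keys whose owner changes are those that now fall just before one of the new node's positions; every other key keeps its previous owner.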
What LSM trees optimize: write-heavy workloads by buffering writes in memory
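LSM trees optimize write-heavy workloads by buffering writes in an in-memory table and flushing it to disk as sorted, immutable files that are merged later by compaction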
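The write path can be sketched with a toy in-memory model. This is an illustration only: `TinyLSM` is a hypothetical class, Python lists stand in for on-disk sorted files, and details such as write-ahead logging, deletes, and leveled compaction are left out.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}   # in-memory write buffer
        self.sstables = []   # sorted (key, value) lists, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the buffer out as one sorted, immutable run.
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest run wins
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self):
        # Merge all runs into one, keeping the newest value per key.
        merged = {}
        for table in self.sstables:  # oldest to newest, so later values overwrite
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []
```

Writes only touch the in-memory table until it fills up, which is why they are cheap; reads may have to check several runs, and compaction is what keeps that number small.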