Why most transformer operations are memory-bound, not compute-bound

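Most transformer operations perform little arithmetic per byte of data they move, so their speed is limited by memory bandwidth rather than by compute. Large matrix multiplications are the exception because they reuse each loaded value many times; element-wise operations, normalization, softmax, and the matrix-vector products of token-by-token decoding touch each value only once or twice.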
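
A rough way to check this is to compare an operation's arithmetic intensity (FLOPs per byte moved) against the hardware's ratio of peak compute to memory bandwidth. The sketch below assumes 2-byte elements, that each matrix crosses the memory bus exactly once, and a hypothetical accelerator with 300 TFLOP/s and 2 TB/s; none of these figures come from the text above.

```python
def matmul_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], assuming each matrix moves once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def elementwise_intensity(bytes_per_elem=2):
    """One FLOP per element, with one read and one write of that element."""
    return 1 / (2 * bytes_per_elem)


# Hypothetical hardware numbers (assumption): 300 TFLOP/s compute, 2 TB/s bandwidth.
RIDGE = 300e12 / 2e12  # FLOPs per byte needed to become compute-bound

big_matmul = matmul_intensity(4096, 4096, 4096)  # large training-style matmul
decode_step = matmul_intensity(1, 4096, 4096)    # one token times a weight matrix
activation = elementwise_intensity()             # e.g. an element-wise nonlinearity

for name, value in [("big matmul", big_matmul),
                    ("decode step", decode_step),
                    ("element-wise", activation)]:
    kind = "compute-bound" if value > RIDGE else "memory-bound"
    print(f"{name}: {value:.2f} FLOPs/byte -> {kind}")
```

Under these assumptions only the large matmul clears the ridge point; the single-token product and the element-wise operation sit far below it.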

Related concepts

How tiling works in matrix multiplication — loading blocks into shared memory

Tiling partitions the matrices into blocks small enough to fit in fast on-chip memory (shared memory on a GPU), so each value loaded from slow memory is reused many times before it is replaced
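
The loop structure of tiling can be sketched in plain Python. This is only an illustration of the access pattern, not a fast implementation: the function name and the `tile` parameter are made up for the example, and the two slices stand in for the copy into shared memory that a GPU kernel would perform.

```python
def matmul_tiled(A, B, tile=2):
    """Multiply A (n x k) by B (k x m), both lists of rows, one block at a time."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # block row of C
        for j0 in range(0, m, tile):      # block column of C
            for k0 in range(0, k, tile):  # walk along the shared dimension
                # On a GPU these two blocks would be loaded into shared memory once
                # and then reused by every thread working on this block of C.
                a_tile = [row[k0:k0 + tile] for row in A[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in B[k0:k0 + tile]]
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0.0
                        for kk, a_val in enumerate(a_row):
                            acc += a_val * b_tile[kk][j]
                        C[i0 + i][j0 + j] += acc
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(A, B, tile=2))
```

Slicing handles matrices whose sizes are not multiples of the tile size, since the last block in each dimension is simply smaller.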

How the attention mechanism weights input tokens dynamically during sequence encoding

Attention assigns each input token a weight computed from the input itself, so every position builds its representation from the tokens most relevant to it
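
One common form of this weighting is scaled dot-product attention, sketched below for a single head with plain lists. The function names are illustrative, and the sketch leaves out the learned projections, multiple heads, and batching that a real model would use.

```python
import math


def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Return (outputs, weights) for query rows Q against key rows K and value rows V."""
    d = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt of the key width.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the weighted average of the value rows.
        outputs.append([sum(wi * v[c] for wi, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights


Q = [[10.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, w = attention(Q, K, V)
print(w[0], out[0])
```

Because the query points in the same direction as the first key, almost all of the weight lands on the first value row.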

What causal masking does — prevents attention to future tokens in the decoder

Causal masking prevents each decoder position from attending to later tokens, preserving the autoregressive property
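
A minimal sketch of the mask, assuming the raw attention scores are already available as a square list of rows: scores for future positions are set to negative infinity before the softmax, so their weights come out as exactly zero. The function name is illustrative.

```python
import math


def causal_weights(scores):
    """Row i of `scores` holds query i's raw scores against every key position."""
    weights = []
    for i, row in enumerate(scores):
        # Keep positions j <= i; push future positions to -inf.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        top = max(masked)
        exps = [math.exp(s - top) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


uniform_scores = [[0.0] * 3 for _ in range(3)]
for row in causal_weights(uniform_scores):
    print(row)
```

With equal scores everywhere, the first token attends only to itself, the second splits its weight over two positions, and the third over all three.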

How KV-cache reduces redundant computation in autoregressive generation

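The KV-cache stores the key and value vectors of tokens already processed, so each generation step computes keys and values only for the newest token instead of recomputing them for the whole sequence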
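
The saving can be sketched with a single attention head in plain Python. The class and helper names below (`KVCache`, `project`, `attend`, `step_no_cache`) are hypothetical, and the `projections` counter exists only to make the amount of work visible.

```python
import math


def project(x, W):
    """Multiply vector x (length d_in) by matrix W (d_in rows, d_out columns)."""
    return [sum(xi * w for xi, w in zip(x, col)) for col in zip(*W)]


def attend(q, keys, values):
    """Scaled dot-product attention for one query over the given keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]


class KVCache:
    """Keeps keys and values of earlier tokens so they are projected only once."""

    def __init__(self, Wk, Wv):
        self.Wk, self.Wv = Wk, Wv
        self.keys, self.values = [], []
        self.projections = 0  # number of key/value projections performed

    def step(self, x, Wq):
        # Only the newest token is projected; older entries are reused.
        self.keys.append(project(x, self.Wk))
        self.values.append(project(x, self.Wv))
        self.projections += 2
        return attend(project(x, Wq), self.keys, self.values)


def step_no_cache(xs, Wq, Wk, Wv):
    """Same result without a cache: every earlier token is projected again."""
    keys = [project(x, Wk) for x in xs]
    values = [project(x, Wv) for x in xs]
    return attend(project(xs[-1], Wq), keys, values), 2 * len(xs)
```

Generating T tokens costs 2T key/value projections with the cache and T(T+1) without it, while both paths return the same attention output at every step.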

How the Transformer encoder differs from decoder — encoder is bidirectional, decoder is causal

The Transformer encoder lets every token attend to the whole input in both directions, while the decoder applies a causal (left-to-right) mask so each token sees only itself and earlier tokens
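
The difference comes down to which positions each token is allowed to see. The helper below is a hypothetical illustration that lists the visible positions for a short sequence under each pattern.

```python
def visible_positions(seq_len, causal):
    """For each position i, list the positions j it may attend to."""
    return [[j for j in range(seq_len) if not causal or j <= i]
            for i in range(seq_len)]


print("encoder:", visible_positions(3, causal=False))
print("decoder:", visible_positions(3, causal=True))
```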

Why memory coalescing matters — adjacent threads reading adjacent memory addresses

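When adjacent GPU threads read adjacent memory addresses, the hardware can coalesce their loads into a few wide memory transactions, which uses memory bandwidth far more efficiently than scattered accesses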
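
The effect can be approximated by counting how many memory segments a group of threads touches. This is a simplified model, assuming 32 threads per group, 4-byte elements, and 128-byte segments; real hardware differs in transaction sizes and caching, and the function names are made up for the example.

```python
def warp_addresses(stride_elems, warp_size=32, elem_bytes=4, base=0):
    """Byte address read by each thread when thread t reads element t * stride."""
    return [base + t * stride_elems * elem_bytes for t in range(warp_size)]


def transactions(addresses, segment_bytes=128):
    """Number of distinct aligned segments needed to serve all the addresses."""
    return len({addr // segment_bytes for addr in addresses})


for stride in (1, 2, 32):
    count = transactions(warp_addresses(stride))
    print(f"stride {stride}: {count} transaction(s)")
```

In this model a stride of 1 is served by a single segment, while a stride of 32 elements forces one segment per thread, even though both patterns read the same number of useful bytes.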
