FlashAttention speeds up attention by tiling the computation across the input, avoiding N×N matrix materialization

Attention (machine learning)

FlashAttention speeds up attention by computing it in tiles over blocks of the input and keeping running softmax statistics, so the full N×N attention matrix is never materialized in GPU memory
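A minimal sketch of the tiling idea, assuming plain Python lists and a hypothetical `block` parameter for the tile size. It streams over keys and values one block at a time and keeps only a running maximum, a running softmax denominator, and a running output per query, so no full row of the N×N matrix is ever stored. Real FlashAttention kernels also tile over queries and run in on-chip GPU memory; this only illustrates the arithmetic.

```python
import math

def naive_attention(Q, K, V):
    """Reference version: builds a full row of scores for every query."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block=2):
    """Process keys/values in blocks, keeping only running statistics."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        m = -math.inf        # running max of the scores seen so far
        l = 0.0              # running softmax denominator
        acc = [0.0] * dv     # running unnormalized output
        for start in range(0, len(K), block):
            Kb = K[start:start + block]
            Vb = V[start:start + block]
            s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in Kb]
            m_new = max(m, max(s))
            scale = math.exp(m - m_new)   # rescales old statistics; exp(-inf) is 0
            p = [math.exp(x - m_new) for x in s]
            l = l * scale + sum(p)
            acc = [a * scale + sum(pi * v[j] for pi, v in zip(p, Vb))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

For any block size, `tiled_attention` should agree with `naive_attention` up to floating-point rounding, which is the property that lets the tiled form replace the materialized one.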

Related concepts

Attention Is All You Need

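Self-attention as introduced in the Transformer costs O(n²) time and memory in the sequence length n, which becomes the bottleneck for long sequences; tiled variants such as FlashAttention reduce the extra memory to O(n)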
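To put the cost of materializing an N×N score matrix in numbers, here is a small back-of-the-envelope calculation. It assumes 2 bytes per element (16-bit floats) and counts a single attention head; both are illustrative assumptions, and the helper name is hypothetical.

```python
def score_matrix_bytes(n, bytes_per_element=2):
    """Bytes needed to store one n x n attention score matrix."""
    return n * n * bytes_per_element

for n in (1024, 8192, 65536):
    mib = score_matrix_bytes(n) / 2**20
    print(f"N={n:>6}: {mib:,.0f} MiB per head")
# N=  1024: 2 MiB per head
# N=  8192: 128 MiB per head
# N= 65536: 8,192 MiB per head
```

Growing the sequence 8× multiplies the matrix size by 64×, which is why avoiding the materialized matrix matters for long inputs.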

Ring attention

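Ring attention distributes a long sequence across multiple devices and passes key/value blocks around a ring, so each device attends to the full sequence while holding only its own block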
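A single-process simulation of the ring schedule, offered as a sketch rather than a real distributed implementation. It assumes non-causal attention, treats list indices as devices, and uses the hypothetical names `Q_blocks`, `K_blocks`, and `V_blocks` for each device's shard. Each simulated device keeps its query block fixed, processes whichever key/value block it currently holds, and then hands that block to its neighbor. Actual systems overlap this communication with compute.

```python
import math

def ring_attention(Q_blocks, K_blocks, V_blocks):
    """Simulate P devices; device i owns Q_blocks[i] and starts with K/V block i."""
    P = len(Q_blocks)
    d = len(Q_blocks[0][0])
    dv = len(V_blocks[0][0])
    # Running softmax statistics, one entry per local query row on each device.
    state = [[{"m": -math.inf, "l": 0.0, "acc": [0.0] * dv} for _ in Qb]
             for Qb in Q_blocks]
    held = list(range(P))  # index of the K/V block each device currently holds
    for _ in range(P):
        for dev in range(P):
            Kb, Vb = K_blocks[held[dev]], V_blocks[held[dev]]
            for q, st in zip(Q_blocks[dev], state[dev]):
                s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in Kb]
                m_new = max(st["m"], max(s))
                scale = math.exp(st["m"] - m_new)  # rescale earlier partial results
                p = [math.exp(x - m_new) for x in s]
                st["l"] = st["l"] * scale + sum(p)
                st["acc"] = [a * scale + sum(pi * v[j] for pi, v in zip(p, Vb))
                             for j, a in enumerate(st["acc"])]
                st["m"] = m_new
        # Every device receives the K/V block held by its predecessor in the ring.
        held = [held[(dev - 1) % P] for dev in range(P)]
    # Normalize the accumulated outputs; result is one output block per device.
    return [[[a / st["l"] for a in st["acc"]] for st in dev_state]
            for dev_state in state]
```

After P steps every device has seen every key/value block exactly once, so the concatenated outputs should match ordinary attention over the whole sequence.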

Flashbulb memory

Flashbulb memories are vivid but not always accurate

Masking (behavior)

Causal masking prevents attention to future tokens in the decoder
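A minimal sketch of causal masking, assuming a square matrix of raw attention scores given as nested Python lists (the function name is illustrative). Scores for future positions are set to negative infinity before the softmax, so their weights come out as exactly zero.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    n = len(scores)
    out = []
    for i in range(n):
        # Future positions (j > i) are masked out with -inf.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

weights = causal_softmax([[0.0, 1.0, 2.0],
                          [0.0, 0.0, 5.0],
                          [1.0, 1.0, 1.0]])
# Row 0 attends only to token 0, row 1 splits evenly over tokens 0 and 1,
# and row 2 spreads evenly over all three tokens.
```

The large scores in the upper triangle (1.0, 2.0, 5.0) have no effect on the result, which is the point of the mask.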

Paged attention (vLLM)

Paged attention (vLLM) improves serving throughput by storing the KV cache in fixed-size, non-contiguous blocks, which reduces memory fragmentation and waste so more requests can be batched together
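The bookkeeping behind non-contiguous KV-cache pages can be sketched with a toy block table. This is not vLLM's API: the class, method names, and sizes below are hypothetical, and only slot allocation is modeled, not the attention kernel or the cached tensors.

```python
class PagedKVCache:
    """Toy block-table allocator; names and sizes are illustrative only."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.lengths = {}        # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:   # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token(s) for s in ["a", "b", "a", "a", "b"]]
# Sequence "a" now occupies physical blocks 0 and 2, which are not adjacent.
```

Because a sequence only claims a new block when its last one is full, at most one partially filled block per sequence is wasted, and freed blocks can be reused by any other sequence.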

Matrix multiplication algorithm

Tiling divides matrices into smaller blocks and loads each block into fast on-chip memory (shared memory on GPUs), so its elements are reused many times during matrix multiplication
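The loop structure of a tiled matrix multiplication can be sketched in plain Python. The `tile` parameter and function name are illustrative assumptions; on a GPU the two tiles touched in the innermost section would be staged in shared memory, which this CPU sketch does not model.

```python
def tiled_matmul(A, B, tile=2):
    """Multiply A (n x k) by B (k x m), one tile-sized block at a time."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Work on one block of A and one block of B; each loaded
                # element is reused up to `tile` times within this section.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = C[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] = s
    return C

C = tiled_matmul([[1, 2, 3], [4, 5, 6]],
                 [[7, 8], [9, 10], [11, 12]])
# C == [[58.0, 64.0], [139.0, 154.0]]
```

The result is the same as an untiled triple loop; only the order in which elements are visited changes, which is what improves memory locality.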
