Flash attention
Flash attention speeds up attention by computing it in tiles over blocks of the input, so the full N×N score matrix is never materialized in GPU high-bandwidth memory
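The tiling idea can be illustrated for a single query vector. This is a minimal pure-Python sketch, not the fused GPU kernel: the function names and `block_size` are hypothetical, and the usual 1/sqrt(d) score scaling is left out for brevity. It shows how a running maximum and running sum (an "online softmax") let keys and values be visited one block at a time while giving the same result as the version that builds every score first.

```python
import math

def naive_attention(q, keys, values):
    """Reference: softmax(q . K) V for one query, computing all scores up front."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def tiled_attention(q, keys, values, block_size=2):
    """Same output, but only one block of scores exists at any time."""
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized weighted sum of values
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier statistics to the new maximum (0.0 on the first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding; the tiled one keeps only `block_size` scores at a time, which is the property that lets a real kernel stay inside fast on-chip memory.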
Attention Is All You Need
Standard self-attention costs O(n²) time and memory in sequence length n, which becomes the bottleneck for long sequences
Ring attention
Ring attention distributes a long sequence across multiple devices, passing key/value blocks around a ring so that each device's queries eventually attend to the full sequence
Flashbulb memory
Flashbulb memories are vivid but not always accurate
Masking (behavior)
Causal masking prevents attention to future tokens in the decoder
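A small sketch of causal masking, assuming the raw scores are given as a square list of lists where `scores[i][j]` is query position i against key position j (the function name is hypothetical). Future positions are set to negative infinity before the softmax, so their weights come out as exactly zero.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores, with every key position j > i masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only look at positions 0..i.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With all-equal scores, row i spreads its weight uniformly over the first i + 1 positions and gives nothing to later ones.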
Paged attention (vLLM)
Paged attention (vLLM) improves serving throughput by storing the KV cache in fixed-size, non-contiguous blocks, which reduces memory fragmentation and lets more requests be batched together
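The non-contiguous paging idea can be sketched as a toy block allocator. This is an illustration only and not vLLM's API: the class `PagedKVCache`, its methods, and its fields are all hypothetical names. Each sequence keeps a block table that maps its logical token positions to whichever physical blocks happened to be free.

```python
class PagedKVCache:
    """Toy allocator: sequences borrow fixed-size blocks from a shared pool."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because a sequence only ever holds the blocks it has actually filled, at most one partially used block per sequence is wasted, rather than a contiguous region reserved up front for the maximum length.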
Matrix multiplication algorithm
Tiling divides matrices into smaller blocks, loading them into shared memory for efficient matrix multiplication
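The block structure of tiled matrix multiplication can be shown with plain loops. This is a sketch of the loop ordering only, with a hypothetical function name and `tile` parameter: pure Python has no shared memory, so it illustrates which elements a GPU kernel would stage together, not the speedup itself.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), visiting one tile x tile block at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # tile row of the output
        for j0 in range(0, m, tile):          # tile column of the output
            for k0 in range(0, k, tile):      # tile along the shared dimension
                # On a GPU, the a and b tiles touched here would be loaded
                # into shared memory once and reused by every thread in the block.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            s += a[i][kk] * b[kk][j]
                        c[i][j] = s
    return c
```

Any `tile` size gives the same product; what changes is how many times each input element is fetched from slow memory in a real kernel.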