
Ring attention distributes long sequences across multiple devices
Attention Is All You Need
Self-attention costs O(n²) in sequence length n, which becomes the bottleneck for long sequences
Attention (machine learning)
Flash attention speeds up attention by computing it in tiles over the input, avoiding materialization of the full N×N score matrix
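The tiling idea can be illustrated with a running ("online") softmax. The sketch below is a simplified, single-query illustration in plain Python, not the actual FlashAttention kernel, which also tiles over queries and keeps tiles in on-chip memory. The function names and `tile_size` parameter are hypothetical.

```python
import math

def naive_attention(q, keys, values):
    """Reference: softmax(q . K^T / sqrt(d)) V for one query, using all scores at once."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim_v)]

def tiled_attention(q, keys, values, tile_size=2):
    """Same result, but only one tile of scores exists at any time."""
    d = len(q)
    dim_v = len(values[0])
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * dim_v           # unnormalized weighted sum of values
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first tile).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile_size=2)
```

Both functions should agree up to floating-point rounding; the tiled version never holds more than `tile_size` scores at once.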
Masking (behavior)
Causal masking prevents attention to future tokens in the decoder
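A minimal sketch of causal masking in plain Python: scores for future positions are set to negative infinity before the softmax, so their weights become zero. The helper names `causal_mask` and `masked_softmax` are illustrative, not from any library.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores):
    """Row-wise softmax over an n x n score matrix with future positions masked out."""
    n = len(scores)
    mask = causal_mask(n)
    out = []
    for i in range(n):
        row = [s if mask[i][j] else float("-inf") for j, s in enumerate(scores[i])]
        m = max(row)  # position i itself is always allowed, so m is finite
        exps = [math.exp(s - m) for s in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# With uniform scores, token i spreads its weight evenly over tokens 0..i.
weights = masked_softmax([[0.0] * 3 for _ in range(3)])
```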
Paged attention (vLLM) improves serving throughput by storing the KV cache in fixed-size, non-contiguous pages, which reduces memory fragmentation and lets more requests share a GPU
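The paging idea resembles virtual memory: each sequence keeps a block table that maps logical token positions to physical blocks, which need not be adjacent. The toy class below is a hypothetical illustration of that bookkeeping only; it is not vLLM's API, and real entries would be key/value tensors rather than strings.

```python
class PagedKVCache:
    """Toy KV cache that allocates fixed-size blocks on demand (illustrative only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.storage = [[None] * block_size for _ in range(num_blocks)]
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append(self, seq_id, kv):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            # Last block is full (or none exists yet): take any free block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        block = table[length // self.block_size]
        self.storage[block][length % self.block_size] = kv
        self.lengths[seq_id] = length + 1

    def get(self, seq_id, pos):
        block = self.block_tables[seq_id][pos // self.block_size]
        return self.storage[block][pos % self.block_size]

    def free(self, seq_id):
        # Return a finished sequence's blocks to the pool for reuse.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
# Two sequences grow in interleaved order, as they would under batched serving.
for seq_id, kv in [("a", "a0"), ("b", "b0"), ("a", "a1"), ("a", "a2"), ("b", "b1")]:
    cache.append(seq_id, kv)
```

Because blocks are claimed only when a sequence actually needs them, no sequence has to reserve contiguous space for its maximum possible length.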
Grouped query attention (GQA) shares each KV head across a group of Q heads, shrinking the KV cache and its memory traffic
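A small sketch of the head sharing in plain Python. It assumes query heads are grouped contiguously (a common convention, but an assumption here), and the function names and head counts are illustrative.

```python
def kv_head_for(q_head, num_q_heads, num_kv_heads):
    """Index of the KV head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size  # assumption: contiguous groups of query heads

def kv_cache_elements(num_kv_heads, head_dim, seq_len):
    """Elements stored per layer: one key and one value vector per KV head per token."""
    return 2 * num_kv_heads * head_dim * seq_len

# 8 query heads sharing 2 KV heads: heads 0-3 use KV head 0, heads 4-7 use KV head 1.
mapping = [kv_head_for(h, num_q_heads=8, num_kv_heads=2) for h in range(8)]

# Compared with one KV head per query head, the cache is 4x smaller here.
gqa_size = kv_cache_elements(num_kv_heads=2, head_dim=64, seq_len=1024)
mha_size = kv_cache_elements(num_kv_heads=8, head_dim=64, seq_len=1024)
```

Setting `num_kv_heads` equal to `num_q_heads` recovers standard multi-head attention, and setting it to 1 gives multi-query attention.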
Reasoning model
Reasoning language models (RLMs) excel at logic, math, and programming tasks