Paged attention (vLLM) improves serving throughput by reducing latency through non-contiguous KV-cache pages, enabling faster data retrieval
Image: Daniel Voigt Godoy, CC BY 4.0, via Wikimedia Commons
Paged attention (vLLM) improves serving throughput by reducing latency through non-contiguous KV-cache pages, enabling faster data retrieval
Attention Is All You Need
O(n) complexity for long sequences
GQA reduces KV-cache memory by the group factor
GQA reduces KV-cache memory by dividing storage by the number of groups
Attention (machine learning)
Flash attention speeds up processing by tiling attention across input, avoiding N×N matrix materialization
grouped query attention (GQA) does
GQA shares KV heads across multiple Q heads for efficient parameter usage
ring attention does: distributes long sequences across multiple devices
Ring attention distributes long sequences across multiple devices
Flashbulb memory
Flashbulb memories are vivid but not always accurate
One email a day: 5 concepts + the 5 stories that matter →
Swipe through 100 ML concepts daily
Open TickerNews