Paged attention (vLLM) improves serving throughput by storing the KV cache in fixed-size, non-contiguous pages, which cuts memory fragmentation and lets more requests share a batch
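A minimal sketch of the paging idea, assuming a toy allocator rather than vLLM's actual implementation: each sequence keeps a block table that maps its logical token positions onto fixed-size physical pages, so pages need not be contiguous and are only claimed when needed. The class name `PagedKVCache` and its methods are hypothetical.

```python
class PagedKVCache:
    """Toy block-table allocator: each sequence owns a list of fixed-size pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # physical page ids not in use
        self.block_tables = {}                    # seq_id -> list of physical page ids
        self.lengths = {}                         # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_page, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.page_size == 0:
            # The last page is full (or the sequence has no page yet):
            # claim any free page, wherever it sits in physical memory.
            if not self.free_pages:
                raise MemoryError("no free KV-cache pages")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.page_size

    def free(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

In this sketch at most one partially filled page per sequence is wasted, instead of a full contiguous reservation for the maximum sequence length.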

Image: Daniel Voigt Godoy, CC BY 4.0, via Wikimedia Commons

Related concepts

Attention Is All You Need

O(n²) complexity in sequence length n, which becomes costly for long sequences

GQA reduces KV-cache memory by the group factor: with H query heads sharing G key/value heads, the cache is H/G times smaller than with standard multi-head attention
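The saving can be checked with simple arithmetic. The sketch below assumes the KV cache stores one key and one value vector per layer, per KV head, per token; the function name `kv_cache_bytes` and the configuration numbers are illustrative, not taken from any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed for keys and values (the factor 2) across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Illustrative config: 32 layers, 32 query heads, head_dim 128, 4096 tokens, fp16.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(mha // 2**20, "MiB with one KV head per query head")
print(gqa // 2**20, "MiB with 8 KV heads")
print("reduction factor:", mha // gqa)  # 32 query heads / 8 KV heads
```

Under these assumptions the cache size scales with the number of KV heads, so the reduction equals the number of query heads that share each KV head.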

Attention (machine learning)

Flash attention speeds up attention by tiling the computation into blocks that fit in fast on-chip memory, so the full N×N score matrix is never materialized in GPU main memory
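The tiling relies on a running ("online") softmax, which can be sketched in plain Python for a single query. This is only an illustration of the numerical trick, assuming list-of-floats vectors; the real kernel also tiles over queries and runs fused on the GPU. The function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: compute all scores at once, then softmax-weight the values."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(len(values[0]))]


def tiled_attention(q, keys, values, block_size=2):
    """Process keys/values block by block, keeping only running statistics."""
    m = float("-inf")               # running maximum score
    denom = 0.0                     # running softmax denominator
    acc = [0.0] * len(values[0])    # running weighted sum of values
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # rescale old statistics to the new maximum
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - m_new)
            denom += p
            acc = [a + p * x for a, x in zip(acc, v)]
        m = m_new
    return [a / denom for a in acc]
```

Only one block of scores exists at any time, yet the result matches the naive computation up to floating-point rounding.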

Grouped-query attention (GQA) shares each key/value head across a group of query heads, which shrinks the KV cache and the memory traffic during decoding
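The head sharing can be pictured as a simple index mapping. The sketch below assumes contiguous grouping (consecutive query heads share one KV head) and that the query-head count is a multiple of the KV-head count; the helper name `kv_head_for_query_head` is hypothetical.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Return the index of the KV head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per shared KV head
    return q_head // group_size


# 8 query heads sharing 2 KV heads: heads 0-3 use KV head 0, heads 4-7 use KV head 1.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(mapping)
```

With `num_kv_heads == num_q_heads` the mapping is the identity (multi-head attention), and with `num_kv_heads == 1` every query head reads the same KV head (multi-query attention).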

Ring attention distributes a long sequence across multiple devices and passes key/value blocks around a ring, so each device attends to the full sequence while holding only one block at a time
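A single-process simulation can illustrate the ring schedule. It assumes non-causal attention, one query shard and one key/value shard per "device", and plain Python lists in place of real inter-device communication; all names are hypothetical and this is not a distributed implementation.

```python
import math


def full_attention(queries, keys, values):
    """Reference: every query sees every key/value at once."""
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, values)) / z
                    for j in range(len(values[0]))])
    return out


def ring_attention(q_shards, k_shards, v_shards):
    """Simulate P devices; device i keeps q_shards[i] and starts with KV shard i."""
    num_dev = len(q_shards)
    dim = len(v_shards[0][0])
    # Per-query running state: (max score, softmax denominator, weighted value sum).
    state = [[(float("-inf"), 0.0, [0.0] * dim) for _ in qs] for qs in q_shards]
    for step in range(num_dev):
        for dev in range(num_dev):
            src = (dev - step) % num_dev  # KV shard that has reached this device
            for qi, q in enumerate(q_shards[dev]):
                m, denom, acc = state[dev][qi]
                for k, v in zip(k_shards[src], v_shards[src]):
                    s = sum(a * b for a, b in zip(q, k))
                    m_new = max(m, s)
                    scale = math.exp(m - m_new)  # rescale earlier contributions
                    p = math.exp(s - m_new)
                    denom = denom * scale + p
                    acc = [a * scale + p * x for a, x in zip(acc, v)]
                    m = m_new
                state[dev][qi] = (m, denom, acc)
    return [[[a / denom for a in acc] for (_, denom, acc) in dev_state]
            for dev_state in state]
```

After as many steps as there are devices, every query shard has seen every key/value shard, so the concatenated outputs match full attention up to rounding.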

Flashbulb memory

Flashbulb memories are vivid but not always accurate
