Ring attention

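Ring attention splits a long sequence into blocks across multiple devices and passes key/value blocks around a ring, so each device attends over the full sequence without ever holding all of it in memory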
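A minimal single-process sketch of the idea, assuming non-causal attention and plain Python lists in place of real devices and communication. The names `ring_attention`, `attention`, and the shard layout are illustrative, not taken from any library. Each simulated device keeps its own query shard and folds in one key/value shard per step using a running softmax.

```python
import math

def attention(q, keys, values):
    """Reference softmax attention for a single query vector."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(values[0])
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z for j in range(dim)]

def ring_attention(q_shards, k_shards, v_shards):
    """Simulate a ring: device `dev` owns q_shards[dev]; K/V shards rotate
    so that after n_dev steps every device has seen every shard once."""
    n_dev = len(q_shards)
    dim = len(v_shards[0][0])
    # Per-query running state: (max score, normalizer, unnormalized output).
    state = [[(-math.inf, 0.0, [0.0] * dim) for _ in qs] for qs in q_shards]
    for step in range(n_dev):
        for dev in range(n_dev):
            src = (dev - step) % n_dev  # K/V shard held by `dev` at this step
            for i, q in enumerate(q_shards[dev]):
                m, z, acc = state[dev][i]
                for k, v in zip(k_shards[src], v_shards[src]):
                    s = sum(a * b for a, b in zip(q, k))
                    m_new = max(m, s)
                    scale = math.exp(m - m_new)  # rescale earlier partial sums
                    w = math.exp(s - m_new)
                    z = z * scale + w
                    acc = [a * scale + w * x for a, x in zip(acc, v)]
                    m = m_new
                state[dev][i] = (m, z, acc)
    return [[[a / z for a in acc] for (_, z, acc) in dev_state]
            for dev_state in state]
```

The result should match full attention over the concatenated sequence up to floating-point error, since the running max and normalizer only reorder the same sums.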

Related concepts

Attention Is All You Need

Self-attention costs O(n²) in sequence length n, which becomes the bottleneck for long sequences

Attention (machine learning)

Flash attention computes exact attention tile by tile, so the full N×N score matrix is never materialized in GPU memory
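The tiling idea can be sketched for a single query in plain Python. This is an illustration of the running-softmax trick only, not the fused GPU kernel; `tiled_attention`, `naive_attention`, and the `tile` parameter are names made up for this example. Only `tile` scores exist at any time, yet the output matches the naive version.

```python
import math

def naive_attention(q, keys, values):
    """Materializes every score for the query before normalizing."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]

def tiled_attention(q, keys, values, tile=2):
    """Processes keys/values one tile at a time with a running softmax."""
    dim = len(values[0])
    m, z, acc = -math.inf, 0.0, [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        # Only this tile's scores are held in memory.
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # correct earlier tiles for the new max
        w = [math.exp(s - m_new) for s in scores]
        z = z * scale + sum(w)
        acc = [a * scale + sum(wi * v[j] for wi, v in zip(w, v_tile))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / z for a in acc]
```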

Masking (behavior)

Causal masking prevents attention to future tokens in the decoder
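As a small illustration, assuming a square matrix of raw scores where `scores[i][j]` is query i against key j, the mask amounts to normalizing each row over positions 0..i only. The function name `causal_attention_weights` is hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Row i gets softmax weights over keys 0..i and zero weight for j > i."""
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # future positions are excluded before softmax
        m = max(visible)
        w = [math.exp(s - m) for s in visible]
        z = sum(w)
        out.append([x / z for x in w] + [0.0] * (len(row) - i - 1))
    return out
```

Note that a large score at a future position has no effect on the row, which is the point of masking before normalization rather than after.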

Paged attention (vLLM)

Paged attention (vLLM) improves serving throughput by storing the KV cache in fixed-size, non-contiguous pages, which cuts memory fragmentation and lets more requests be batched together
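A toy sketch of the bookkeeping, assuming a fixed pool of equally sized blocks and a per-sequence block table. The class `PagedKVCache` and its methods are hypothetical and are not vLLM's API; the sketch only shows how logical token positions map to non-contiguous physical blocks.

```python
class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.blocks = [[None] * block_size for _ in range(num_blocks)]
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}               # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of cached tokens

    def append(self, seq_id, kv):
        """Store one token's KV entry, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block is full (or none yet)
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.blocks[table[-1]][n % self.block_size] = kv
        self.lengths[seq_id] = n + 1

    def get(self, seq_id, pos):
        """Translate a logical position to (physical block, offset)."""
        table = self.block_tables[seq_id]
        return self.blocks[table[pos // self.block_size]][pos % self.block_size]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.block_tables.pop(seq_id))
        self.lengths.pop(seq_id)
```

Because blocks are handed out on demand, a sequence wastes at most one partially filled block, and interleaved requests can share the pool without reserving contiguous space up front.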

Grouped query attention (GQA)

GQA shares each KV head across a group of Q heads, shrinking the KV cache and the memory bandwidth needed at inference
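The head mapping can be sketched as follows, assuming query heads are assigned to KV heads in contiguous groups and the head counts divide evenly. Both helper names are made up for this example, and the cache size is counted per layer in stored values rather than bytes.

```python
def kv_head_for(q_head, num_q_heads, num_kv_heads):
    """Index of the KV head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

def kv_cache_values(num_kv_heads, head_dim, seq_len):
    """Values stored per layer: one K and one V vector per KV head per token."""
    return 2 * num_kv_heads * head_dim * seq_len
```

With 8 query heads and 2 KV heads, heads 0 to 3 read KV head 0 and heads 4 to 7 read KV head 1, and the cache is a quarter of the size it would be with 8 KV heads.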

Reasoning model

Reasoning language models (RLMs) perform well on multi-step logic, math, and programming tasks
