Causal masking prevents attention to future tokens in the decoder
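
A minimal sketch of the idea, in plain Python with a made-up score matrix: each query position i takes a softmax over positions 0..i only, so every later position gets weight zero. The function name `causal_attention_weights` is illustrative and does not come from any particular library.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], allowing only j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Positions j > i lie in the future for query i, so they are excluded.
        allowed = scores[i][: i + 1]
        row_max = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in allowed]
        total = sum(exps)
        # Future positions receive exactly zero attention weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# Made-up raw attention scores for a 3-token sequence.
scores = [
    [0.0, 1.0, 2.0],
    [1.0, 1.0, 5.0],
    [0.5, 0.5, 0.5],
]
weights = causal_attention_weights(scores)
```

Dropping the future entries before the softmax gives the same result as the more common formulation, which sets those scores to negative infinity and then applies a full softmax.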

Masking (behavior)


Related concepts

Attention (machine learning)

FlashAttention speeds up attention by computing it in tiles over the input, which avoids materializing the full N×N attention matrix
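
The tiling relies on a running ("online") softmax, which the sketch below illustrates for a single query vector in plain Python. It is only the arithmetic, not the fused GPU kernel: the function names, the `block_size` parameter, and the toy data are assumptions, and the usual 1/sqrt(d) score scaling is left out for brevity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    """Reference version: materializes every score for the query at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[d] for e, v in zip(exps, values)) / total for d in range(dim)]

def tiled_attention(q, keys, values, block_size=2):
    """Processes keys/values block by block, keeping only running statistics."""
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized weighted sum of values
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data: one query, five keys and values of dimension 2.
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -0.3], [0.4, 0.9], [-0.5, 0.7], [0.3, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]
reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
```

Both functions return the same output up to floating-point error, but the tiled version never holds more than `block_size` scores at a time.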

Large language model

LLMs can generate, summarize, translate, and analyze text in many contexts

Attention Is All You Need

O(n) complexity for long sequences

Gradient checkpointing trade-off: recompute activations to save memory

Gradient checkpointing trades off computation time for memory savings by recomputing activations
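
A toy illustration of that trade-off, assuming a chain of scalar `tanh` layers and hand-written derivatives. Everything here is hypothetical scaffolding, not a framework API: the first function stores every layer input, the second stores only one checkpoint per segment and recomputes the rest during the backward pass.

```python
import math

def layer(w, x):
    return math.tanh(w * x)

def layer_grad(w, x):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    return w * (1.0 - math.tanh(w * x) ** 2)

def grad_full_storage(weights, x):
    """Keeps the input of every layer for the backward pass."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, inp in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, inp)
    return grad, len(inputs)

def grad_checkpointed(weights, x, segment=2):
    """Keeps one checkpoint per segment and recomputes inside each segment."""
    checkpoints = []
    for start in range(0, len(weights), segment):
        checkpoints.append(x)  # only the segment input is stored
        for w in weights[start:start + segment]:
            x = layer(w, x)
    grad = 1.0
    peak_stored = 0
    for idx in reversed(range(len(checkpoints))):
        seg_weights = weights[idx * segment:(idx + 1) * segment]
        h = checkpoints[idx]
        inputs = []
        for w in seg_weights:  # extra forward work: recompute the segment
            inputs.append(h)
            h = layer(w, h)
        peak_stored = max(peak_stored, len(checkpoints) + len(inputs))
        for w, inp in zip(reversed(seg_weights), reversed(inputs)):
            grad *= layer_grad(w, inp)
    return grad, peak_stored

# Made-up weights for an 8-layer chain.
weights = [0.5, -1.2, 0.8, 1.5, -0.7, 0.9, 1.1, -0.4]
g_full, stored_full = grad_full_storage(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3, segment=2)
```

Both versions produce the same gradient. The checkpointed one holds fewer values at its peak and pays for that with a second forward pass through each segment.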

What the tokenizer's special tokens do: [CLS], [SEP], [PAD], and [MASK] each have a specific role

[CLS] marks the start of the input, [SEP] separates segments, [PAD] pads shorter sequences to a fixed length, and [MASK] hides tokens for the model to predict
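
A hedged sketch of how these tokens might be laid out for a BERT-style sentence pair. The helpers `build_input` and `mask_token` are hypothetical and work on pre-split word lists; a real tokenizer would also map tokens to integer IDs.

```python
def build_input(tokens_a, tokens_b, max_len):
    """Lays out a sentence pair as [CLS] a [SEP] b [SEP], then pads to max_len."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len")
    n_pad = max_len - len(tokens)
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [1] * len(tokens) + [0] * n_pad
    return tokens + ["[PAD]"] * n_pad, attention_mask

def mask_token(tokens, position):
    """Hides one token for masked-language-model style prediction."""
    masked = list(tokens)
    label = masked[position]
    masked[position] = "[MASK]"
    return masked, label

tokens, attention_mask = build_input(["the", "cat", "sat"], ["it", "slept"], max_len=10)
masked_tokens, label = mask_token(tokens, 2)
```

Here `tokens` holds eight real entries followed by two `[PAD]` entries, and `masked_tokens` replaces "cat" with `[MASK]` while `label` keeps the original word as the prediction target.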

Why soft targets carry more information than hard labels: they encode class similarities

Soft targets carry more information than hard labels because they encode class similarities
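
To make this concrete, the sketch below compares a one-hot label with a temperature-softened distribution. The class names, teacher logits, and temperature are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures flatten the distribution."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "truck"]
teacher_logits = [5.0, 3.0, -2.0]   # hypothetical teacher outputs for a cat image

hard_label = [1.0, 0.0, 0.0]        # says only "cat"
soft_target = softmax(teacher_logits, temperature=2.0)
```

The hard label treats "dog" and "truck" as equally wrong. The soft target gives "dog" a much larger probability than "truck", which tells a student model that cats look more like dogs than like trucks.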
