Causal masking prevents attention to future tokens in the decoder
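As a minimal sketch of the idea, the pure-Python function below applies a causal mask before a row-wise softmax. The name `causal_softmax` and the example scores are illustrative assumptions, not taken from any particular library.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    n = len(scores)
    out = []
    for i in range(n):
        # Mask future positions (j > i) by setting their score to -inf.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)  # finite, because j == i is never masked
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Made-up raw attention scores for a 3-token sequence.
weights = causal_softmax([[0.0, 1.0, 2.0],
                          [1.0, 0.0, 3.0],
                          [2.0, 1.0, 0.0]])
```

Every entry above the diagonal of `weights` is zero, so each token's output depends only on itself and earlier tokens.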
Attention (machine learning)
FlashAttention speeds up attention by computing it in tiles over the input, so the full N×N score matrix is never materialized in memory
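The tiling idea rests on an "online softmax": keys and values are processed one tile at a time while a running maximum and normalizer are kept, so the full row of scores never has to exist at once. The sketch below shows only that numerical trick for a single query, in pure Python. It omits the 1/sqrt(d) scaling, the tiling over queries, and the on-chip memory management of the real kernel, and the function names are illustrative assumptions.

```python
import math

def attention_reference(q, keys, values):
    """Standard attention for one query: all scores are held at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(values[0])
    return [sum(w[j] * values[j][d] for j in range(len(values))) / z
            for d in range(dim)]

def attention_tiled(q, keys, values, tile=2):
    """Same result, computed tile by tile with a running max and normalizer."""
    dim = len(values[0])
    running_max = -math.inf
    normalizer = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale the partial results accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        normalizer *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            normalizer += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]

# Made-up query, keys, and values for illustration.
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -0.5], [0.3, 0.9], [-0.7, 0.4], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, -0.5]]
ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
```

Both functions return the same output up to floating-point rounding; the tiled version only ever holds `tile` scores at a time.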
Large language model
LLMs can generate, summarize, translate, and analyze text in many contexts
Attention Is All You Need
O(n) complexity for long sequences
What gradient checkpointing trades: it recomputes activations to save memory
Gradient checkpointing trades off computation time for memory savings by recomputing activations
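To make the trade-off concrete, here is a small hand-written sketch on a chain of scalar layers `tanh(w * x)`. It is not how any framework implements checkpointing; the functions `grad_full` and `grad_checkpointed` and the parameter `every` are assumptions made for illustration.

```python
import math

def forward(x, weights):
    """Run the chain and return the output plus every layer input."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = math.tanh(w * x)
    return x, inputs

def grad_full(x, weights):
    """d(output)/d(input), keeping all layer inputs in memory."""
    _, inputs = forward(x, weights)
    g = 1.0
    for w, a in zip(reversed(weights), reversed(inputs)):
        g *= w * (1.0 - math.tanh(w * a) ** 2)
    return g

def grad_checkpointed(x, weights, every=2):
    """Same gradient, keeping only one stored input per segment of layers."""
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x  # store segment inputs only
        x = math.tanh(w * x)
    g = 1.0
    for start in sorted(checkpoints, reverse=True):
        segment = weights[start:start + every]
        # Extra compute: rerun the segment's forward pass from its checkpoint.
        _, inputs = forward(checkpoints[start], segment)
        for w, a in zip(reversed(segment), reversed(inputs)):
            g *= w * (1.0 - math.tanh(w * a) ** 2)
    return g, len(checkpoints)

# Made-up weights for a 5-layer chain.
weights = [0.5, -1.2, 0.8, 1.5, -0.3]
g_full = grad_full(0.7, weights)
g_ckpt, stored = grad_checkpointed(0.7, weights, every=2)
```

With five layers and `every=2`, three inputs are stored instead of five, at the cost of one additional forward pass through each segment during the backward sweep.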
What the tokenizer's special tokens do: [CLS], [SEP], [PAD], and [MASK] each have a specific role
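[CLS] marks the start of the input and its final hidden state is typically used for classification, [SEP] separates segments, [PAD] fills shorter sequences so a batch has equal length, [MASK] replaces tokens the model must predict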
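A toy illustration of how these tokens are typically laid out for a BERT-style sentence pair follows. The helpers `build_pair` and `mask_positions` are hypothetical and work on plain string tokens; a real tokenizer would also map tokens to integer IDs.

```python
def build_pair(tokens_a, tokens_b, max_len):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP], padded to max_len."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len")
    n_pad = max_len - len(tokens)
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [1] * len(tokens) + [0] * n_pad
    return tokens + ["[PAD]"] * n_pad, attention_mask

def mask_positions(tokens, positions):
    """Replace chosen tokens with [MASK]; keep the originals as labels."""
    masked = list(tokens)
    labels = [None] * len(tokens)
    for p in positions:
        labels[p] = masked[p]
        masked[p] = "[MASK]"
    return masked, labels

tokens, attention_mask = build_pair(["hello", "world"], ["good", "day"], 9)
masked, labels = mask_positions(tokens, [2])
```

Here `tokens` ends with two `[PAD]` entries whose attention-mask value is 0, and `labels` records that position 2 originally held `"world"`.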
Why soft targets carry more information than hard labels: they encode class similarities
Soft targets carry more information than hard labels because they encode class similarities
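A small numeric sketch can show the difference. The class names, teacher logits, and temperature below are made-up values chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher values flatten the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "truck"]
teacher_logits = [5.0, 3.5, -2.0]  # hypothetical teacher output for a cat image

# A hard label treats every wrong class identically.
hard_label = [1.0, 0.0, 0.0]

# Soft targets keep the teacher's ranking of the wrong classes.
soft_targets = softmax(teacher_logits, temperature=2.0)
```

The hard label gives "dog" and "truck" the same weight of zero, while the soft targets assign "dog" much more probability than "truck", which tells a student model that cats resemble dogs more than trucks.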