Transformers use multi-head attention for contextualizing tokens
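In a transformer, each token's representation is updated by multi-head attention: several attention heads run in parallel, and each head computes its own weighted mix of the other tokens in the sequence. Because the heads can attend to different positions and relationships at once, the resulting representation reflects the token's context rather than the token alone.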
Example
For a sentence like "The cat sat on the mat," each word token (e.g., "cat," "sat," "mat") is contextualized by considering its relationship with other words in the sentence.
This is the basic mechanism by which transformers build context-aware token representations for language modeling.
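The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the weights are random rather than learned, the sizes (d_model = 8, two heads) are arbitrary assumptions, and the token vectors are random stand-ins for real embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                    # each row sums to 1
    heads = weights @ v                                   # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model, num_heads = 8, 2  # illustrative sizes
x = rng.normal(size=(len(tokens), d_model))  # stand-in for token embeddings
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (6, 8): one contextualized vector per token
print(weights.shape)  # (2, 6, 6): one attention pattern per head
```

Each head produces its own 6 x 6 attention pattern over the sentence, which is what lets different heads track different relationships between the words.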
Pre-LN transformers are easier to train
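Pre-LN transformers apply LayerNorm to the input of each sublayer, inside the residual branch, so the skip connection itself stays an unnormalized identity path. Gradients can flow through that path more directly during backpropagation, which tends to make training more stable than in Post-LN, where LayerNorm sits after the residual addition.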
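The difference between the two orderings is easiest to see side by side. The sketch below assumes a LayerNorm without learned scale and shift, and uses a hypothetical `sublayer` function as a stand-in for attention or an MLP.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN ordering: normalize after adding the residual.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN ordering: normalize only the sublayer input; the skip path is untouched.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))

# Hypothetical sublayer that outputs zeros, to isolate what happens to the skip path.
zero_sublayer = lambda h: np.zeros_like(h)

pre_out = pre_ln_block(x, zero_sublayer)
post_out = post_ln_block(x, zero_sublayer)
print(np.allclose(pre_out, x))   # True: the input passes through unchanged
print(np.allclose(post_out, x))  # False: the input is renormalized on the way out
```

With Pre-LN the block reduces to the identity when the sublayer contributes nothing, so stacking many blocks still leaves a direct path from output to input.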
The embedding layer maps discrete token IDs to dense learned vectors
The embedding layer is a learned lookup table: each token ID selects one row, a dense vector that the rest of the network can process.
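As a rough sketch, an embedding lookup is just row indexing into a matrix. The vocabulary and the vector size below are made up for illustration, and the table is random here, whereas a real model learns it during training.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}  # hypothetical toy vocabulary
d_model = 4                                                # illustrative vector size
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))   # learned in a real model

token_ids = np.array([vocab[w] for w in "the cat sat on the mat".split()])
vectors = embedding_table[token_ids]  # row lookup, shape (6, 4)

# The lookup is equivalent to multiplying one-hot vectors by the table.
one_hot = np.eye(len(vocab))[token_ids]
assert np.allclose(one_hot @ embedding_table, vectors)
```

Both occurrences of "the" map to the same row, so they start with identical vectors; it is the later attention layers that make them differ by context.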
Masking (behavior)
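Causal masking prevents each position in the decoder from attending to future tokens, typically by setting the attention scores for later positions to negative infinity before the softmax.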
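A minimal sketch of that masking step, assuming a single head and raw scores laid out with one row per query position:

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (seq_len, seq_len) raw attention scores; row i is query position i."""
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -np.inf, scores)                       # hide future positions
    masked = masked - masked.max(axis=-1, keepdims=True)             # numerical stability
    e = np.exp(masked)                                               # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform scores make the effect of the mask easy to see
weights = causal_attention_weights(scores)
print(weights)
# Row 1 puts weight 0.5 on positions 0 and 1 and weight 0 on positions 2 and 3.
```

Each row spreads its attention only over itself and earlier positions, so the prediction at position i cannot depend on tokens that come after it.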
Ring attention distributes long sequences across multiple devices
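Ring attention splits a long sequence into blocks and assigns one block to each device. Every device keeps its own queries while the key and value blocks are passed around the devices in a ring, so each device eventually attends over the full sequence without ever holding all of it in memory.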
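The idea can be simulated in a single process. The sketch below is an illustration under simplifying assumptions (no causal mask, one head, no real communication): a loop index stands in for the ring rotation, and a streaming softmax combines the partial results. A real system would overlap the block transfers with computation.

```python
import numpy as np

def full_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def ring_attention_simulated(q, k, v, num_devices):
    """Single-process simulation of ring attention over sequence blocks."""
    q_blocks = np.array_split(q, num_devices)
    kv_blocks = list(zip(np.array_split(k, num_devices), np.array_split(v, num_devices)))
    d = q.shape[-1]
    # Per-device running statistics for a numerically stable streaming softmax.
    run_max = [np.full((qb.shape[0], 1), -np.inf) for qb in q_blocks]
    run_sum = [np.zeros((qb.shape[0], 1)) for qb in q_blocks]
    run_out = [np.zeros((qb.shape[0], v.shape[-1])) for qb in q_blocks]

    for step in range(num_devices):
        for dev in range(num_devices):
            # After `step` rotations, device `dev` holds the block that started on dev - step.
            k_blk, v_blk = kv_blocks[(dev - step) % num_devices]
            scores = q_blocks[dev] @ k_blk.T / np.sqrt(d)
            new_max = np.maximum(run_max[dev], scores.max(axis=-1, keepdims=True))
            scale = np.exp(run_max[dev] - new_max)  # rescale earlier partial results
            p = np.exp(scores - new_max)
            run_sum[dev] = run_sum[dev] * scale + p.sum(axis=-1, keepdims=True)
            run_out[dev] = run_out[dev] * scale + p @ v_blk
            run_max[dev] = new_max
    return np.concatenate([o / s for o, s in zip(run_out, run_sum)])

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))  # 8 tokens, 4 features (illustrative)
ring_out = ring_attention_simulated(q, k, v, num_devices=4)
print(np.allclose(ring_out, full_attention(q, k, v)))  # True
```

Each simulated device only ever works with one query block and one key/value block at a time, yet the combined output matches ordinary attention over the whole sequence.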
Most transformer operations are memory-bound, not compute-bound
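An operation is memory-bound when its runtime is limited by moving data to and from memory rather than by arithmetic. Many transformer operations, such as normalization, softmax, element-wise activations, residual additions, and the matrix multiplications of small-batch decoding, perform few floating-point operations per byte transferred, so memory bandwidth is the bottleneck. Large-batch matrix multiplications are the main exception.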
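One way to make this concrete is to estimate arithmetic intensity (FLOPs per byte moved) and compare it with a device's ratio of peak compute to peak bandwidth. The sketch below uses a simplified cost model that ignores caching, and the hardware numbers are hypothetical round figures, not the specifications of any real accelerator.

```python
def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m, k) @ (k, n) matmul, assuming each matrix moves once."""
    flops = 2 * m * k * n                                   # one multiply and one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A and B, write C
    return flops / bytes_moved

def elementwise_intensity(flops_per_elem=1, bytes_per_elem=2):
    """FLOPs per byte for an element-wise op: one read and one write per element."""
    return flops_per_elem / (2 * bytes_per_elem)

# Hypothetical accelerator, for illustration only.
peak_flops = 100e12      # FLOP/s
peak_bandwidth = 1e12    # bytes/s
ridge = peak_flops / peak_bandwidth  # below this intensity, an op is memory-bound

cases = {
    "decode matmul (1 token, 4096x4096 weight)": matmul_intensity(1, 4096, 4096),
    "large-batch matmul (4096 tokens, 4096x4096 weight)": matmul_intensity(4096, 4096, 4096),
    "element-wise op": elementwise_intensity(),
}
for name, intensity in cases.items():
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{name}: {intensity:.2f} FLOP/byte -> {bound}")
```

Under these assumptions the single-token matmul and the element-wise op land far below the ridge point, while the large-batch matmul lands well above it.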
Transformers use LayerNorm, not BatchNorm
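LayerNorm normalizes each token across its own feature dimension, so the result does not depend on the batch size, the sequence length, or the other examples in the batch. BatchNorm instead computes statistics across the batch, which become unreliable with variable-length, padded sequences and small batches.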
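The difference comes down to which axes the statistics are computed over. This sketch omits the learned scale and shift and uses training-mode batch statistics for the BatchNorm variant; the tensor sizes are arbitrary.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics over the feature axis of each token.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # Training-mode statistics over batch and sequence positions, per feature.
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(size=(2, 5, 8))       # (batch, seq_len, features)
other = batch.copy()
other[1] = rng.normal(size=(5, 8)) * 10  # change only the second sequence

# Does the first sequence's output depend on what else is in the batch?
ln_same = np.allclose(layer_norm(batch)[0], layer_norm(other)[0])
bn_same = np.allclose(batch_norm(batch)[0], batch_norm(other)[0])
print(ln_same, bn_same)  # True False
```

The first sequence's LayerNorm output is unchanged when its batch neighbor changes, while its BatchNorm output shifts with the batch statistics.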