Transformers use multi-head attention for contextualizing tokens

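In transformers, each token is contextualized through multi-head attention. Each head computes its own attention weights over the input sequence using separately learned query, key, and value projections, so different heads can attend to different positions at the same time. The head outputs are then concatenated and projected to give each token a representation that reflects its context.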

Example

For a sentence like "The cat sat on the mat," each word token (e.g., "cat," "sat," "mat") is contextualized by its relationships with the other words in the sentence. For instance, one head might link "sat" to its subject "cat" while another links it to "mat."
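
The example can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the embeddings and projection matrices are random stand-ins for learned parameters, and the names `multi_head_attention`, `w_q`, `w_k`, `w_v`, `w_o`, as well as the sizes `d_model = 8` and `num_heads = 2`, are assumptions chosen for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model). Returns (output, per-head attention weights)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Scaled dot-product scores: one (seq_len, seq_len) matrix per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    # Weighted sum of values, then merge heads back to (seq_len, d_model).
    context = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return context @ w_o, weights

tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model, num_heads = 8, 2  # illustrative sizes, not from the source
rng = np.random.default_rng(0)
x = rng.normal(size=(len(tokens), d_model))  # random stand-in for token embeddings
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (6, 8): one contextualized vector per token
print(weights.shape)  # (2, 6, 6): one attention matrix per head
```

Row `i` of `weights[h]` shows how much token `i` draws on every other token in head `h`. With random weights the pattern is meaningless; in a trained model the heads tend to specialize.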

Understanding this mechanism helps explain how transformers build context-dependent token representations for language modeling.

Related concepts

Pre-LN transformers are easier to train

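Pre-LN transformers apply LayerNorm inside each residual branch, before the attention or feed-forward sublayer, rather than after the residual addition as in Post-LN. The residual path stays an identity connection, allowing gradients to flow more smoothly during backpropagation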

The embedding layer maps discrete token IDs to dense learned vectors

Embeddings convert token IDs to dense vectors for neural network processing

Masking (behavior)

Causal masking prevents attention to future tokens in the decoder

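Ring attention distributes long sequences across multiple devices

Ring attention splits a long sequence into blocks, one per device. Each device keeps its own query block and passes key/value blocks to its neighbor in a ring, so no single device has to hold the full sequence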

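Most transformer operations are memory-bound, not compute-bound

Many transformer operations, such as element-wise functions, normalization, and softmax, perform few arithmetic operations per byte read or written, so their runtime is limited by memory bandwidth rather than compute. Large matrix multiplications are the main compute-bound exception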

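Transformers use LayerNorm, not BatchNorm

LayerNorm normalizes each token's vector across its own features, so the result does not depend on batch size or sequence length. BatchNorm relies on statistics computed across the batch, which become unreliable with small batches and padded, variable-length sequences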
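
The LayerNorm point can be checked numerically. The sketch below assumes NumPy and omits the learnable gain and bias that real layers include. `layer_norm` and `batch_norm` are simplified illustrative functions, not a library API, and `batch_norm` shows training-mode batch statistics only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector over its own feature dimension (last axis).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # Normalize each feature using statistics pooled over batch and positions.
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
seq_a = rng.normal(size=(1, 4, 8))           # (batch, seq_len, d_model)
seq_b = rng.normal(loc=3.0, size=(1, 4, 8))  # second sequence with shifted statistics
batch = np.concatenate([seq_a, seq_b], axis=0)

# LayerNorm: seq_a's output is identical alone or inside a batch.
ln_same = np.allclose(layer_norm(seq_a)[0], layer_norm(batch)[0])
# BatchNorm: seq_a's output changes once seq_b joins the batch.
bn_same = np.allclose(batch_norm(seq_a)[0], batch_norm(batch)[0])
print(ln_same, bn_same)  # True False
```

Because each token is normalized on its own, LayerNorm behaves the same for any batch composition, which is the property the entry above describes.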
