Transformer encoder processes input bidirectionally, while decoder uses causal (left-to-right) masking

How the Transformer encoder differs from the decoder: the encoder is bidirectional, the decoder is causal

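The Transformer encoder attends bidirectionally: every position can attend to every other position in the input. The decoder applies a causal (left-to-right) mask in its self-attention, so each position attends only to itself and to earlier positions.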

Related concepts

What causal masking does — prevents attention to future tokens in the decoder

Causal masking in transformer models prevents each decoder position from attending to future tokens, preserving the autoregressive property
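
As a minimal sketch of the idea, the snippet below masks a small score matrix before the softmax. The score values and the function name `causal_attention_weights` are made up for illustration; real implementations operate on batched tensors, but the masking rule (position i may only see positions j <= i) is the same.

```python
import math

def softmax(row):
    # Numerically stable softmax; math.exp(-inf) evaluates to 0.0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Set scores for future positions (column j > row i) to -inf, then softmax each row."""
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]

# Hypothetical raw attention scores for a 3-token sequence.
scores = [
    [0.5, 1.2, -0.3],
    [0.1, 0.4, 2.0],
    [1.0, 0.0, 0.7],
]
weights = causal_attention_weights(scores)
```

Every entry above the diagonal of `weights` is zero, so the first token attends only to itself and the last token attends to all three.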

How the attention mechanism in transformer models weights input tokens dynamically during sequence encoding

Attention mechanisms assign dynamic weights to input tokens, enhancing contextual understanding and sequence processing in transformer models
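
One way to picture these dynamic weights is scaled dot-product attention for a single query, sketched here in plain Python. The vectors are hypothetical toy values and the helper name `attention` is an assumption for this example, not a library API.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Softmax turns the scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical 2-dimensional toy vectors.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attention(query, keys, values)
```

The key most similar to the query receives the largest weight, so the weights depend on the input itself rather than being fixed parameters.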

Greedy vs beam search decoding: greedy picks best token, beam maintains k candidates

Greedy decoding selects the single highest-probability token at each step, while beam search keeps the k highest-scoring partial sequences and extends each of them
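
The difference can be sketched on a toy model. The probability table `NEXT` below is invented for illustration (it stands in for a real model's next-token distribution) and is chosen so that the locally best first token leads to a worse overall sequence.

```python
import math

# Hypothetical toy model: next-token probabilities given the prefix so far.
NEXT = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"C": 0.5, "D": 0.5},
    ("B",): {"C": 0.9, "D": 0.1},
}

def greedy(steps):
    seq, logp = (), 0.0
    for _ in range(steps):
        probs = NEXT[seq]
        token = max(probs, key=probs.get)  # best single token at this step
        logp += math.log(probs[token])
        seq += (token,)
    return seq, logp

def beam_search(steps, k):
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (tok,), logp + math.log(p))
            for seq, logp in beams
            for tok, p in NEXT[seq].items()
        ]
        # Keep the k highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0]

greedy_seq, greedy_logp = greedy(2)
beam_seq, beam_logp = beam_search(2, k=2)
```

Here greedy decoding commits to "A" and ends with probability 0.30, while beam search with k=2 keeps "B" alive and finds ("B", "C") with probability 0.36.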

Why most transformer operations are memory-bound, not compute-bound

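Many transformer operations (elementwise activations, softmax, layer normalization, and the matrix-vector products of single-token decoding) perform few arithmetic operations per byte of data moved, so their speed is limited by memory bandwidth rather than by compute. Large batched matrix multiplications are the main exception.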
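
A rough roofline-style estimate illustrates when an operation is memory-bound. The hardware figures below are hypothetical round numbers, and the byte count assumes each operand is read once and the result written once in 2-byte precision, which is a simplification.

```python
def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) product.

    Assumes one read of each input and one write of the output.
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Hypothetical hardware figures, chosen only for illustration.
PEAK_FLOPS = 300e12     # 300 TFLOP/s
PEAK_BANDWIDTH = 2e12   # 2 TB/s
ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOPs per byte needed to saturate compute

decode = matmul_intensity(1, 4096, 4096)      # one token: matrix-vector product
prefill = matmul_intensity(4096, 4096, 4096)  # many tokens: matrix-matrix product
```

Under these assumptions the single-token product does about 1 FLOP per byte, far below the ridge point of 150, so it is memory-bound; the large batched product is well above it and is compute-bound.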

What 300-dim word2vec encodes: trained on word co-occurrence with skip-gram window

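A 300-dimensional word2vec vector encodes a word's co-occurrence patterns: skip-gram training learns it by predicting the words that appear within a context window around that word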
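
To make the skip-gram window concrete, the sketch below generates the (center, context) training pairs for a short sentence. The function name `skipgram_pairs` and the example sentence are illustrative assumptions; the actual training step that turns such pairs into vectors is not shown.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "down"], window=1)
```

With a window of 1, "cat" is paired with "the" and "sat" but not with "down"; widening the window adds more distant co-occurrences.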

What a message queue decouples: producer and consumer can operate at different speeds

Message queues decouple producers and consumers, allowing asynchronous processing
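
The decoupling can be sketched with Python's standard-library `queue.Queue` standing in for a real message broker. The producer and consumer functions, the `None` sentinel, and the sleep that simulates slow processing are all assumptions made for this example.

```python
import queue
import threading
import time

def producer(q, n):
    for i in range(n):
        q.put(i)          # enqueue without waiting for the consumer
    q.put(None)           # sentinel: no more messages

def consumer(q, results):
    while True:
        item = q.get()
        if item is None:
            break
        time.sleep(0.01)  # simulate slower processing
        results.append(item * 2)

q = queue.Queue()
results = []
worker = threading.Thread(target=consumer, args=(q, results))
worker.start()
producer(q, 5)   # finishes quickly even though the consumer is slow
worker.join()
```

The producer returns as soon as its messages are enqueued, and the queue buffers them until the slower consumer catches up.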
