Encoder: attends to all tokens in both directions; Decoder: attends only to the current and earlier tokens
Image: Whispyhistory, CC BY-SA 4.0, via Wikimedia Commons
Masking (behavior)
Causal masking prevents each decoder position from attending to future tokens, typically by setting those attention scores to negative infinity before the softmax
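As a rough illustration of causal masking, the sketch below builds a lower-triangular mask and applies it to a toy score matrix. The helper names `causal_mask` and `masked_softmax` and the uniform scores are assumptions made for this example, not part of any particular library.

```python
import math

def causal_mask(n):
    """n x n matrix: True where position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which disallowed entries receive zero weight."""
    out = []
    for row, allowed in zip(scores, mask):
        # Disallowed scores become -inf, so exp() maps them to exactly 0.
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        m = max(masked)  # the diagonal is always allowed, so m is finite
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy example: uniform raw scores for a 4-token sequence.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
```

With uniform scores, each row of `weights` spreads its attention evenly over the current and earlier positions and assigns zero weight to every later one.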
Greedy vs beam search decoding
Greedy decoding picks the single most probable token at each step; beam search keeps the k highest-scoring partial sequences
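The difference between the two strategies can be sketched on a tiny hand-made model. Everything here is hypothetical: `TOY_MODEL` is an invented table of next-token probabilities, and `greedy_decode` and `beam_search` are illustrative helpers, not a library API. The probabilities are chosen so that the locally best first token leads to a worse overall sequence.

```python
import math

# Hypothetical toy model: next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def next_log_probs(prefix):
    """Log-probabilities of each possible next token after `prefix`."""
    return {tok: math.log(p) for tok, p in TOY_MODEL[tuple(prefix)].items()}

def greedy_decode(steps):
    """At every step, commit to the single most probable next token."""
    seq, score = [], 0.0
    for _ in range(steps):
        lp = next_log_probs(seq)
        tok = max(lp, key=lp.get)
        seq.append(tok)
        score += lp[tok]
    return seq, score

def beam_search(steps, k):
    """Keep the k best partial sequences (by total log-probability) per step."""
    beams = [([], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams[0]

greedy_seq, greedy_score = greedy_decode(2)
beam_seq, beam_score = beam_search(2, k=2)
```

In this toy setup greedy decoding commits to "a" first and ends with a sequence of probability 0.33, while a beam of width 2 also keeps "b" alive and finds the sequence "b", "x" with probability 0.36.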
BPE tokenization
BPE iteratively merges the most frequent pair of adjacent bytes (or symbols) into a single new token
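A minimal sketch of the merge loop follows. It is a simplification under stated assumptions: it works on characters rather than raw bytes, the word list is made up, the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) are hypothetical, and ties are broken by whichever pair was counted first.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties go to the pair that was counted first.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    words = Counter(tuple(w) for w in corpus)  # each word as a tuple of chars
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = learn_bpe(corpus, num_merges=2)
```

On this toy corpus the first two learned merges are ("e", "s") and then ("es", "t"), so "newest" ends up segmented as n, e, w, est.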
[CLS] pooling
[CLS] pooling uses the embedding of the first token, [CLS], as the sentence representation
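In code, this kind of pooling amounts to indexing the first position of each sequence. The sketch below assumes the encoder output is a nested list shaped batch x tokens x hidden size with [CLS] at index 0; the function name `cls_pool` and the numbers are illustrative only.

```python
def cls_pool(batch):
    """Return the first token's vector of every sequence in the batch.

    batch: list of sequences, each a list of per-token embedding vectors,
    with the [CLS] token assumed to sit at index 0.
    """
    return [sequence[0] for sequence in batch]

# Made-up encoder outputs for two sequences of different lengths.
batch = [
    [[0.1, 0.2], [0.5, 0.5], [0.9, 0.1]],
    [[0.7, 0.3], [0.2, 0.8]],
]
sentence_vectors = cls_pool(batch)
```

Each sentence is reduced to a single vector regardless of its length, because only the first position is read.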
Large language model
LLMs can generate, summarize, translate, and analyze text in many contexts