[CLS] marks the start of the input, [SEP] separates segments, [PAD] pads sequences to a common length, and [MASK] hides tokens for the model to predict


The tokenizer's special tokens [CLS], [SEP], [PAD], and [MASK] each have a specific role in the input sequence
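As a rough illustration of those four roles, the sketch below assembles a BERT-style input for a sentence pair. The vocabulary, the ids, and the helper name `encode_pair` are all hypothetical and chosen for this example; real tokenizers ship their own vocabularies and ids.

```python
# Hypothetical toy vocabulary; real tokenizers define their own ids.
VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3,
         "the": 4, "cat": 5, "sat": 6, "down": 7}

def encode_pair(first, second, max_len):
    """Build [CLS] first [SEP] second [SEP], then pad with [PAD] up to max_len."""
    tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len")
    n_pad = max_len - len(tokens)
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(tokens) + [0] * n_pad
    tokens = tokens + ["[PAD]"] * n_pad
    ids = [VOCAB[t] for t in tokens]
    return tokens, ids, attention_mask

# [MASK] stands in for a hidden word the model would be asked to predict.
tokens, ids, attention_mask = encode_pair(["the", "cat"], ["[MASK]", "down"], max_len=8)
print(tokens)
print(ids)
print(attention_mask)
```

The attention mask is included because padding only helps if the model is also told which positions to ignore.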

Related concepts

Subword tokenization handles rare words by breaking them into known pieces
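One way to picture "breaking into known pieces" is a greedy longest-match split. The sketch below is a minimal version that assumes a WordPiece-style `##` prefix on continuation pieces; the vocabulary and the function name `split_into_subwords` are made up for illustration.

```python
def split_into_subwords(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; pieces after the first carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No known piece covers this position, so the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary: "tokenization" is absent, but its parts are known.
toy_vocab = {"token", "##ization", "##ize", "un", "##known"}
pieces = split_into_subwords("tokenization", toy_vocab)
print(pieces)
```

A word never seen in training can still be represented, provided its pieces are in the vocabulary.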

Masking (behavior)

Causal masking prevents attention to future tokens in the decoder
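A minimal pure-Python sketch of the idea, assuming attention scores arrive as a square list of lists: position i may only attend to positions j <= i, so masked entries get zero weight after the softmax. The function names are illustrative and do not come from any particular library.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which disallowed (future) positions get zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)  # never zero: the diagonal is always allowed
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw attention scores for a 3-token sequence.
scores = [[0.0, 1.0, 2.0],
          [0.5, 0.5, 0.5],
          [1.0, 0.0, 1.0]]
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print(row)
```

The first token can only attend to itself, and the upper triangle of the weight matrix is all zeros.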

Unigram tokenization starts with a large vocabulary and prunes it using expectation-maximization (EM)

[CLS] pooling uses the first token's embedding as the sentence representation
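In code this amounts to indexing position 0 of each sequence's hidden states. The sketch below assumes encoder output shaped as a nested list (batch, tokens, hidden size) with [CLS] at position 0; the numbers are invented for illustration.

```python
def cls_pool(batch_hidden_states):
    """Take the vector at position 0 of every sequence in the batch."""
    return [sequence[0] for sequence in batch_hidden_states]

# Hypothetical encoder output: 2 sequences, 3 tokens each, hidden size 2.
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [0.0, 0.0]],
]
sentence_vectors = cls_pool(hidden)
print(sentence_vectors)
```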

WordPiece tokenization is similar to BPE, but it chooses subword merges by likelihood rather than by raw frequency
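The difference in merge selection can be sketched on a toy corpus. The example assumes the commonly described WordPiece score, count(pair) / (count(first) * count(second)), and ignores `##` prefixes for brevity; the corpus and the function names are hypothetical.

```python
from collections import Counter

def pair_statistics(corpus):
    """Count symbols and adjacent symbol pairs over pre-split words with frequencies."""
    symbol_counts, pair_counts = Counter(), Counter()
    for symbols, freq in corpus:
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return symbol_counts, pair_counts

def best_bpe_pair(pair_counts):
    """BPE-style choice: the most frequent adjacent pair."""
    return max(pair_counts, key=pair_counts.get)

def best_wordpiece_pair(symbol_counts, pair_counts):
    """WordPiece-style choice (assumed score): pair count normalized by its parts."""
    def score(pair):
        a, b = pair
        return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])
    return max(pair_counts, key=score)

# Hypothetical corpus of (symbols, word frequency).
corpus = [(["a", "b"], 10), (["a", "c"], 8), (["x", "y"], 3)]
symbol_counts, pair_counts = pair_statistics(corpus)
print(best_bpe_pair(pair_counts))
print(best_wordpiece_pair(symbol_counts, pair_counts))
```

Under this scoring, a pair whose parts rarely occur apart can beat a pair that is simply frequent, which is where the two methods diverge.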

Large language model

LLMs can generate, summarize, translate, and analyze text in many contexts
