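What SentencePiece does differently from BPE: operates on raw text, including whitespace

SentencePiece tokenizes raw text directly, with no separate pre-tokenization step. Whitespace is kept as an ordinary symbol (written as "▁"), so the original text can be reconstructed from the tokens.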
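A minimal sketch of the whitespace idea, in plain Python rather than the SentencePiece library. The helper names `to_symbols` and `from_symbols` are hypothetical, and the sketch ignores text normalization and the subword model itself; it only shows why keeping whitespace as a symbol makes the round trip lossless.

```python
# Illustrative only: this is not the SentencePiece API.
# "\u2581" is the visible marker that stands in for a space.
WS = "\u2581"

def to_symbols(text):
    """Replace each space with the marker, then split into single symbols."""
    return list(text.replace(" ", WS))

def from_symbols(symbols):
    """Invert to_symbols: join the symbols and turn the marker back into spaces."""
    return "".join(symbols).replace(WS, " ")

text = "hello world again"
symbols = to_symbols(text)        # no word splitting happened beforehand
restored = from_symbols(symbols)  # spaces are recovered from the marker
```

Because the spaces survive as symbols, nothing outside the token sequence is needed to rebuild the input.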

Related concepts

What WordPiece tokenization does: similar to BPE, but merges by likelihood instead of frequency

WordPiece builds its subword vocabulary much like BPE, but it chooses each merge by how much it increases the likelihood of the training data rather than by raw pair frequency.
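The difference between the two selection rules can be sketched on a toy corpus. The scoring rule below, count(pair) / (count(first) * count(second)), is the way the likelihood criterion is commonly described; treat it as an assumption rather than a reference implementation. The corpus and variable names are made up for illustration.

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols mapped to its frequency.
word_freqs = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

pair_counts = Counter()
symbol_counts = Counter()
for symbols, freq in word_freqs.items():
    for s in symbols:
        symbol_counts[s] += freq
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += freq

# BPE-style choice: the most frequent adjacent pair.
best_by_frequency = max(pair_counts, key=pair_counts.get)

def likelihood_score(pair):
    """Assumed WordPiece-style score: pair count normalized by its parts."""
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

# WordPiece-style choice: the pair with the highest normalized score.
best_by_score = max(pair_counts, key=likelihood_score)
```

On this corpus the frequency rule picks a pair built from very common symbols, while the score favors a rarer pair whose parts almost always occur together.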

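What BPE tokenization does: iteratively merges the most frequent adjacent byte pairs

BPE starts from individual bytes or characters and repeatedly merges the most frequent adjacent pair into a new token, until the vocabulary reaches a target size.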
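The merge loop can be sketched in a few lines. This is a minimal illustration that works on characters rather than raw bytes and assumes the corpus is already given as word frequencies; the function names and the toy corpus are hypothetical.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count every adjacent symbol pair, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in word_freqs.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(word_freqs, pair):
    """Rewrite every word so each occurrence of `pair` becomes one symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges; ties go to the pair seen first."""
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(word_freqs)
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        word_freqs = merge_pair(word_freqs, best)
    return merges, word_freqs

corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges, segmented = train_bpe(corpus, 3)
```

The learned `merges` list is what a tokenizer would later replay, in order, to segment new text.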

What [CLS] pooling does: uses the first token's embedding as the sentence representation

[CLS] pooling takes the final-layer embedding of the first token, [CLS], as the representation of the whole sentence.
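As a rough sketch, pooling reduces to indexing position 0 of each sequence. The example uses nested Python lists shaped (batch, sequence length, hidden size) in place of a real model's output tensor; the values and the name `cls_pool` are made up for illustration.

```python
def cls_pool(hidden_states):
    """Return the position-0 vector of every sequence in the batch."""
    return [sequence[0] for sequence in hidden_states]

# Two sequences, three tokens each, hidden size 2. Position 0 holds [CLS].
hidden_states = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]

sentence_vectors = cls_pool(hidden_states)
```

The other token embeddings are simply discarded, which is what distinguishes this from mean or max pooling over all positions.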

What subword tokenization solves: handles rare words by breaking them into known pieces

Subword tokenization handles rare and unseen words by breaking them into smaller pieces that are already in the vocabulary.
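One way to picture this is a greedy longest-match split against a fixed vocabulary. The sketch below is a simplified illustration: the vocabulary is invented, the function name is hypothetical, and it omits continuation markers such as "##" that some tokenizers attach to non-initial pieces.

```python
def split_into_subwords(word, vocab, unk="[UNK]"):
    """Greedily take the longest vocabulary piece at each position."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece starts here, so the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"token", "ization", "un", "believ", "able"}
rare_a = split_into_subwords("tokenization", vocab)
rare_b = split_into_subwords("unbelievable", vocab)
unknown = split_into_subwords("xyz", vocab)
```

Neither full word is in the vocabulary, yet both can be represented with pieces that are; only a word with no matching pieces falls back to the unknown token.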

What the tokenizer's special tokens do: [CLS], [SEP], [PAD], and [MASK] each have a specific role

[CLS] marks the start of the input, [SEP] separates segments, [PAD] fills sequences out to a common length, and [MASK] hides tokens for the model to predict.
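A small sketch of how these tokens might be laid out for a BERT-style sentence pair. The function `build_input`, the `max_len` value, and the example sentences are assumptions for illustration, not any specific library's API.

```python
CLS, SEP, PAD, MASK = "[CLS]", "[SEP]", "[PAD]", "[MASK]"

def build_input(tokens_a, tokens_b=None, max_len=12):
    """Wrap one or two segments in special tokens and pad to max_len."""
    tokens = [CLS] + list(tokens_a) + [SEP]
    if tokens_b:
        tokens += list(tokens_b) + [SEP]
    if len(tokens) > max_len:
        raise ValueError("input is longer than max_len")
    num_pad = max_len - len(tokens)
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [1] * len(tokens) + [0] * num_pad
    return tokens + [PAD] * num_pad, attention_mask

# The third word of the first segment is hidden for the model to predict.
tokens, attention_mask = build_input(["the", "cat", MASK], ["on", "the", "mat"])
```

The attention mask is included because padding only works if the model is told which positions to ignore.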
