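SentencePiece tokenizes raw text directly, without a separate pre-tokenization step, and preserves whitespace by encoding each space as the meta symbol ▁ (U+2581), so the original text can be recovered from the tokens.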
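The whitespace-preserving idea can be illustrated with a minimal sketch. The helper names `escape_whitespace` and `restore_whitespace` are hypothetical and are not part of the sentencepiece library; the split into pieces is arbitrary and only stands in for a learned segmentation.

```python
# Hypothetical helpers illustrating the whitespace marker idea.
# This is NOT the sentencepiece API, only a sketch of the principle.
WS = "\u2581"  # the meta symbol that stands in for a space


def escape_whitespace(text):
    """Replace each space with the visible meta symbol."""
    return text.replace(" ", WS)


def restore_whitespace(pieces):
    """Join the pieces and turn the meta symbol back into spaces."""
    return "".join(pieces).replace(WS, " ")


escaped = escape_whitespace("Hello world.")
# Arbitrary split point (assumption); a trained model would choose the pieces.
pieces = [escaped[:5], escaped[5:]]
restored = restore_whitespace(pieces)
```

Because the space survives inside the pieces, detokenization is a plain concatenation followed by one replacement, with no language-specific rules.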
WordPiece tokenization builds its subword vocabulary much like BPE, but instead of merging the most frequent pair it merges the pair that most increases the likelihood of the training data.
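One commonly described way to approximate the likelihood criterion is to score each pair as count(pair) / (count(first) × count(second)). The sketch below assumes that scoring rule and a toy corpus. The function `wordpiece_scores` is hypothetical, and the `##` continuation prefix used by real WordPiece vocabularies is omitted for brevity.

```python
from collections import Counter


def wordpiece_scores(words):
    """Return (scores, pair_counts) for a {symbol_tuple: frequency} corpus.

    Assumed scoring rule: count(pair) / (count(first) * count(second)).
    """
    unit_counts, pair_counts = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            unit_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    scores = {
        (a, b): count / (unit_counts[a] * unit_counts[b])
        for (a, b), count in pair_counts.items()
    }
    return scores, pair_counts


# Toy corpus (made up for illustration): word split into symbols -> frequency.
toy = {
    ("h", "u", "g"): 10,
    ("p", "u", "n"): 5,
    ("b", "u", "n"): 12,
    ("h", "u", "g", "s"): 5,
}
scores, pair_counts = wordpiece_scores(toy)
best_by_score = max(scores, key=scores.get)
best_by_frequency = max(pair_counts, key=pair_counts.get)
```

On this toy corpus the two criteria disagree: the most frequent pair contains the very common symbol `u`, while the score favours a pair whose parts rarely occur apart.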
BPE (byte-pair encoding) tokenization builds its vocabulary by iteratively merging the most frequent pair of adjacent symbols (characters or bytes) into a new token, repeating until the target vocabulary size is reached.
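The merge loop can be sketched as follows. This is a minimal illustration on a made-up word-frequency table, not a production tokenizer; the function names are hypothetical, and ties are broken by whichever pair was counted first.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} table."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words


# Toy corpus (made up for illustration).
toy = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy, 3)
```

Starting from single characters keeps the example readable; a byte-level variant would start from the UTF-8 bytes of each word instead.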
[CLS] pooling uses the final embedding of the first token, [CLS], as the representation of the whole sentence.
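As a minimal sketch, assuming the encoder output is a batch of sequences where each sequence is a list of per-token vectors and position 0 holds [CLS], the pooling step is just an index. The function `cls_pool` and the toy numbers are hypothetical.

```python
def cls_pool(hidden_states):
    """Return the first token's vector for each sequence in the batch.

    hidden_states: list of sequences, each a list of per-token vectors.
    Assumes position 0 of every sequence is the [CLS] token.
    """
    return [sequence[0] for sequence in hidden_states]


# Toy encoder output: 2 sequences, 2-dimensional vectors (made-up values).
batch = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],
]
sentence_vectors = cls_pool(batch)
```

With an array library the same operation is typically a slice over the sequence axis, for example `hidden_states[:, 0]`.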
Subword tokenization solves the rare-word problem: a word that is missing from the vocabulary is broken into smaller pieces that are in it.
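One simple way to break a word into known pieces is greedy longest-match-first segmentation, sketched below. The vocabulary is a made-up toy, the function `segment` is hypothetical, and real tokenizers differ in details such as continuation markers and fallback handling.

```python
def segment(word, vocab, unk="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`.

    Returns [unk] if some position cannot be matched by any piece.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return [unk]  # no known piece starts at this position
        pieces.append(word[start:end])
        start = end
    return pieces


# Toy vocabulary (assumption, for illustration only).
vocab = {"token", "iz", "ation", "un", "believ", "able"}
rare_word = segment("tokenization", vocab)
other_word = segment("unbelievable", vocab)
unknown = segment("xyz", vocab)
```

Neither full word is in the vocabulary, yet both are still represented by pieces the model has seen; only the string with no matching pieces falls back to the unknown token.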
The tokenizer's special tokens each have a specific role: [CLS] marks the start of the input, [SEP] separates segments, [PAD] fills shorter sequences to a common length, and [MASK] hides tokens for the model to predict.
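The roles of [CLS], [SEP], and [PAD] can be sketched by assembling a BERT-style input for a sentence pair. The helper `build_input` is hypothetical, and the attention mask is an assumption added here to show how padding is usually ignored; [MASK] is left out because it is applied during masked-language-model training rather than input assembly.

```python
def build_input(tokens_a, tokens_b, max_len):
    """Assemble a BERT-style token sequence and attention mask.

    Hypothetical helper: [CLS] starts the input, [SEP] closes each
    segment, and [PAD] fills the remainder up to max_len.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("input is longer than max_len")
    padding = max_len - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * padding  # 0 marks padding
    return tokens + ["[PAD]"] * padding, attention_mask


tokens, attention_mask = build_input(["hello"], ["world"], max_len=7)
```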