BPE tokenizes text by iteratively merging the most frequent byte pairs
BPE (byte-pair encoding) tokenization: iteratively merges the most frequent pair of adjacent bytes or symbols into a new token until the vocabulary reaches a target size.
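The merge loop can be sketched in a few lines. This is a minimal toy version that works on characters rather than raw bytes; the function name `learn_bpe` and the word counts are made up for illustration, and ties are broken by whichever pair is seen first.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a dict mapping word -> count (toy, character-level)."""
    # Start with every word split into single characters.
    vocab = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (first seen wins ties)
        merges.append(best)
        # Rewrite every word, replacing each occurrence of the best pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)
```

On this toy corpus the shared suffix of "newest" and "widest" is merged first, which is why frequent endings tend to become single tokens early.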
WordPiece tokenization: similar to BPE, but chooses each merge by how much it increases the likelihood of the training data rather than by raw pair frequency.
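One commonly described way to approximate the likelihood criterion is to score a pair as its count divided by the product of its parts' counts. The sketch below assumes that formulation; `wordpiece_scores` and the toy counts are hypothetical, not taken from any particular library.

```python
from collections import Counter

def wordpiece_scores(words):
    """Return (pair_counts, scores) for a dict mapping symbol tuples -> count."""
    unit, pair = Counter(), Counter()
    for symbols, count in words.items():
        for s in symbols:
            unit[s] += count
        for a, b in zip(symbols, symbols[1:]):
            pair[(a, b)] += count
    # Assumed score: count(ab) / (count(a) * count(b)).
    scores = {p: f / (unit[p[0]] * unit[p[1]]) for p, f in pair.items()}
    return pair, scores

words = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}
pair_counts, scores = wordpiece_scores(words)
most_frequent = max(pair_counts, key=pair_counts.get)  # what BPE would merge
best_scored = max(scores, key=scores.get)              # what this score prefers
print(most_frequent, best_scored)
```

Here the most frequent pair and the best-scored pair differ: the score favors pairs whose parts rarely occur apart, even if the pair itself is less common.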
SentencePiece: unlike conventional BPE pipelines, it operates on raw text without pre-tokenization, treating whitespace as an ordinary symbol so the original text can be restored exactly.
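The whitespace handling can be illustrated without the library itself. This toy sketch only mimics the idea of marking spaces with the "▁" meta symbol; the helper names are hypothetical, and the one-character-per-piece split stands in for a learned model.

```python
MARK = "\u2581"  # "▁", the meta symbol SentencePiece uses to represent a space

def to_pieces(text):
    # Toy segmentation: one piece per character. A trained model would
    # produce longer pieces, but spaces would still be kept as MARK.
    return list(text.replace(" ", MARK))

def from_pieces(pieces):
    # Because spaces were never dropped, joining the pieces restores the text.
    return "".join(pieces).replace(MARK, " ")

text = "Hello  world"  # note the double space
pieces = to_pieces(text)
print(pieces)
print(from_pieces(pieces) == text)
```

Because whitespace is encoded rather than discarded, decoding does not need language-specific rules to guess where spaces belong.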
Subword tokenization: handles rare and out-of-vocabulary words by breaking them into smaller pieces that are already in the vocabulary.
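A simple way to see this is greedy longest-match segmentation against a fixed vocabulary. The sketch below is one possible strategy, not the only one; the vocabulary, the `segment` helper, and the `[UNK]` fallback are assumptions made for the example.

```python
def segment(word, vocab, unk="[UNK]"):
    """Greedily split word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is a known vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return [unk]  # no known piece starts here
        pieces.append(word[start:end])
        start = end
    return pieces

vocab = {"token", "iz", "ation", "un", "believ", "able"}
print(segment("tokenization", vocab))
print(segment("unbelievable", vocab))
```

Neither full word is in the vocabulary, yet both are represented entirely by known pieces instead of a single unknown token.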
[CLS] pooling: uses the final embedding of the first token, [CLS], as the representation of the whole sentence.
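In code this amounts to selecting position 0 along the sequence axis. The sketch below uses plain nested lists with an assumed layout of [batch][sequence][hidden]; `cls_pool` and the numbers are illustrative only.

```python
def cls_pool(hidden_states):
    """Pick the first token's vector from each sequence.

    hidden_states is assumed to be laid out as [batch][seq_len][hidden].
    """
    return [sequence[0] for sequence in hidden_states]

# Two sequences, three tokens each, hidden size 2 (made-up values).
hidden_states = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]
sentence_vectors = cls_pool(hidden_states)
print(sentence_vectors)
```

With a tensor library the same operation would typically be a slice on the sequence dimension.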
Unigram tokenization: starts with a large candidate vocabulary and iteratively prunes it, using EM to estimate token probabilities and dropping the tokens that contribute least to the corpus likelihood.
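The pruning idea can be sketched with a simplified, "hard" variant that scores each word by its single best segmentation instead of running full EM. The token probabilities, the function names, and the lack of renormalization after removal are all assumptions of this toy example.

```python
import math

def viterbi(word, logp):
    """Best segmentation of word under unigram log-probs: (score, pieces)."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start][1] is not None:
                score = best[start][0] + logp[piece]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1]

def removal_loss(token, words, logp):
    """Drop in corpus log-likelihood if token is removed (no renormalization)."""
    reduced = {t: p for t, p in logp.items() if t != token}
    return sum(
        count * (viterbi(w, logp)[0] - viterbi(w, reduced)[0])
        for w, count in words.items()
    )

# Made-up probabilities for a tiny candidate vocabulary.
probs = {"h": 0.1, "u": 0.1, "g": 0.1, "hu": 0.2, "ug": 0.3, "hug": 0.2}
logp = {t: math.log(p) for t, p in probs.items()}
words = {"hug": 10, "ug": 5}

print(viterbi("hug", logp)[1])
print(removal_loss("hu", words, logp), removal_loss("hug", words, logp))
```

In this toy setup removing "hu" costs nothing because no best segmentation uses it, so it would be a natural first candidate for pruning, while removing "hug" forces a less likely split.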