
WordPiece tokenization builds its subword vocabulary by choosing the merges that most increase the likelihood of the training data, rather than merging the most frequent pair as BPE does
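To make the likelihood-versus-frequency distinction concrete, here is a minimal sketch. It assumes the commonly described WordPiece scoring rule, score(a, b) = count(ab) / (count(a) * count(b)), and a made-up toy corpus; the function name `pair_statistics` is hypothetical, and the `##` continuation prefix used by real WordPiece vocabularies is left out for brevity.

```python
from collections import Counter

def pair_statistics(word_freqs):
    """Return (pair_counts, scores) for adjacent symbol pairs.

    word_freqs maps a tuple of symbols to the word's corpus count.
    The score is the assumed WordPiece-style ratio
    count(ab) / (count(a) * count(b)).
    """
    symbol_counts = Counter()
    pair_counts = Counter()
    for symbols, freq in word_freqs.items():
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    scores = {
        (a, b): count / (symbol_counts[a] * symbol_counts[b])
        for (a, b), count in pair_counts.items()
    }
    return pair_counts, scores

# Toy corpus (invented for illustration): word -> count
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
word_freqs = {tuple(word): count for word, count in corpus.items()}

pair_counts, scores = pair_statistics(word_freqs)
most_frequent = max(pair_counts, key=pair_counts.get)  # what BPE would merge
best_scored = max(scores, key=scores.get)              # what this score prefers

print(most_frequent)  # ('u', 'g')
print(best_scored)    # ('g', 's')
```

The pair ("u", "g") is the most frequent, but "u" appears in every word, so the ratio penalizes it; ("g", "s") wins because "s" almost never appears without a preceding "g".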
Subword tokenization handles rare and unseen words by breaking them into smaller pieces that are already in the vocabulary
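As an illustration of breaking an unfamiliar word into known pieces, the sketch below uses greedy longest-match-first segmentation. This is one simple strategy, not the method of any particular library; the vocabulary, the `segment` function, and the `[UNK]` fallback token are all assumptions made for the example.

```python
def segment(word, vocab, unk="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`.

    Returns [unk] if some position cannot be covered by any piece.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return [unk]  # no known piece starts here
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical vocabulary that does not contain "tokenization" as a whole word.
vocab = {"token", "iz", "ation", "un", "s"}

print(segment("tokenization", vocab))  # ['token', 'iz', 'ation']
print(segment("xyz", vocab))           # ['[UNK]']
```

A vocabulary that also includes every single character (or byte) would never need the `[UNK]` fallback, which is why many subword vocabularies keep those base symbols.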
Unlike standard BPE pipelines, SentencePiece operates directly on raw text with no pre-tokenization step, treating whitespace as an ordinary symbol so the original text can be recovered exactly
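The whitespace handling can be sketched without the library itself. SentencePiece represents a space with the visible marker character U+2581; the helper names below are hypothetical, the fixed-width split stands in for a trained model, and the library's normalization step and other options are deliberately left out.

```python
MARK = "\u2581"  # the marker character SentencePiece uses in place of a space

def mark_whitespace(text):
    """Replace spaces with the marker so they become ordinary symbols."""
    return text.replace(" ", MARK)

def detokenize(pieces):
    """Concatenate pieces and restore spaces: no language-specific rules needed."""
    return "".join(pieces).replace(MARK, " ")

text = "Hello world again"
marked = mark_whitespace(text)

# Stand-in for a trained model: cut the marked string into arbitrary chunks.
pieces = [marked[i:i + 4] for i in range(0, len(marked), 4)]

restored = detokenize(pieces)
print(restored == text)  # True
```

Because the spaces travel inside the pieces, any segmentation of the marked string decodes back to the same text, which is the property that pre-tokenizing on whitespace gives up.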
BPE (byte-pair encoding) tokenization builds its vocabulary by iteratively merging the most frequent pair of adjacent symbols (bytes or characters)
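The merge loop can be sketched in a few lines. This is a minimal illustration on a made-up toy corpus, working on characters rather than raw bytes (a byte-level variant would start from the UTF-8 bytes of each word instead); the function names are hypothetical.

```python
from collections import Counter

def count_pairs(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges; return (merges, segmented word counts)."""
    word_freqs = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(word_freqs)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        updated = {}
        for symbols, freq in word_freqs.items():
            new_symbols = merge_pair(symbols, best)
            updated[new_symbols] = updated.get(new_symbols, 0) + freq
        word_freqs = updated
    return merges, word_freqs

# Toy corpus (invented for illustration): word -> count
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, segmented = learn_bpe(corpus, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

Applying the learned merges in the same order to new text reproduces the tokenizer's segmentation; the number of merges is what sets the final vocabulary size.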
Vocabulary size is a trade-off: a larger vocabulary yields shorter token sequences but adds parameters to the embedding and output layers
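The parameter side of that trade-off is simple arithmetic: an embedding table has one row of width `d_model` per vocabulary entry. The sketch below assumes a hypothetical hidden size of 768 and two illustrative vocabulary sizes; `embedding_params` is a made-up helper, and the sequence-length side of the trade-off depends on the tokenizer and data, so it is not modeled here.

```python
def embedding_params(vocab_size, d_model, tied_output=True):
    """Parameters in the token embedding table.

    If the output projection is not tied to the input embeddings,
    a second table of the same shape is counted.
    """
    table = vocab_size * d_model
    return table if tied_output else 2 * table

D_MODEL = 768  # assumed hidden size, for illustration only

small = embedding_params(32_000, D_MODEL)
large = embedding_params(128_000, D_MODEL)

print(small)          # 24576000
print(large)          # 98304000
print(large - small)  # 73728000 extra parameters from the larger vocabulary
```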
Unigram tokenization starts with a large candidate vocabulary and iteratively prunes it, using EM to estimate token probabilities and dropping the tokens whose removal least hurts the corpus likelihood
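The full EM-and-prune training loop is too long for a short example, but the step it relies on, finding the most probable segmentation of a word under the current token probabilities, can be sketched with a Viterbi-style dynamic program. The probabilities below are invented for illustration and `viterbi_segment` is a hypothetical name; this is not a complete Unigram trainer.

```python
import math

def viterbi_segment(text, logp):
    """Return the segmentation of `text` with the highest total log-probability.

    `logp` maps each known piece to its log-probability.
    Returns None if `text` cannot be covered by known pieces.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        return None
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

# Invented token probabilities (they sum to 1.0).
probs = {"h": 0.1, "u": 0.1, "g": 0.1, "hu": 0.15, "ug": 0.2, "hug": 0.05, "s": 0.3}
logp = {piece: math.log(p) for piece, p in probs.items()}

print(viterbi_segment("hugs", logp))  # ['hug', 's']
```

During training, segmentations like this one feed the EM update of the token probabilities; pruning then removes the candidates that contribute least, and the process repeats until the vocabulary reaches the target size.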