WordPiece tokenization splits words into subwords based on token likelihood rather than frequency

Image: Re-cropped derivative work: Burn t (talk) Burroughs1983_cropped.jpg: Chuck Patch, CC BY-SA 2.0, via Wikimedia Commons

What WordPiece tokenization does: similar to BPE, but picks merges by likelihood instead of frequency

WordPiece builds its vocabulary like BPE, but at each step it merges the pair that most increases the likelihood of the training data, not the pair that occurs most often
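A minimal sketch of that difference, assuming the commonly described WordPiece score count(ab) / (count(a) * count(b)) as a stand-in for likelihood gain. The toy corpus and the function name are hypothetical, and the `##` continuation prefix that WordPiece vocabularies usually carry is left out for brevity.

```python
from collections import Counter

def best_pairs(words):
    """Return (best pair by WordPiece-style score, best pair by raw frequency)."""
    unit_counts = Counter()
    pair_counts = Counter()
    for symbols, freq in words.items():
        for s in symbols:
            unit_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    # Assumed score: pair count normalised by the counts of its two parts.
    scores = {
        (a, b): c / (unit_counts[a] * unit_counts[b])
        for (a, b), c in pair_counts.items()
    }
    by_score = max(scores, key=scores.get)
    by_freq = max(pair_counts, key=pair_counts.get)
    return by_score, by_freq

# Hypothetical toy corpus: word as a tuple of symbols -> frequency
corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}
by_score, by_freq = best_pairs(corpus)
print(by_score, by_freq)
```

On this corpus the frequency criterion picks ("u", "g"), the most common pair, while the score picks ("g", "s"), because "s" almost never appears without "g" before it.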

Related concepts

What subword tokenization solves: handles rare words by breaking them into known pieces

Subword tokenization handles rare and unseen words by breaking them into pieces that are already in the vocabulary
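As an illustration, here is a minimal sketch of greedy longest-match segmentation, one possible way to split a word into known pieces. The vocabulary, the `[UNK]` fallback, and the function name are assumptions for the example, not something the card specifies.

```python
def segment(word, vocab, unk="[UNK]"):
    """Greedily take the longest vocabulary piece at each position."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: give up on the whole word.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical vocabulary of subword pieces
vocab = {"token", "iz", "ation", "un", "believ", "able"}
print(segment("tokenization", vocab))
print(segment("unbelievable", vocab))
print(segment("xyz", vocab))
```

Neither "tokenization" nor "unbelievable" is in the vocabulary as a whole word, yet both are covered by pieces that are.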

What SentencePiece does differently from BPE: operates on raw text, including whitespace

SentencePiece tokenizes raw text without a pre-tokenization step and treats whitespace as an ordinary symbol (shown as ▁), so the original text can be restored exactly
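A small sketch of the whitespace idea only, written in plain Python rather than with the SentencePiece library. The fixed-width split stands in for a learned segmentation and is purely hypothetical; the real library also applies normalization and other defaults that this sketch ignores.

```python
MARK = "\u2581"  # the "▁" symbol used to make spaces visible

def mark_whitespace(text):
    """Replace each space with a visible marker so it becomes a normal symbol."""
    return text.replace(" ", MARK)

def detokenize(pieces):
    """Join pieces and turn markers back into spaces."""
    return "".join(pieces).replace(MARK, " ")

text = "Hello  world !"  # note the double space
marked = mark_whitespace(text)
# Hypothetical segmentation: any split of the marked string works.
pieces = [marked[i:i + 3] for i in range(0, len(marked), 3)]
print(pieces)
print(detokenize(pieces) == text)
```

Because spaces are part of the symbol stream, the round trip is lossless no matter where the piece boundaries fall, including the double space.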

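What BPE tokenization does: iteratively merges the most frequent adjacent byte pairs

BPE starts from single symbols (bytes or characters) and repeatedly merges the most frequent adjacent pair into a new token until the target vocabulary size is reached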
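The merge loop can be sketched as follows. This is a minimal illustration under stated assumptions: it works on characters rather than bytes for readability, the toy corpus is hypothetical, and ties are broken by whichever pair was seen first.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(words, num_merges):
    """Return the learned merge list and the corpus after applying it."""
    merges = []
    for _ in range(num_merges):
        if all(len(symbols) < 2 for symbols in words):
            break  # nothing left to merge
        pair = most_frequent_pair(words)
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Hypothetical toy corpus: word split into characters -> frequency
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges, words = learn_bpe(corpus, 3)
print(merges)
print(words)
```

After three merges the frequent suffix "est" has become a single token, which is the behaviour the card describes: common sequences collapse into one symbol while rare words stay split.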

Why the vocabulary size matters: larger vocab = shorter sequences but more parameters

A larger vocabulary shortens token sequences but enlarges the embedding and output layers, which adds parameters
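A back-of-the-envelope sketch of the parameter side of that trade-off. The vocabulary sizes and the hidden size of 768 are hypothetical, and the `tied` flag assumes the common choice of sharing weights between the input embedding and the output projection.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters in the token embedding table (and output layer if untied)."""
    table = vocab_size * d_model
    return table if tied else 2 * table

small = embedding_params(32_000, 768)
large = embedding_params(128_000, 768)
print(small, large, large - small)
```

With these assumed numbers, quadrupling the vocabulary adds about 74 million parameters to the embedding table alone; whether the shorter sequences are worth it depends on the model and the data.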

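What Unigram tokenization does: starts with a large vocabulary and prunes it using EM

Unigram tokenization starts with a large candidate vocabulary, fits token probabilities with expectation-maximization (EM), and repeatedly prunes the tokens whose removal hurts the corpus likelihood least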
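The sketch below covers only one ingredient of that procedure: finding the most probable segmentation of a word under a fixed unigram model, using Viterbi-style dynamic programming. It does not implement the EM updates or the pruning step, and the token probabilities are made up for illustration.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the highest-probability split of `text` into known tokens."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return None  # no segmentation with the given vocabulary
    pieces = []
    end = n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Hypothetical token probabilities
probs = {"h": 0.1, "u": 0.1, "g": 0.1, "s": 0.1, "hu": 0.15, "ug": 0.2, "hug": 0.25}
log_probs = {tok: math.log(p) for tok, p in probs.items()}
print(viterbi_segment("hugs", log_probs))
```

Here "hug" + "s" wins because its probability product (0.25 × 0.1) beats every other split. In full Unigram training, segmentation probabilities like these feed the EM re-estimation and the likelihood comparison used for pruning.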
