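What BPE tokenization does: iteratively merges the most frequent adjacent pair of symbols

BPE (byte-pair encoding) builds its vocabulary by starting from individual bytes or characters and repeatedly merging the most frequent adjacent pair into a new token, until the target vocabulary size is reached.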
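The merge loop can be sketched in a few lines of Python. This is a toy illustration rather than any library's implementation: the names `merge_pair` and `learn_bpe` are hypothetical, it trains on one byte sequence with no pre-tokenization, and it stops early once no pair occurs at least twice.

```python
from collections import Counter


def merge_pair(seq: list[bytes], pair: tuple[bytes, bytes]) -> list[bytes]:
    """Replace every non-overlapping occurrence of `pair` in `seq` with the merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` merge rules from `text` (toy, single sequence)."""
    # Start from single bytes, as in byte-level BPE.
    seq = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs in the current sequence.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            # Assumption of this sketch: a pair seen once is not worth merging.
            break
        merges.append(best)
        seq = merge_pair(seq, best)
    return merges


merges = learn_bpe("abababab", num_merges=2)
```

On `"abababab"` the first rule merges `a` and `b`, and the second merges the resulting `ab` tokens with each other.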

Related concepts

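What WordPiece tokenization does: similar to BPE, but chooses merges by likelihood gain instead of raw frequency

WordPiece builds its subword vocabulary much like BPE, but at each step it merges the pair that most increases the likelihood of the training data rather than the pair that occurs most often.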
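One commonly described way to approximate the likelihood criterion is to score each pair as count(pair) / (count(first) × count(second)). The sketch below uses that formulation as an assumption, with a made-up word-count table and a hypothetical helper name `wordpiece_scores`, to show how the top-scoring pair can differ from the most frequent one.

```python
from collections import Counter


def wordpiece_scores(word_counts: dict[str, int]):
    """Return (pair_freq, scores) for words split into characters.

    Continuation characters carry the '##' prefix used by WordPiece vocabularies.
    """
    unit_freq, pair_freq = Counter(), Counter()
    for word, n in word_counts.items():
        units = [word[0]] + ["##" + c for c in word[1:]]
        for u in units:
            unit_freq[u] += n
        for a, b in zip(units, units[1:]):
            pair_freq[(a, b)] += n
    # Score = pair count normalized by the counts of its two parts.
    scores = {
        pair: f / (unit_freq[pair[0]] * unit_freq[pair[1]])
        for pair, f in pair_freq.items()
    }
    return pair_freq, scores


# Made-up counts, for illustration only.
word_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
pair_freq, scores = wordpiece_scores(word_counts)

most_frequent = max(pair_freq, key=pair_freq.get)  # what BPE would merge
best_scored = max(scores, key=scores.get)          # what this score would merge
```

Here the most frequent pair is `("##u", "##g")`, while the highest-scoring pair is `("##g", "##s")`, because `##s` almost never appears outside that pair.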

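What SentencePiece does differently from BPE: operates on raw text and treats whitespace as an ordinary symbol

SentencePiece needs no language-specific pre-tokenization: it trains directly on raw text and encodes whitespace as a visible marker (▁), so the original string can be recovered from the tokens. It is a toolkit rather than a separate algorithm, and it implements both BPE and unigram models.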
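The whitespace handling can be illustrated without the library itself. The following sketch only mimics the marker convention; the function names are hypothetical, the piece list is hand-written rather than produced by a trained model, and the real toolkit also normalizes text, which is omitted here.

```python
META = "\u2581"  # the "▁" marker that stands in for a space


def mark_whitespace(text: str) -> str:
    """Turn spaces into visible markers, with one marker prepended at the start."""
    return META + text.replace(" ", META)


def detokenize(pieces: list[str]) -> str:
    """Join pieces and map markers back to spaces, dropping the leading one."""
    return "".join(pieces).replace(META, " ")[1:]


marked = mark_whitespace("Hello world.")
# A hand-written segmentation of `marked`, for illustration only.
pieces = [META + "Hello", META + "wor", "ld", "."]
restored = detokenize(pieces)
```

Because the marker is part of the tokens, decoding is a plain string join followed by a replacement, with no language-specific rules about where spaces belong.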

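What subword tokenization solves: rare and unseen words, by breaking them into known pieces

Subword tokenization avoids out-of-vocabulary failures: a rare or unseen word is split into smaller pieces that are already in the vocabulary, so the model can still represent it.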
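As a rough illustration, greedy longest-match segmentation is one simple way to break a word into known pieces. The vocabulary, the `segment` function, and the `[UNK]` fallback below are assumptions made for this sketch, not a specific tokenizer's behavior.

```python
def segment(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: give up on the whole word.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary.
vocab = {"token", "ization", "un", "believ", "able"}

a = segment("tokenization", vocab)
b = segment("unbelievable", vocab)
c = segment("xyz", vocab)
```

Neither full word is in the vocabulary, yet both are covered by pieces that are; only `"xyz"`, which shares no piece with the vocabulary, falls back to the unknown token.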

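What [CLS] pooling does: uses the first token's embedding as the sentence representation

[CLS] pooling takes the final-layer embedding of the special [CLS] token, which BERT-style models place first in the input, and uses it as the representation of the whole sequence.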
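In code, this amounts to indexing position 0 of each sequence. The sketch below uses plain nested lists shaped `[batch][tokens][dim]` instead of a tensor library, assumes the [CLS] token sits at index 0, and includes mean pooling only as a point of comparison; the numbers are made up.

```python
def cls_pool(hidden_states: list[list[list[float]]]) -> list[list[float]]:
    """Keep only the first token's vector for each sequence in the batch."""
    return [tokens[0] for tokens in hidden_states]


def mean_pool(hidden_states: list[list[list[float]]]) -> list[list[float]]:
    """Average all token vectors per sequence (ignores padding, for brevity)."""
    pooled = []
    for tokens in hidden_states:
        dim = len(tokens[0])
        pooled.append([sum(t[d] for t in tokens) / len(tokens) for d in range(dim)])
    return pooled


# One sequence of three tokens with 2-dimensional embeddings (made-up values).
hidden = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]

cls_vec = cls_pool(hidden)
mean_vec = mean_pool(hidden)
```

With a tensor library the same operation is typically a single slice along the token axis.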

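What Unigram tokenization does: starts with a large vocabulary and prunes it using EM

Unigram tokenization starts from a large candidate vocabulary, estimates token probabilities with EM under a unigram language model, and repeatedly prunes the tokens whose removal costs the least likelihood, until the target vocabulary size is reached.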
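Pruning depends on being able to score the best segmentation of a word under the current token probabilities. The sketch below shows only that scoring step, as a Viterbi search over a made-up probability table; it is not the full EM training loop, the name `viterbi_segment` is hypothetical, and the probabilities are not renormalized after a token is removed.

```python
import math


def viterbi_segment(word: str, logp: dict[str, float]):
    """Return (pieces, log_prob) of the most probable segmentation of `word`."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        return None, -math.inf  # word cannot be built from the vocabulary
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1], best[n]


# Made-up token probabilities, for illustration only.
probs = {"h": 0.1, "u": 0.1, "g": 0.1, "hu": 0.15, "ug": 0.25, "hug": 0.3}
logp = {tok: math.log(p) for tok, p in probs.items()}

full_pieces, full_score = viterbi_segment("hug", logp)

# Simulate pruning the token "hug" and re-score the same word.
pruned = {tok: lp for tok, lp in logp.items() if tok != "hug"}
pruned_pieces, pruned_score = viterbi_segment("hug", pruned)
```

The gap between `full_score` and `pruned_score` is the kind of likelihood loss that decides which tokens are cheap to drop.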
