the vocabulary size matters: larger vocab = shorter sequences but more parameters

Larger vocab reduces sequence length, increasing model complexity and parameters

Related concepts

weight tying does in language models: shares embedding and output projection matrices

Tying reduces the number of parameters by sharing embedding and output projection matrices

Attention Is All You Need

O(n) complexity for long sequences

batch size affects generalization: larger batches find sharper minima

Larger batch sizes lead to sharper minima, enhancing generalization by providing more accurate gradient estimates

WordPiece tokenization does: similar to BPE but uses likelihood instead of frequency

WordPiece tokenization splits words into subwords based on token likelihood rather than frequency

384-dim all-MiniLM-L6-v2 optimizes: fast sentence similarity with 6 layers

All-MiniLM-L6-v2 optimizes fast sentence similarity with 6 layers

soft targets carry more information than hard labels: they encode class similarities

Soft targets carry more information than hard labels because they encode class similarities

Swipe through 100 ML concepts daily