Context window limit: maximum tokens model processes simultaneously

Image: Santa Clara Valley Transportation Authority, Public domain, via Wikimedia Commons

Glossary of poker terms

Context window limit: maximum tokens model processes simultaneously

Related concepts

the compute-optimal training ratio is: roughly 20 tokens per parameter

Compute-optimal training ratio: roughly 20 tokens per parameter

the vocabulary size matters: larger vocab = shorter sequences but more parameters

Larger vocab reduces sequence length, increasing model complexity and parameters

Neural scaling law

Chinchilla scaling law: optimal model size scales linearly with compute budget

Unigram tokenization does: starts with large vocabulary and prunes using EM

Unigram tokenization starts with a large vocabulary and prunes using EM

[CLS] pooling does: uses the first token's embedding as the sentence representation

CLS pooling: uses the first token's embedding as the sentence representation

Swipe through 100 ML concepts daily