Context window limit: the maximum number of tokens a model can process at once
Compute-optimal training ratio: roughly 20 training tokens per model parameter
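As a quick illustration of the ratio above, the sketch below turns a parameter count into a rough training-token target. The function name `compute_optimal_tokens` is hypothetical, and the factor of 20 is the rule of thumb quoted above, not an exact constant.

```python
# Rule of thumb from the card above; treat it as approximate.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(num_params: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Return a rough compute-optimal number of training tokens."""
    return num_params * ratio


# Example: a 70-billion-parameter model.
tokens = compute_optimal_tokens(70e9)
print(f"{tokens:.2e} tokens")
```

Under this rule a 70B-parameter model would call for on the order of 1.4 trillion training tokens.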
Vocabulary size: a larger vocabulary shortens token sequences but adds embedding and output-layer parameters
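To make the parameter side of that trade-off concrete, here is a minimal sketch that counts only the vocabulary-dependent parameters. It assumes a standard token-embedding matrix of shape vocab_size x d_model and, optionally, a separate (untied) output projection of the same shape. The helper name `vocab_params` and the example sizes are illustrative assumptions.

```python
def vocab_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters that depend on vocabulary size.

    Assumes one embedding matrix (vocab_size x d_model); if the output
    projection is not tied to it, the same amount is counted again.
    """
    embedding = vocab_size * d_model
    return embedding if tied else 2 * embedding


# Hypothetical sizes, chosen only for illustration.
for vocab_size in (32_000, 128_000):
    print(vocab_size, vocab_params(vocab_size, d_model=4096))
```

Quadrupling the vocabulary quadruples these parameters, while how much shorter the sequences get depends on the data and the tokenizer.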
Neural scaling law
Chinchilla scaling law: compute-optimal model size and training tokens each scale roughly as the square root of the compute budget
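The sketch below shows how a compute budget could be split between parameters and tokens. It relies on two assumptions that are not stated in the card itself: the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D), and a fixed ratio of about 20 tokens per parameter. The function name `chinchilla_allocation` is hypothetical.

```python
import math


def chinchilla_allocation(
    compute_flops: float,
    tokens_per_param: float = 20.0,       # assumed rule-of-thumb ratio
    flops_per_param_token: float = 6.0,   # assumed C ~= 6 * N * D
) -> tuple[float, float]:
    """Split a FLOP budget into (parameters N, tokens D).

    With D = r * N and C = k * N * D, solving gives N = sqrt(C / (k * r)).
    """
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


for budget in (1e21, 4e21):
    n, d = chinchilla_allocation(budget)
    print(f"C={budget:.0e}: N={n:.2e} params, D={d:.2e} tokens")
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count.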
Unigram tokenization: starts with a large candidate vocabulary and prunes it using EM (expectation-maximization)
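The following is a deliberately simplified sketch of that idea, not the full algorithm. It uses "hard" EM (counting pieces in the single best segmentation) rather than expected counts, and it prunes by estimated probability rather than by the change in likelihood. It also assumes every single character is in the vocabulary so that any word can be segmented. All names and the toy corpus are hypothetical.

```python
import math


def viterbi_segment(text, logp):
    """Most probable segmentation of `text` under unigram log-probs."""
    n = len(text)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


def hard_em_prune(corpus, vocab, keep, iterations=3):
    """Re-estimate piece probabilities, then keep the `keep` best pieces."""
    logp = {p: -math.log(len(vocab)) for p in vocab}  # uniform start
    for _ in range(iterations):
        counts = {p: 0 for p in logp}
        for word in corpus:  # hard E-step: count pieces in best segmentation
            for piece in viterbi_segment(word, logp):
                counts[piece] += 1
        total = sum(counts.values()) + len(counts)  # M-step, add-one smoothing
        logp = {p: math.log((c + 1) / total) for p, c in counts.items()}
    singles = {p for p in logp if len(p) == 1}  # always kept for coverage
    multi = sorted((p for p in logp if len(p) > 1), key=logp.get, reverse=True)
    return singles | set(multi[: max(0, keep - len(singles))])


corpus = ["low", "lower", "lowest", "low", "low"]
vocab = set("lowerst") | {"lo", "ow", "low", "er", "est", "we"}
print(sorted(hard_em_prune(corpus, vocab, keep=8)))
```

On this toy corpus the frequently used piece "low" survives pruning while rarely used pieces such as "we" are dropped.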
Register pressure: too many variables (registers) per thread reduce occupancy, which can become a performance bottleneck
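A back-of-the-envelope sketch of why this happens: each multiprocessor has a fixed register file, so the more registers each thread needs, the fewer threads can be resident at once. The default figures below (65,536 registers and 2,048 threads per multiprocessor) are assumptions typical of some GPUs, not values from the card, and the model ignores allocation granularity, shared memory, and block-size limits.

```python
def occupancy_from_registers(
    regs_per_thread: int,
    regs_per_sm: int = 65_536,        # assumed register file size per multiprocessor
    max_threads_per_sm: int = 2_048,  # assumed hardware thread limit
) -> float:
    """Fraction of the thread limit that fits, considering registers only."""
    threads_by_regs = regs_per_sm // regs_per_thread
    resident = min(max_threads_per_sm, threads_by_regs)
    return resident / max_threads_per_sm


for regs in (32, 64, 128):
    print(regs, occupancy_from_registers(regs))
```

With these assumed limits, going from 32 to 64 registers per thread halves the achievable occupancy.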
[CLS] pooling: uses the first token's ([CLS]) embedding as the sentence representation
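A minimal sketch of the operation, using plain nested lists instead of a tensor library. It assumes the encoder output is laid out as [batch][sequence position][hidden dimension] with the [CLS] token at position 0. The function name `cls_pool` and the toy values are illustrative.

```python
def cls_pool(hidden_states):
    """Return the position-0 vector of every sequence in the batch.

    hidden_states: nested list shaped [batch][seq_len][hidden_dim],
    assuming the [CLS] token sits at position 0 of each sequence.
    """
    return [sequence[0] for sequence in hidden_states]


# Toy batch: 2 sequences, 3 tokens each, hidden size 2.
batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]
print(cls_pool(batch))
```

The remaining token embeddings are simply ignored, which is what distinguishes this from mean or max pooling.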