
Compute-optimal training ratio: roughly 20 tokens per parameter
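As a rough illustration of the ratio above, the sketch below multiplies a parameter count by 20 to estimate a training-token budget. The function name and the example model sizes are hypothetical, and the factor of 20 is the rule of thumb from the card, not an exact constant.

```python
TOKENS_PER_PARAM = 20  # rule of thumb from the text; an approximation


def compute_optimal_tokens(n_params, ratio=TOKENS_PER_PARAM):
    """Approximate training-token budget for a model with n_params parameters."""
    return ratio * n_params


if __name__ == "__main__":
    # Example model sizes chosen only for illustration.
    for n_params in (1e9, 7e9, 70e9):
        tokens = compute_optimal_tokens(n_params)
        print(f"{n_params / 1e9:.0f}B params -> ~{tokens / 1e12:.2f}T tokens")
```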
MoE models have more parameters but similar compute cost
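MoE models spread their parameters across many experts but route each token to only a few of them (top-k), so per-token compute tracks the active experts rather than the total parameter count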
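The gap between total and active parameters can be sketched with simple counting. This is a minimal sketch that assumes each expert is a two-matrix feed-forward block and ignores router weights and biases; the class name and the layer sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MoELayer:
    d_model: int    # hidden size
    d_ff: int       # inner size of each expert's feed-forward block
    n_experts: int  # experts stored in the layer
    top_k: int      # experts the router activates per token

    def expert_params(self):
        # Up-projection plus down-projection; biases and router ignored.
        return 2 * self.d_model * self.d_ff

    def total_params(self):
        # Parameters that must be stored in memory.
        return self.n_experts * self.expert_params()

    def active_params(self):
        # Parameters actually used to process one token.
        return self.top_k * self.expert_params()


layer = MoELayer(d_model=4096, d_ff=14336, n_experts=8, top_k=2)
ratio = layer.total_params() / layer.active_params()
print(f"total={layer.total_params():,} active={layer.active_params():,} ratio={ratio:.1f}")
```

With 8 experts and top-2 routing, the layer stores four times as many expert parameters as it uses for any single token.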
Context window limit: the maximum number of tokens the model can process at once
Quantization to INT8 can roughly double throughput relative to FP16, because tensor cores on many GPUs run INT8 matrix math at about 2x the FP16 rate; the actual gain depends on the hardware and the workload
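For readers unfamiliar with what INT8 quantization does to the numbers themselves, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It shows only the value mapping, not the throughput effect, and it is one common scheme among several; the function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range used here.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [qi * scale for qi in q]


weights = [0.5, -1.27, 0.03, 1.0]  # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per value is bounded by half the scale, which is the precision given up in exchange for the smaller, faster integer format.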
Neural scaling law
Chinchilla scaling law: optimal model size and training tokens each scale roughly with the square root of the compute budget, so both grow in equal proportion
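A worked sketch of that allocation follows. It assumes the common approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D) together with the roughly 20 tokens-per-parameter ratio; both constants are approximations, and the function name is hypothetical.

```python
import math


def chinchilla_allocation(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a compute budget into (parameters, tokens).

    Assumes C = flops_per_param_token * N * D and D = tokens_per_param * N,
    which gives N = sqrt(C / (flops_per_param_token * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n1, d1 = chinchilla_allocation(1e21)
n2, d2 = chinchilla_allocation(4e21)
# Quadrupling compute doubles both the parameter count and the token count.
print(f"{n1:.3e} params, {d1:.3e} tokens")
print(f"{n2:.3e} params, {d2:.3e} tokens")
```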
Adam vs SGD: Adam adapts a separate step size for each parameter; SGD often generalizes better when its learning rate and schedule are carefully tuned
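The difference in update rules can be sketched in plain Python. This is a minimal single-step illustration using the standard Adam formulas with their usual default hyperparameters; the gradient values are made up to show two parameters with very different gradient scales.

```python
import math


def sgd_step(params, grads, lr=0.1):
    """Plain SGD: every parameter shares one learning rate."""
    return [p - lr * g for p, g in zip(params, grads)]


def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; m and v are running moment estimates, t starts at 1."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g       # first moment (mean of gradients)
        v_i = beta2 * v_i + (1 - beta2) * g * g   # second moment (mean of squares)
        m_hat = m_i / (1 - beta1 ** t)            # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


params = [1.0, 1.0]
grads = [100.0, 0.01]  # illustrative gradients with very different magnitudes
sgd_params = sgd_step(params, grads)
adam_params, m, v = adam_step(params, grads, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
print(sgd_params, adam_params)
```

SGD moves the first parameter far more than the second, while Adam's per-parameter normalization gives both a step of roughly `lr` on this first update.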
What the embedding layer does: maps discrete token IDs to dense learned vectors that the rest of the network can process
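At its core the lookup is just row indexing into a table, as the sketch below suggests. It is a simplified stand-in, not any library's implementation: the class name, sizes, and random initialization are assumptions, and in a real model the rows would be updated by gradient descent during training.

```python
import random


class Embedding:
    """Toy embedding table: one dense vector per token ID."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        # Small random initial values; training would adjust these rows.
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                      for _ in range(vocab_size)]

    def __call__(self, token_ids):
        # Each token ID selects its row from the table.
        return [self.table[i] for i in token_ids]


emb = Embedding(vocab_size=10, dim=4)
vectors = emb([3, 7, 3])
print(len(vectors), len(vectors[0]))
```

The same token ID always maps to the same vector, so the first and third outputs here are identical.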