Compute-optimal training ratio: roughly 20 training tokens per model parameter
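As a quick arithmetic sketch of the ratio above, the snippet below turns a parameter count into an approximate token budget. The function name and the 7-billion-parameter example are illustrative assumptions, not part of the original card.

```python
# Rule-of-thumb sketch: about 20 training tokens per model parameter.
TOKENS_PER_PARAM = 20  # approximate ratio stated on the card


def compute_optimal_tokens(n_params: int) -> int:
    """Return the approximate compute-optimal number of training tokens."""
    return TOKENS_PER_PARAM * n_params


# Hypothetical example: a 7-billion-parameter model.
n_params = 7_000_000_000
n_tokens = compute_optimal_tokens(n_params)
print(f"{n_params:,} parameters -> about {n_tokens:,} tokens")
```

For the hypothetical 7B model this gives about 140 billion tokens.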

Related concepts

Mixture-of-experts (MoE) models: total parameter count grows with the number of experts, but each token is routed to only k of them, so per-token compute tracks the active parameters rather than the total
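The gap between total and per-token parameters can be sketched with a simple count. This is an illustration under stated assumptions: each expert is taken to be a two-matrix feed-forward block, biases and the router are ignored, and the layer sizes are hypothetical.

```python
def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, top_k: int) -> tuple[int, int]:
    """Return (total, active-per-token) feed-forward parameter counts for one MoE layer.

    Assumption: each expert holds two weight matrices, d_model x d_ff and
    d_ff x d_model; biases and router weights are ignored.
    """
    per_expert = 2 * d_model * d_ff
    total = n_experts * per_expert   # parameters stored in the layer
    active = top_k * per_expert      # parameters actually used for one token
    return total, active


# Hypothetical layer: 8 experts, 2 routed per token.
total, active = moe_ffn_params(d_model=4096, d_ff=14336, n_experts=8, top_k=2)
print(f"total: {total:,}  active per token: {active:,}  ratio: {total // active}x")
```

With 8 experts and 2 active per token, the layer stores four times as many feed-forward parameters as any single token uses.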

Context window limit: the maximum number of tokens the model can process at once

Quantization to INT8: can roughly double throughput relative to FP16 on hardware whose tensor cores process INT8 about 2x faster; actual gains depend on kernel support and memory bandwidth
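To make the card concrete, here is a minimal sketch of what INT8 quantization does to the numbers themselves. It assumes symmetric per-tensor quantization, which is only one of several schemes, and it does not measure throughput. The function names and sample weights are hypothetical.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # float value of one integer step
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the INT8 codes."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.0, 1.0]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, restored)
```

Each restored value differs from the original by at most half a quantization step, which is the precision traded for the smaller integer format.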

Neural scaling law

Chinchilla scaling law: compute-optimal model size and training tokens each scale roughly as the square root of the compute budget
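A rough sketch of how a compute budget might be split between parameters and tokens. It relies on two assumptions that are not stated on the card: the common approximation C ≈ 6·N·D for training FLOPs, and the rule of thumb D ≈ 20·N. Under those assumptions C ≈ 120·N², so both N and D grow as the square root of C.

```python
import math

TOKENS_PER_PARAM = 20      # rule-of-thumb ratio (assumption)
FLOPS_PER_PARAM_TOKEN = 6  # from the C ~ 6 * N * D approximation (assumption)


def optimal_split(compute_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) that spend the budget at 20 tokens per parameter."""
    # C = 6 * N * (20 * N) = 120 * N**2, so N = sqrt(C / 120).
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


for budget in (1e21, 4e21, 16e21):  # each budget is 4x the previous one
    n, d = optimal_split(budget)
    print(f"C={budget:.0e}  N={n:.3e}  D={d:.3e}")
```

Quadrupling the budget doubles both the parameter count and the token count in this sketch, rather than quadrupling either one.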

Adam vs SGD: Adam adapts the learning rate per parameter; SGD often generalizes better when carefully tuned
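The per-parameter adaptation can be seen in a minimal scalar sketch of the two update rules. The Adam step uses the standard form with bias correction; the gradients and hyperparameter defaults are illustrative assumptions.

```python
import math


def sgd_step(param: float, grad: float, lr: float = 0.1) -> float:
    """Plain SGD: the step size is proportional to the raw gradient."""
    return param - lr * grad


def adam_step(param, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for step t (t >= 1)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Two hypothetical parameters whose gradients differ by four orders of magnitude.
for grad in (100.0, 0.01):
    sgd_new = sgd_step(0.0, grad)
    adam_new, _, _ = adam_step(0.0, grad, m=0.0, v=0.0, t=1)
    print(f"grad={grad}: SGD moves to {sgd_new:.4f}, Adam moves to {adam_new:.4f}")
```

On the first step SGD moves the two parameters by very different amounts, while Adam normalizes each update by its own gradient scale and moves both by roughly the learning rate.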

Embedding layer: maps discrete token IDs to dense learned vectors that the rest of the network processes
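At its core an embedding layer is a table lookup, which the sketch below shows in plain Python. The table here is randomly initialized and never trained, and the vocabulary size, dimension, and function names are hypothetical.

```python
import random


def make_embedding_table(vocab_size: int, dim: int, seed: int = 0) -> list[list[float]]:
    """Build a vocab_size x dim table of small random values (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]


def embed(token_ids: list[int], table: list[list[float]]) -> list[list[float]]:
    """Look up one dense vector per token ID."""
    return [table[t] for t in token_ids]


table = make_embedding_table(vocab_size=10, dim=4)
vectors = embed([3, 1, 3], table)
print(len(vectors), len(vectors[0]))
```

The same token ID always returns the same row, so both occurrences of ID 3 map to an identical vector; in a real model those rows are updated during training.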
