Chinchilla scaling law: optimal model size scales with roughly the square root of the compute budget
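The Chinchilla scaling law relates the compute-optimal size of a neural network model to the available compute budget. For a fixed budget, loss is lowest when parameters and training tokens are scaled up in roughly equal proportion, so each grows approximately as the square root of compute rather than linearly. This makes the law particularly relevant for allocating resources in large training runs.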
Example
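In a practical scenario, a machine learning team with a fixed compute budget should split it between model size and training data rather than spend it all on a larger model. For example, with four times the compute, the Chinchilla scaling law suggests roughly doubling both the parameter count and the number of training tokens.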
Understanding this scaling law helps in efficiently utilizing compute resources to achieve optimal model performance.
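As a rough sketch of the allocation described above, the snippet below splits a compute budget between parameters and tokens. It assumes the common approximation that training costs about 6 FLOPs per parameter per token, and it assumes a fixed ratio of 20 tokens per parameter; the function name and constants are illustrative, not taken from any library.

```python
import math

# Assumption: training compute C ≈ 6 * N * D (N parameters, D tokens).
FLOPS_PER_PARAM_TOKEN = 6
# Assumption: a fixed rule-of-thumb ratio of training tokens to parameters.
TOKENS_PER_PARAM = 20


def compute_optimal_allocation(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Return (params, tokens) that spend compute_flops at a fixed ratio.

    Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    """
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


small = compute_optimal_allocation(1e21)
large = compute_optimal_allocation(4e21)  # four times the compute

print(f"1e21 FLOPs: {small[0]:.3e} params, {small[1]:.3e} tokens")
print(f"4e21 FLOPs: {large[0]:.3e} params, {large[1]:.3e} tokens")
print(f"param growth for 4x compute: {large[0] / small[0]:.2f}x")
```

Under these assumptions, quadrupling compute doubles both the parameter count and the token count, which is the square-root behavior the law describes.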
MoE models have more parameters but similar compute cost
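MoE (mixture-of-experts) models spread their parameters across many experts, but a router sends each token to only a few of them (the top k). Per-token compute therefore depends on the active experts, not on the total parameter count.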
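The gap between total and active parameters can be made concrete with a small count. This is a minimal sketch that assumes each expert is a two-matrix feed-forward block without biases and that the router is a single linear layer; the layer sizes and function names are hypothetical.

```python
def ffn_params(d_model, d_ff):
    """Parameters in one feed-forward expert: up- and down-projection, no biases."""
    return 2 * d_model * d_ff


def moe_layer_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active) parameter counts for one MoE layer.

    total  counts every expert plus the router.
    active counts only the top_k experts a single token passes through.
    """
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts  # one linear layer scoring each expert
    total = num_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active


total, active = moe_layer_params(d_model=1024, d_ff=4096, num_experts=8, top_k=2)
print(f"total parameters:  {total:,}")
print(f"active per token:  {active:,}")
print(f"ratio:             {total / active:.2f}x")
```

With 8 experts and 2 active per token, the layer holds about four times the parameters that any single token actually uses.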
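Gradient accumulation simulates larger batch sizes without more memory
Gradient accumulation splits a large batch into smaller micro-batches, processes them one at a time, and sums their gradients before a single weight update. Peak activation memory is set by the micro-batch size, while the update matches the larger effective batch.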
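The equivalence can be checked on a toy problem. The sketch below uses a one-parameter linear model with mean squared error, written in plain Python so it runs without a deep learning framework; the data and function names are made up for illustration.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients into the full-batch gradient."""
    total = len(xs)
    accum = 0.0
    for start in range(0, total, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        # Weight each micro-batch mean by its share of the full batch,
        # so the sum equals the mean over all examples.
        accum += grad_mse(w, mx, my) * len(mx) / total
    return accum


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
w = 0.5

full = grad_mse(w, xs, ys)
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)
print(f"full-batch gradient:  {full:.6f}")
print(f"accumulated gradient: {accum:.6f}")
```

The scaling step matters: summing unscaled micro-batch means would give a gradient several times too large, which is why training loops typically divide the loss by the number of accumulation steps.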
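LoRA rank r controls adapter capacity and trainable parameter count
LoRA freezes the pretrained weight matrix and learns a low-rank update BA, where B is d_out × r and A is r × d_in. A larger r gives the adapter more capacity at the cost of r × (d_in + d_out) trainable parameters.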
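To see how the rank drives the trainable parameter count, here is a small sketch comparing a full weight update against a low-rank one. It assumes the standard LoRA factorization into a d_out × r matrix and an r × d_in matrix; the dimensions and function names are illustrative.

```python
def full_update_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters for a rank-r update: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


d_in = d_out = 4096
full = full_update_params(d_in, d_out)
for rank in (4, 8, 64):
    adapter = lora_params(d_in, d_out, rank)
    print(f"rank {rank:>2}: {adapter:>9,} trainable params "
          f"({adapter / full:.3%} of the full matrix)")
```

The count grows linearly in r, so doubling the rank doubles the adapter size while the frozen base matrix stays unchanged.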
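Vocabulary size matters: a larger vocab means shorter sequences but more parameters
A larger vocabulary encodes the same text in fewer tokens, which shortens sequences, but the embedding table and the output projection each grow in proportion to the vocabulary size, which adds parameters.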
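The parameter side of this trade-off is easy to estimate. The sketch below counts embedding and output-projection parameters for a given vocabulary size, assuming both matrices have shape vocab_size × d_model and may optionally share weights; the sizes are hypothetical, and the sequence-length savings are not modeled because they depend on the tokenizer and the text.

```python
def embedding_params(vocab_size, d_model, tied=False):
    """Parameters in the input embedding plus the output projection.

    With tied=True the output projection reuses the embedding matrix.
    """
    input_embedding = vocab_size * d_model
    output_projection = 0 if tied else vocab_size * d_model
    return input_embedding + output_projection


d_model = 4096
for vocab_size in (32_000, 128_000):
    untied = embedding_params(vocab_size, d_model)
    tied = embedding_params(vocab_size, d_model, tied=True)
    print(f"vocab {vocab_size:>7,}: untied {untied:>13,}  tied {tied:>13,}")
```

Quadrupling the vocabulary quadruples these matrices, which is a noticeable share of the total for smaller models.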
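The determinant tells you about volume scaling under a linear transformation
The absolute value of the determinant of the matrix representing a linear transformation is the factor by which volumes are scaled. A negative determinant means the transformation flips orientation, and a zero determinant means it collapses volume to zero.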
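A two-dimensional check makes this concrete: map the unit square through a matrix and compare the area of the image with the determinant. This is a minimal sketch in plain Python; the example matrix and helper names are chosen only for illustration.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    return a * d - b * c


def apply(m, point):
    """Apply the 2x2 matrix m to a point (x, y)."""
    (a, b), (c, d) = m
    x, y = point
    return (a * x + b * y, c * x + d * y)


def signed_area(points):
    """Signed polygon area via the shoelace formula (positive if counterclockwise)."""
    total = 0.0
    for i in range(len(points)):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return total / 2


unit_square = [(0, 0), (1, 0), (1, 1), (0, 1)]  # area 1, counterclockwise
matrix = ((2, 1), (0, 3))
image = [apply(matrix, p) for p in unit_square]

print("determinant:  ", det2(matrix))
print("area of image:", signed_area(image))
```

Because the unit square has area 1, the signed area of its image equals the determinant, including the sign when the transformation reverses orientation.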
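The compute-optimal training ratio is roughly 20 tokens per parameter
Under the Chinchilla analysis, a compute-optimal run trains on roughly 20 tokens per parameter. Chinchilla itself, for example, has 70B parameters and was trained on about 1.4T tokens.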