Maximum likelihood estimation chooses θ to maximize the probability of the observed data given θ
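A minimal sketch of this idea, assuming coin-flip (Bernoulli) data. The function names and the sample are illustrative and not from the source. For this model the maximizing θ has a closed form, the sample mean, and the log-likelihood function lets you check that nearby values of θ score lower.

```python
import math

def bernoulli_log_likelihood(theta, data):
    """Log-probability of the observed 0/1 data given theta."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in data)

def bernoulli_mle(data):
    """Closed-form maximizer of the Bernoulli likelihood: the sample mean."""
    return sum(data) / len(data)

# Hypothetical sample: 7 heads in 10 flips.
flips = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
theta_hat = bernoulli_mle(flips)
```

Comparing `bernoulli_log_likelihood(theta_hat, flips)` with the value at, say, 0.6 or 0.8 shows that the estimate sits at the peak.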
What sufficient statistics are: summaries that compress the data without losing information about θ
A sufficient statistic for θ captures all the information in the data that is relevant to estimating θ; once it is known, the rest of the sample says nothing more about θ
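One way to see this, sketched under the assumption of independent Bernoulli(θ) draws (the helper names and samples are hypothetical): two samples with the same length and the same number of successes give the same likelihood for every θ, so the count of successes is all the data has to say about θ.

```python
import math

def bernoulli_likelihood(theta, data):
    """Probability of the exact 0/1 sequence under independent Bernoulli(theta) draws."""
    return math.prod(theta if x == 1 else 1.0 - theta for x in data)

def likelihood_from_statistic(theta, n, successes):
    """The same likelihood computed from (n, number of successes) alone."""
    return theta ** successes * (1.0 - theta) ** (n - successes)

# Two hypothetical samples: different orderings, same number of successes (3 of 5).
sample_a = [1, 0, 1, 1, 0]
sample_b = [0, 1, 1, 0, 1]
```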
How does score matching use the Fisher divergence to learn the parameters of a probabilistic model without computing the normalizing constant?
Score matching estimates parameters by minimizing the Fisher divergence, the expected squared difference between the data score and the model score \nabla_x \log p_\theta(x); because the score is a gradient with respect to x, the normalizing constant drops out
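A minimal sketch of score matching for a one-dimensional Gaussian model, assuming Hyvärinen's integration-by-parts form of the objective, the sample mean of 0.5 ψ(x)² + ψ'(x), where ψ(x) = d/dx log p(x). The data and function name are illustrative. Note that the code never touches the Gaussian normalizer.

```python
def score_matching_objective(mu, var, data):
    """Sample average of 0.5 * psi(x)^2 + psi'(x) for a 1-D Gaussian model,
    where psi(x) = d/dx log p(x) = -(x - mu) / var."""
    total = 0.0
    for x in data:
        psi = -(x - mu) / var      # model score at x
        dpsi_dx = -1.0 / var       # derivative of the score with respect to x
        total += 0.5 * psi ** 2 + dpsi_dx
    return total / len(data)

# Hypothetical data; for this model the objective is minimized at the
# sample mean and the (biased) sample variance.
data = [1.2, 0.7, 2.9, 1.8, 2.4]
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
best = score_matching_objective(mean, variance, data)
```

Evaluating the objective at shifted values of `mu` or `var` gives a larger value than `best`, which is what a gradient-based optimizer would exploit for models without a closed form.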
What the compute-optimal training ratio is: roughly 20 tokens per parameter
Compute-optimal training ratio: approximately 20 training tokens per model parameter, a rule of thumb from the Chinchilla scaling analysis
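The arithmetic behind this rule of thumb can be sketched as follows. The 7-billion-parameter model is a hypothetical example, and the FLOP estimate uses the common approximation C ≈ 6·N·D, which is an assumption not stated in the text above.

```python
TOKENS_PER_PARAMETER = 20  # rule of thumb from the card above

def compute_optimal_tokens(n_parameters):
    """Training tokens suggested by the ~20 tokens/parameter ratio."""
    return TOKENS_PER_PARAMETER * n_parameters

def approx_training_flops(n_parameters, n_tokens):
    """Rough training cost using C ~ 6 * N * D (an assumed approximation)."""
    return 6 * n_parameters * n_tokens

# Hypothetical 7-billion-parameter model.
tokens_for_7b = compute_optimal_tokens(7_000_000_000)
flops_for_7b = approx_training_flops(7_000_000_000, tokens_for_7b)
```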
Write the policy gradient theorem equation
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \, Q^{\pi_\theta}(s, a)\right]
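A small numerical check of the score-function form, in which the gradient of the log-probability is weighted by the action value. This is a sketch under simplifying assumptions: a single-state, one-step problem (a bandit) with a softmax policy, so Q(s, a) is just the reward of action a. The logits and rewards are made up for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    """J(theta) for a one-step problem: sum over actions of pi(a) * r(a)."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def policy_gradient(logits, rewards):
    """E_a[ grad log pi(a) * Q(a) ], computed exactly by summing over actions."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, q_a) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            dlogp = (1.0 if k == a else 0.0) - probs[k]  # d log pi(a) / d theta_k
            grad[k] += p_a * dlogp * q_a
    return grad

def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Central-difference estimate of dJ/dtheta, used only as a check."""
    grad = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += eps
        down[k] -= eps
        grad.append((expected_reward(up, rewards) - expected_reward(down, rewards)) / (2 * eps))
    return grad

logits = [0.2, -0.5, 1.0]   # hypothetical policy parameters
rewards = [1.0, 0.0, 2.0]   # hypothetical per-action rewards
```

The two gradient functions agree to numerical precision. Dropping the `q_a` factor would instead give a vector of zeros, since the expected score is zero.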
A p-value < 0.05 means: if H₀ is true, a result at least this extreme has <5% probability
A p-value < 0.05 indicates a less than 5% chance of observing data at least as extreme as this if the null hypothesis is true; it is not the probability that the null hypothesis is true
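To make the definition concrete, here is a sketch of an exact one-sided binomial test. The scenario (15 heads in 20 flips, with a fair coin as H₀) and the function name are illustrative assumptions.

```python
from math import comb

def one_sided_binomial_p_value(n, k, p_null=0.5):
    """P(X >= k) for X ~ Binomial(n, p_null): the probability, under H0,
    of a result at least as extreme as the one observed."""
    return sum(comb(n, i) * p_null ** i * (1 - p_null) ** (n - i) for i in range(k, n + 1))

# Hypothetical experiment: 15 heads in 20 flips, H0 says the coin is fair.
p_value = one_sided_binomial_p_value(20, 15)
reject_at_5_percent = p_value < 0.05
```

Here the p-value is about 0.021, so the result would be called significant at the 5% level.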
What calibration means: a model predicting 80% should be correct 80% of the time
Calibration: the model's predicted probabilities match the observed frequencies of the outcomes
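One common way to quantify this is the expected calibration error (ECE). The sketch below assumes equal-width confidence bins and uses made-up predictions; the function name is illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average, over confidence bins, of |mean confidence - accuracy|."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(mean_conf - accuracy)
    return ece

# Hypothetical predictions, all made at 80% confidence.
well_calibrated = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])  # 4 of 5 correct
overconfident = expected_calibration_error([0.8] * 5, [1, 0, 0, 1, 0])    # 2 of 5 correct
```

The first set has an error near zero, while the second is off by the gap between 80% confidence and 40% accuracy.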