Maximizes θ to maximize the probability of observed data given θ

What maximum likelihood estimation does: find θ maximizing P(data|θ)

Maximizes θ to maximize the probability of observed data given θ

Related concepts

What sufficient statistics are — compress data without losing information about θ

Sufficient statistics for θ are those that capture all necessary information to estimate the parameter

How does score matching utilize the Fisher Information Matrix to learn the parameters of a probabilistic model without normalizing the score?

Score matching estimates parameters by minimizing the Kullback-Leibler divergence between empirical and model score distributions

What the compute-optimal training ratio is: roughly 20 tokens per parameter

Optimal training ratio: Approximately 20 tokens/parameter

Write the policy gradient theorem equation

E[\nabla_\theta J(\theta)] = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s)]

A p-value < 0.05 means: if H₀ is true, this result has <5% probability

A p-value < 0.05 indicates a less than 5% chance of observing data as extreme as this if the null hypothesis is true

What calibration means: a model predicting 80% should be correct 80% of the time

Calibration: Model's predicted probabilities match actual outcomes' frequencies

Swipe through 100 ML concepts daily