Trains on convex combinations of pairs: x̃=λx_i+(1-λ)x_j represents a weighted average of two points in a convex set
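The convex combination above is the construction used by mixup. A minimal sketch follows, assuming plain Python lists for inputs and one-hot labels. The function name `mixup_pair`, the `alpha` default, and the choice to mix the labels with the same λ are illustrative assumptions, not taken from the card.

```python
import random

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two (input, label) examples with one shared weight lam.

    Assumption: lam ~ Beta(alpha, alpha) and labels are mixed with the
    same lam as the inputs, as in the usual mixup recipe.
    """
    lam = rng.betavariate(alpha, alpha)  # lam lies in [0, 1]
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam

# Hypothetical usage: two 2-D points with one-hot labels.
x_mix, y_mix, lam = mixup_pair([0.0, 2.0], [1.0, 0.0],
                               [1.0, 4.0], [0.0, 1.0],
                               rng=random.Random(0))
```

Because λ is in [0, 1], each mixed coordinate stays between the two original values, and mixed one-hot labels still sum to 1.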
Why non-convex loss landscapes are hard: many local minima and saddle points
Non-convex landscapes have numerous local minima and saddle points, complicating optimization
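A small illustration of the local-minimum problem, assuming a hypothetical one-dimensional loss f(x) = (x² - 1)² + 0.3x chosen only because it has two minima of different depth. Plain gradient descent ends up in whichever basin it starts in. Saddle points need at least two dimensions and are not shown here.

```python
def f(x):
    # Hypothetical non-convex loss with two local minima.
    return (x * x - 1.0) ** 2 + 0.3 * x

def grad_f(x):
    # Derivative of f: 4x(x^2 - 1) + 0.3
    return 4.0 * x * (x * x - 1.0) + 0.3

def gradient_descent(x0, lr=0.01, steps=2000):
    """Run fixed-step gradient descent from x0 and return the final point."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

left = gradient_descent(-2.0)   # starts in the left basin
right = gradient_descent(2.0)   # starts in the right basin
print(left, f(left))
print(right, f(right))
```

Both runs stop where the gradient vanishes, but the run started on the right settles in the shallower minimum, so the result depends on initialization.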
What weight tying does in language models: shares embedding and output projection matrices
Language models use tied weights to share embedding and output projection matrices, enhancing parameter efficiency
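A toy sketch of weight tying, assuming a made-up `TiedLM` class with plain Python lists rather than any real framework. A single matrix `E` (vocabulary size × hidden size) serves both as the input embedding table and as the output projection.

```python
import random

class TiedLM:
    """Toy model head: one matrix E is both embedding and output projection."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        # E has one row of length `dim` per vocabulary entry.
        self.E = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                  for _ in range(vocab_size)]

    def embed(self, token_id):
        # Input side: look up the row for this token.
        return self.E[token_id]

    def logits(self, hidden):
        # Output side: score every token by a dot product with the same rows.
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.E]

    def num_params(self):
        # vocab_size * dim, instead of 2 * vocab_size * dim when untied.
        return len(self.E) * len(self.E[0])

model = TiedLM(vocab_size=5, dim=3)
scores = model.logits(model.embed(2))
```

With tying, an update to a row of `E` changes both how that token is read in and how it is scored on output.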
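What AdaGrad does: divides the learning rate by the square root of the running sum of squared gradients
AdaGrad keeps a per-parameter sum of squared past gradients and scales each step by its inverse square root, so parameters with large or frequent gradients get smaller effective learning rates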
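The update can be sketched in a few lines, assuming plain Python lists for parameters and gradients. The function name `adagrad_step` and the `eps` stabilizer value are illustrative choices.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update; returns new parameter and accumulator lists."""
    new_accum = [a + g * g for a, g in zip(accum, grads)]
    new_params = [p - lr * g / (math.sqrt(a) + eps)
                  for p, g, a in zip(params, grads, new_accum)]
    return new_params, new_accum

# Hypothetical usage: the two gradients differ by a factor of 10.
params, accum = adagrad_step([1.0, 1.0], [2.0, 20.0], [0.0, 0.0])
```

On the first step the accumulator equals the squared gradient, so each coordinate moves by roughly `lr` regardless of gradient scale. Later steps shrink as the sum grows.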
What score matching does: learns the gradient of the log-density without normalizing
Score matching fits a model to the score, the gradient of the log-density with respect to the data, so the normalizing constant never has to be computed
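Why the normalizer drops out can be shown in one line, assuming the model is written as an unnormalized density \tilde{p}_\theta divided by a constant Z(\theta) (notation introduced here for illustration):

```latex
p_\theta(x) = \frac{\tilde{p}_\theta(x)}{Z(\theta)}
\quad\Longrightarrow\quad
\nabla_x \log p_\theta(x)
  = \nabla_x \log \tilde{p}_\theta(x) - \nabla_x \log Z(\theta)
  = \nabla_x \log \tilde{p}_\theta(x)
```

Z(\theta) does not depend on x, so its gradient with respect to x is zero.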
Write the formula for Lagrangian L(x,λ) = f(x) - λg(x)
L(x,λ) = f(x) - λg(x), where λ is the Lagrange multiplier for the constraint g(x) = 0
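A short worked example, using a hypothetical problem chosen for illustration: minimize f(x) = x_1^2 + x_2^2 subject to g(x) = x_1 + x_2 - 1 = 0, with the same sign convention as the formula above.

```latex
L(x, \lambda) = x_1^2 + x_2^2 - \lambda\,(x_1 + x_2 - 1)

\frac{\partial L}{\partial x_1} = 2x_1 - \lambda = 0, \qquad
\frac{\partial L}{\partial x_2} = 2x_2 - \lambda = 0, \qquad
\frac{\partial L}{\partial \lambda} = -(x_1 + x_2 - 1) = 0

\Longrightarrow\quad x_1 = x_2 = \tfrac{1}{2}, \qquad \lambda = 1
```

Setting all partial derivatives to zero recovers both the stationarity condition and the constraint itself.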
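How does score matching use the Fisher divergence to learn the parameters of a probabilistic model without computing the normalizing constant?
Score matching estimates parameters by minimizing the Fisher divergence, the expected squared difference between the data score and the model score; integration by parts removes the unknown data score, and the normalizing constant never appears because it does not depend on x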
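For reference, the score matching objective in its usual form, assuming p_data is the data density and p_\theta the model. The second line is the equivalent expression due to Hyvärinen, which holds under mild regularity conditions:

```latex
J(\theta) = \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\mathrm{data}}}
  \left[ \left\| \nabla_x \log p_\theta(x) - \nabla_x \log p_{\mathrm{data}}(x) \right\|^2 \right]

J(\theta) = \mathbb{E}_{x \sim p_{\mathrm{data}}}
  \left[ \operatorname{tr}\!\left( \nabla_x^2 \log p_\theta(x) \right)
       + \tfrac{1}{2} \left\| \nabla_x \log p_\theta(x) \right\|^2 \right] + \text{const}
```

The second form involves only the model score and its derivative, so it can be estimated from samples without knowing the data score.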