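Mixup
Trains on convex combinations of pairs of examples: x̃ = λx_i + (1-λ)x_j, with the labels mixed the same way, ỹ = λy_i + (1-λ)y_j. The model is trained on x̃ in place of the original x_i and x_j.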
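The convex combination above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `mixup_pair`, the default `alpha=0.2`, and drawing λ from a Beta(alpha, alpha) distribution are assumptions that do not appear in the card itself.

```python
import random

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two (input, one-hot label) examples with a shared weight lam.

    Assumption: lam is drawn from Beta(alpha, alpha); the card only says
    that lam defines a convex combination.
    """
    lam = rng.betavariate(alpha, alpha)
    # Inputs and labels are mixed with the same lam.
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam

if __name__ == "__main__":
    x_mix, y_mix, lam = mixup_pair([0.0, 2.0], [1.0, 0.0], [4.0, 6.0], [0.0, 1.0])
    print(lam, x_mix, y_mix)
```

Because the labels are mixed with the same λ as the inputs, a mixed one-hot label still sums to 1.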
Mixed precision training
Runs the forward pass in FP16 and accumulates gradient updates in FP32, typically into an FP32 master copy of the weights.
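One way to see why accumulation is kept in FP32 is to simulate both formats with the standard library. The sketch below rounds values to FP16 or FP32 through `struct` (format codes `'e'` and `'f'`); the helper names, the learning rate, and the constant gradient are hypothetical choices made only for illustration.

```python
import struct

def round_to(fmt, value):
    """Round a Python float to a struct format: 'e' is FP16, 'f' is FP32."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

def run_updates(weight_fmt, steps=100, lr=1e-4):
    """Apply `steps` small updates to a weight stored in `weight_fmt`."""
    w = round_to(weight_fmt, 1.0)
    for _ in range(steps):
        grad = round_to("e", -1.0)  # gradient as an FP16 backward pass would produce it
        w = round_to(weight_fmt, w - lr * grad)  # result is rounded back to storage precision
    return w

if __name__ == "__main__":
    print("FP16 weights:", run_updates("e"))
    print("FP32 weights:", run_updates("f"))
```

Near 1.0 the gap between adjacent FP16 values is about 0.001, so an update of 0.0001 is rounded away on every step and the FP16 weight never moves, while the FP32 weight accumulates all of the updates.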
Instruction tuning
Fine-tunes a pretrained language model on (instruction, response) pairs.
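As a rough sketch of how an (instruction, response) pair might become a training example: the prompt template, the toy tokenizer, and the choice to compute loss only on response tokens are all assumptions here, not something the card specifies. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, used below only as a familiar convention.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one integer id per whitespace-separated word."""
    return [sum(ord(c) for c in word) % 1000 for word in text.split()]

def build_example(instruction, response, tokenize=toy_tokenize):
    """Concatenate prompt and response; mask the prompt so loss covers the response only."""
    # Assumed prompt template; real datasets use many different formats.
    prompt_ids = tokenize(f"### Instruction:\n{instruction}\n### Response:\n")
    response_ids = tokenize(response)
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + response_ids,
    }

if __name__ == "__main__":
    print(build_example("Add 2 and 3", "5"))
```

Masking the prompt is a common design choice rather than a requirement; training on all tokens is also possible.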
Convex optimization
For a convex function, every local minimum is also a global minimum; a strictly convex function has at most one minimizer.
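A small numerical illustration: on a strictly convex function, gradient descent started from very different points ends up at the same minimizer. The function f(x) = (x - 3)^2, the step size, and the helper names are assumptions chosen for the example.

```python
def grad_f(x):
    """Gradient of the strictly convex function f(x) = (x - 3)**2."""
    return 2.0 * (x - 3.0)

def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Plain fixed-step gradient descent from the starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

if __name__ == "__main__":
    for start in (-10.0, 0.0, 25.0):
        print(start, "->", gradient_descent(grad_f, start))
```

On a non-convex function the same procedure can stop at different local minima depending on where it starts.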
Greedy vs dynamic programming
Greedy commits to the locally optimal choice at each step and never revisits it; DP solves every relevant subproblem and combines the results.
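The difference shows up in the classic coin-change problem, used here as an illustrative example that is not part of the card. With the assumed denominations 1, 3 and 4, the greedy strategy of always taking the largest coin is not optimal for an amount of 6, while the DP table finds the best answer.

```python
def greedy_coins(coins, amount):
    """Always take as many of the largest remaining coin as possible."""
    count = 0
    for coin in sorted(coins, reverse=True):
        take = amount // coin
        count += take
        amount -= take * coin
    return count if amount == 0 else None

def dp_coins(coins, amount):
    """best[t] is the fewest coins summing to t, built up from smaller totals."""
    inf = float("inf")
    best = [0] + [inf] * amount
    for total in range(1, amount + 1):
        for coin in coins:
            if coin <= total and best[total - coin] + 1 < best[total]:
                best[total] = best[total - coin] + 1
    return best[amount] if best[amount] != inf else None

if __name__ == "__main__":
    print(greedy_coins([1, 3, 4], 6))  # 4 + 1 + 1 uses 3 coins
    print(dp_coins([1, 3, 4], 6))      # 3 + 3 uses 2 coins
```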
Overlapping subproblems
Dynamic programming applies when the same subproblems recur many times: it stores each subproblem's result so that it is computed only once.
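Fibonacci numbers are a common way to make this concrete; the example and the call counters below are illustrative additions. The naive recursion recomputes the same subproblems over and over, while the cached version computes each one once.

```python
def fib_naive(n, counter):
    """Plain recursion; counter[0] counts every call."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)

def fib_memo(n, counter, cache=None):
    """Same recursion with a cache; counter[0] counts only real computations."""
    if cache is None:
        cache = {}
    if n in cache:
        return cache[n]
    counter[0] += 1
    result = n if n < 2 else fib_memo(n - 1, counter, cache) + fib_memo(n - 2, counter, cache)
    cache[n] = result
    return result

if __name__ == "__main__":
    naive_calls, memo_calls = [0], [0]
    print(fib_naive(20, naive_calls), naive_calls[0])
    print(fib_memo(20, memo_calls), memo_calls[0])
```

The two functions return the same value; only the amount of repeated work differs.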
AdaGrad
Scales each parameter's learning rate by dividing it by the square root of the running sum of that parameter's squared gradients.
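A minimal sketch of that update rule in plain Python. The function name, the default learning rate, and the small `eps` added to the denominator to avoid division by zero are assumptions for the example.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """Update params in place; accum holds the running sum of squared gradients."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        # Effective step size shrinks as squared gradients accumulate.
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum

if __name__ == "__main__":
    params, accum = [1.0, 1.0], [0.0, 0.0]
    for _ in range(3):
        adagrad_step(params, [10.0, 0.1], accum)
        print(params)
```

On the first step both parameters move by roughly `lr` even though their gradients differ by a factor of 100, because each gradient is divided by its own magnitude. Later steps get smaller as the accumulated sum grows.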