mixup: trains on convex combinations of pairs, x̃=λx_i+(1-λ)x_j, using the mixed x̃ in place of the original x_i and x_j

Image: MaedaAkihiko, CC BY-SA 4.0, via Wikimedia Commons

mixup does: trains on convex combinations of pairs: x̃=λx_i+(1-λ)x_j

Trains on convex combinations of pairs: x̃=λx_i+(1-λ)x_j and ỹ=λy_i+(1-λ)y_j, so the model sees x̃ rather than x_i and x_j
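
A minimal sketch of one mixup step in plain Python, to make the formula concrete. The function name `mixup_pair`, the default `alpha=0.2`, and the choice to draw λ from Beta(α, α) are assumptions for illustration; the card itself only gives the convex combination.

```python
import random

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two examples (feature lists and one-hot label lists).

    Assumption: lambda is sampled from Beta(alpha, alpha); any value
    in [0, 1] gives a valid convex combination.
    """
    lam = rng.betavariate(alpha, alpha)
    # x_tilde = lam * x_i + (1 - lam) * x_j, applied element-wise
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    # Labels are mixed with the same lambda
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam
```

The mixed label stays a valid distribution because both inputs sum to 1 and the weights λ and 1-λ sum to 1.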

Related concepts

mixed precision training does: forward and backward passes in FP16, with weight updates accumulated into an FP32 master copy

Mixed precision training: forward and backward passes in FP16, weight updates accumulated into an FP32 master copy
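
A small sketch of why the accumulation is kept in higher precision. It uses the standard library's half-precision `struct` format to simulate FP16 rounding; the helper names `to_fp16` and `run_updates` are hypothetical, and a Python float (64-bit) stands in for the FP32 copy.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def run_updates(steps: int, update: float, keep_master: bool) -> float:
    """Apply `steps` small additive updates to a weight that starts at 1.0."""
    w16 = to_fp16(1.0)
    master = 1.0  # higher-precision copy (stand-in for FP32)
    for _ in range(steps):
        if keep_master:
            master += update       # accumulate in high precision
            w16 = to_fp16(master)  # cast down for the next forward pass
        else:
            # Accumulate directly in FP16: updates below the FP16
            # spacing near 1.0 (about 0.001) are rounded away.
            w16 = to_fp16(w16 + to_fp16(update))
    return w16
```

With 1000 updates of 1e-4, the FP16-only weight never leaves 1.0, while the version with a higher-precision copy ends near 1.1.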

instruction tuning does: fine-tunes on (instruction, response) pairs

Fine-tunes on (instruction, response) pairs
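
One way such a pair might be turned into a training example, as a rough sketch. The marker tokens `<instruction>`, `<response>`, and `<eos>`, the whitespace tokenizer, and the choice to compute loss only on response tokens are all assumptions for illustration, not part of the card.

```python
def build_example(instruction: str, response: str):
    """Return (tokens, loss_mask) for one (instruction, response) pair.

    Assumption: loss is applied only where loss_mask is 1, i.e. on the
    response tokens, so the model learns to answer rather than to
    reproduce the instruction.
    """
    prompt = ["<instruction>"] + instruction.split() + ["<response>"]
    target = response.split() + ["<eos>"]
    tokens = prompt + target
    loss_mask = [0] * len(prompt) + [1] * len(target)
    return tokens, loss_mask
```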

Convex optimization

For a convex function, every local minimum is a global minimum; strict convexity guarantees that the minimizer, if one exists, is unique

Greedy vs dynamic programming: greedy makes locally optimal choices, DP considers all subproblems

Greedy: locally optimal choices; DP: considers all subproblems
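
The contrast can be seen on coin change, sketched below. The function names and the coin set [1, 3, 4] are illustrative choices: greedy always takes the largest coin that fits, while DP builds the best answer for every amount up to the target.

```python
def greedy_coins(coins, amount):
    """Take the largest coin that fits, repeatedly. Not always optimal."""
    count = 0
    for c in sorted(coins, reverse=True):
        count += amount // c
        amount %= c
    return count if amount == 0 else None

def dp_coins(coins, amount):
    """Fewest coins for `amount`, built from answers to all smaller amounts."""
    inf = float("inf")
    best = [0] + [inf] * amount
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a and best[a - c] + 1 < best[a]:
                best[a] = best[a - c] + 1
    return best[amount] if best[amount] != inf else None
```

For coins [1, 3, 4] and amount 6, greedy picks 4+1+1 (3 coins) while DP finds 3+3 (2 coins).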

Overlapping subproblems

Dynamic programming solves overlapping subproblems by storing results of subproblems to avoid redundant calculations
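
Fibonacci numbers are a common illustration, sketched here with hypothetical function names. The naive recursion recomputes the same subproblems many times; the memoized version stores each result once.

```python
from functools import lru_cache

def fib_naive(n, calls):
    """Plain recursion; `calls` is a one-element list used as a call counter."""
    calls[0] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, calls) + fib_naive(n - 2, calls)

@lru_cache(maxsize=None)
def fib_memo(n):
    """Same recurrence, but each subproblem is computed only once."""
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)
```

For n = 20 the naive version makes 21891 calls, while the memoized version computes only the 21 distinct subproblems from 0 to 20.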

AdaGrad does: divides each parameter's learning rate by the square root of its accumulated sum of squared gradients

AdaGrad adjusts the learning rate per parameter by dividing it by the square root of that parameter's accumulated sum of squared gradients
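
A minimal sketch of one AdaGrad update on plain Python lists. The function name `adagrad_step`, the defaults `lr=0.1` and `eps=1e-8`, and placing the small constant `eps` outside the square root are assumptions for illustration.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """Update `params` and `accum` in place and return them.

    accum[k] holds the running sum of squared gradients for parameter k.
    """
    for k in range(len(params)):
        accum[k] += grads[k] ** 2
        # Effective step size shrinks as squared gradients accumulate
        params[k] -= lr * grads[k] / (math.sqrt(accum[k]) + eps)
    return params, accum
```

On the first step the accumulated sum equals the squared gradient, so every parameter moves by roughly `lr` regardless of gradient scale; parameters with consistently large gradients then see their steps shrink faster.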
