
Newton's method has quadratic convergence but requires cubic computational cost per iteration
Newton's method converges quadratically near a solution, but each iteration solves a linear system with the n×n Hessian, which costs O(n³) with a direct solver
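To make the per-iteration cost concrete, here is a minimal sketch of Newton's method in plain Python. The names `solve` and `newton_minimize` and the test objective are illustrative assumptions, not taken from the source. The triple loop in `solve` is where the cubic cost comes from.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting: O(n^3)."""
    n = len(b)
    # Work on an augmented copy so the caller's matrix is not modified.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def newton_minimize(grad, hess, x0, steps=20, tol=1e-12):
    """Iterate x <- x - H(x)^{-1} g(x) until the gradient is tiny."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        if max(abs(gi) for gi in g) < tol:
            break
        step = solve(hess(x), g)  # the O(n^3) part of every iteration
        x = [xi - si for xi, si in zip(x, step)]
    return x


# Hypothetical convex objective: f(x, y) = (x-1)^4 + (x-1)^2 + (y+2)^2,
# whose minimum is at (1, -2).
def grad(p):
    x, y = p
    return [4 * (x - 1) ** 3 + 2 * (x - 1), 2 * (y + 2)]


def hess(p):
    x, _ = p
    return [[12 * (x - 1) ** 2 + 2, 0.0], [0.0, 2.0]]


minimum = newton_minimize(grad, hess, [3.0, 0.0])
```

Once the iterate is close to the minimum, the number of correct digits roughly doubles per step, which is the quadratic convergence the card refers to.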
Time complexity of binary search: O(log n) — halves search space each step
Binary search reduces search space by half with each iteration, achieving O(log n) complexity
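A short sketch of the halving step, assuming a sorted Python list. The `steps` counter is added here only to make the logarithmic bound visible and is not part of the usual algorithm.

```python
def binary_search(items, target):
    """Return (index, steps); index is -1 if target is not in the sorted list."""
    lo, hi = 0, len(items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid, steps
        if items[mid] < target:
            lo = mid + 1   # discard the left half
        else:
            hi = mid - 1   # discard the right half
    return -1, steps


data = list(range(0, 2048, 2))          # 1024 sorted even numbers
index, steps = binary_search(data, 1000)
```

For 1024 elements the loop runs at most 11 times, since each pass leaves at most half of the remaining range.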
Why non-convex loss landscapes are hard: many local minima and saddle points
Non-convex landscapes have numerous local minima and saddle points, complicating optimization
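As a small illustration, the sketch below runs plain gradient descent on a hypothetical one-dimensional non-convex function, f(x) = x⁴ − 3x² + x, chosen here only because it has two minima of different depth. The function, learning rate, and starting points are all assumptions made for the example.

```python
def f(x):
    return x ** 4 - 3 * x ** 2 + x


def df(x):
    return 4 * x ** 3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent; it ends in whichever basin x0 starts in."""
    x = x0
    for _ in range(steps):
        x -= lr * df(x)
    return x


left = gradient_descent(-2.0)   # converges to the deeper minimum (x < 0)
right = gradient_descent(2.0)   # converges to the shallower minimum (x > 0)
```

Both runs stop at a point with zero gradient, but only the run started on the left reaches the lower value of `f`, so the result depends on initialization.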
Why attention is O(n²) in sequence length: every token attends to every other token
Self-attention computes a score for every pair of tokens, so the time and memory for the score matrix grow as O(n²) in the sequence length n
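The pairwise structure can be sketched in a few lines of plain Python. This is a simplified single-head version with no learned projections, and the token embeddings are made-up values for illustration.

```python
import math


def attention_weights(queries, keys):
    """Return the n x n matrix of softmax-normalized attention weights."""
    d = len(queries[0])
    weights = []
    for q in queries:  # n rows
        # One scaled dot product per key: n columns for each row.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical 2-dimensional embeddings for a 4-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
W = attention_weights(tokens, tokens)
```

The nested loops produce n × n scores, so doubling the sequence length quadruples the work and the size of `W`.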
Why proximal gradient descent is needed for L1 optimization
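The L1 penalty is non-differentiable at zero, so proximal gradient descent follows each gradient step with a soft-thresholding step, which sets small coefficients exactly to zero and yields sparse solutions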
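A minimal sketch of the idea, assuming the simple objective 0.5·‖x − b‖² + λ‖x‖₁. The objective, the vector `b`, and the step size are chosen here for illustration only. The proximal operator of the L1 norm is soft-thresholding.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink toward zero, clip small values to 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def proximal_gradient(b, lam, step=0.5, iters=200):
    """Minimize 0.5 * ||x - b||^2 + lam * ||x||_1 by proximal gradient descent."""
    x = [0.0] * len(b)
    for _ in range(iters):
        # Gradient step on the smooth part; its gradient is (x - b).
        z = [xi - step * (xi - bi) for xi, bi in zip(x, b)]
        # Proximal step handles the non-differentiable L1 term.
        x = [soft_threshold(zi, step * lam) for zi in z]
    return x


x_hat = proximal_gradient([3.0, 0.5, -2.0], lam=1.0)
```

For this objective the minimizer is `soft_threshold(b_i, lam)` per coordinate, so the middle entry lands exactly on zero rather than merely close to it.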
How KV-cache reduces redundant computation in autoregressive generation
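A KV-cache stores the key and value vectors of tokens that have already been processed, so each generation step computes them only for the newest token instead of recomputing them for the whole prefix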
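The saving can be seen in a toy single-head example. Everything below is a hypothetical stand-in: `ToyLayer.project_kv` replaces the learned key and value projections with fixed arithmetic, and a counter records how often it is called.

```python
import math


def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z for d in range(dim)]


class ToyLayer:
    def __init__(self):
        self.kv_projections = 0

    def project_kv(self, x):
        # Hypothetical stand-ins for the learned projections W_k x and W_v x.
        self.kv_projections += 1
        return [2.0 * xi for xi in x], [xi + 1.0 for xi in x]


def generate_no_cache(layer, tokens):
    outputs = []
    for t in range(1, len(tokens) + 1):
        # Recompute keys and values for the entire prefix at every step.
        kv = [layer.project_kv(x) for x in tokens[:t]]
        keys = [k for k, _ in kv]
        values = [v for _, v in kv]
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs


def generate_with_cache(layer, tokens):
    keys, values, outputs = [], [], []
    for x in tokens:
        k, v = layer.project_kv(x)  # only the newest token is projected
        keys.append(k)
        values.append(v)
        outputs.append(attend(x, keys, values))
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow, fast = ToyLayer(), ToyLayer()
out_slow = generate_no_cache(slow, tokens)
out_fast = generate_with_cache(fast, tokens)
```

Both functions return the same outputs, but for 4 tokens the uncached version performs 1 + 2 + 3 + 4 = 10 projections while the cached version performs 4. The cache trades that recomputation for memory that grows with the sequence length.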
Time complexity of Dijkstra's algorithm: O((V+E) log V) with a priority queue
Dijkstra's algorithm runs in O((V+E) log V) with a binary-heap priority queue; a Fibonacci heap improves this to O(E + V log V)
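A compact sketch using Python's `heapq` as the binary heap. It uses the common lazy-deletion variant, which skips stale heap entries instead of decreasing keys. The adjacency-list format and the example graph are assumptions made for illustration.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source; graph maps node -> [(neighbor, weight)]."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale entry: a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))  # O(log V) per push
    return dist


# Hypothetical graph with non-negative edge weights.
graph = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 2), ("d", 5)],
    "d": [],
}
distances = dijkstra(graph, "a")
# distances == {"a": 0, "c": 1, "b": 3, "d": 4}
```

With lazy deletion the heap can hold up to E entries, so the bound is O(E log E), which is the same as O(E log V) because E is at most V².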