
Continuous batching adds new requests to a running batch as soon as a slot frees up, without waiting for the whole batch to finish
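A minimal sketch of the scheduling idea, not any particular serving engine's implementation. The `Request` class, its `tokens_left` field, and `continuous_batching` are hypothetical names chosen for illustration; each loop iteration stands in for one decode step.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    tokens_left: int  # decode steps still needed (hypothetical field)


def continuous_batching(requests, max_batch_size):
    """Simulate step-level scheduling; return the step at which each request finishes."""
    waiting = deque(requests)
    running = []
    finished_at = {}
    step = 0
    while waiting or running:
        # Refill free slots before every decode step, not only when the batch drains.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        step += 1
        for req in running:
            req.tokens_left -= 1  # one decode step for every running request
            if req.tokens_left == 0:
                finished_at[req.name] = step
        # Finished requests leave immediately, freeing their slots.
        running = [r for r in running if r.tokens_left > 0]
    return finished_at


reqs = [Request("a", 1), Request("b", 4), Request("c", 2)]
print(continuous_batching(reqs, max_batch_size=2))
```

Here request `c` takes over the slot of `a` after the first step instead of waiting for the longer request `b` to finish, which is what a static batch would force it to do.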
consistent hashing does: minimizes remapping when nodes join/leave
Consistent hashing places keys and nodes on the same hash ring and assigns each key to the next node along the ring, so only keys adjacent to a joining or leaving node are remapped
consistent hashing solves: minimizes key redistribution when servers are added/removed
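Consistent hashing minimizes key redistribution when servers are added or removed: with K keys and n servers, only about K/n keys move on average, whereas hashing modulo n would move nearly all of them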
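The ring idea can be sketched in a few lines. This is an illustrative toy, not a production implementation: `HashRing`, the `vnodes` parameter, and the use of MD5 as the ring hash are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # MD5 is used only as a stable, well-spread hash, not for security.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes  # virtual points per node, to even out the load
        self._ring = []       # sorted list of (hash, node) points
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]


keys = [f"key-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.get(k) for k in keys}
ring.add("node-d")
moved = [k for k in keys if ring.get(k) != before[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

Every key that moves lands on the new node; keys owned by the other nodes stay where they were.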
gradient accumulation simulates larger batch sizes without more memory
Gradient accumulation splits a large batch into smaller micro-batches, sums their gradients over several forward/backward passes, and updates the weights once, so peak memory depends on the micro-batch size rather than the effective batch size
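A small dependency-free sketch of why this works, assuming a one-parameter model y ≈ w * x with mean squared error. The function names and the toy data are hypothetical; a real training loop would use a framework's autograd instead of the hand-written gradient.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for y ≈ w * x over the given samples."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Full-batch gradient computed one micro-batch at a time."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        # Weight each micro-batch gradient by its share of the full batch.
        total += grad_mse(w, mx, my) * len(mx) / n
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
w = 0.5

full = grad_mse(w, xs, ys)
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)
print(abs(full - accum) < 1e-9)

# A single optimizer step is taken only after all micro-batches are processed.
learning_rate = 0.01
w -= learning_rate * accum
```

Only one micro-batch is processed at a time, yet the update matches the one a full batch would produce, up to floating-point rounding.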
paged attention (vLLM) improves serving throughput
Paged attention (vLLM) improves serving throughput by storing the KV cache in fixed-size, non-contiguous blocks, which cuts memory fragmentation and over-reservation so more requests fit in a batch
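The bookkeeping behind paged KV-cache storage can be sketched as a block table, much like virtual-memory paging. This toy is not vLLM's actual code or API: `PagedKVCache`, `append_token`, and `free` are hypothetical names, and only block allocation is modelled, not the attention computation.

```python
class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size              # tokens per physical block
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                    # seq_id -> physical block ids
        self.seq_lens = {}                        # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # No block yet, or the last block is full: allocate on demand.
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for seq in ["a", "b", "a", "b", "a"]:
    cache.append_token(seq)
print(cache.block_tables)  # blocks of one sequence need not be adjacent
cache.free("a")
print(len(cache.free_blocks))
```

Because blocks are allocated only when a sequence actually grows, at most one partially filled block is wasted per sequence, instead of a reservation sized for the maximum output length.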
Tracing
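Tracing records the operations executed on an example input, so data-dependent control flow is fixed to the path taken; scripting parses the Python source, so branches and loops are preserved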
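A toy illustration of the tracing half of this card, assuming it refers to the trace-versus-script distinction found in TorchScript (`torch.jit.trace` and `torch.jit.script`). The code below does not use PyTorch; `Traced`, `trace`, and `replay` are hypothetical helpers that mimic the idea with plain numbers.

```python
class Traced:
    """Stand-in value that logs every arithmetic operation applied to it."""

    def __init__(self, value, tape):
        self.value = value
        self.tape = tape

    def __mul__(self, k):
        self.tape.append(("mul", k))
        return Traced(self.value * k, self.tape)

    def __add__(self, k):
        self.tape.append(("add", k))
        return Traced(self.value + k, self.tape)

    def __gt__(self, k):
        # Returns a plain bool, so the comparison itself is never recorded.
        return self.value > k


def f(x):
    if x > 0:  # data-dependent branch
        return x * 2
    return x + 1


def trace(fn, example):
    """Run fn once on an example input and return the recorded operations."""
    tape = []
    fn(Traced(example, tape))
    return tape


def replay(tape, x):
    """Re-run the recorded operations on a new input."""
    for op, k in tape:
        x = x * k if op == "mul" else x + k
    return x


tape = trace(f, 3)
print(tape)             # [('mul', 2)]
print(replay(tape, 5))  # 10, same as f(5)
print(replay(tape, -5)) # -10, but f(-5) is -4
```

The recorded tape contains only the branch taken for the example input, which is why a trace gives wrong results for inputs that should take the other branch. A scripting approach avoids this by compiling the `if` statement itself from the source.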
Load balancing (computing)
Load balancing distributes tasks across a set of resources so that no single resource is overloaded while others sit idle
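Two common selection policies, sketched under simplifying assumptions: servers are plain strings, there are no health checks or weights, and the class names `RoundRobin` and `LeastConnections` are chosen for illustration only.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation, ignoring current load."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each new task to the server with the fewest active tasks."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def acquire(self):
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


rr = RoundRobin(["s1", "s2", "s3"])
print([rr.pick() for _ in range(4)])

lc = LeastConnections(["s1", "s2", "s3"])
first = lc.acquire()
second = lc.acquire()
lc.release(first)  # the first task finishes early
print(lc.acquire())
```

Round robin is simplest when tasks cost about the same; tracking active connections helps when task durations vary widely.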