
TensorRT optimizes deep learning inference by quantizing and fusing operations for NVIDIA GPUs
Image: BigRiz, CC BY-SA 3.0, via Wikimedia Commons
quantization to INT8 doubles throughput
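Quantization to INT8 can roughly double throughput over FP16 because tensor cores on many NVIDIA GPUs have about 2x the peak rate for INT8; real gains depend on the model and are smaller for memory-bound layers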
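To make "quantization to INT8" concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It only illustrates how float values map to 8-bit integers and back; it does not measure throughput. The function names `quantize_int8` and `dequantize` and the sample weights are illustrative assumptions, not part of TensorRT or any library.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)         # integer codes stored in 8 bits each
print(restored)  # approximations of the original weights
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the accuracy cost paid for the smaller, faster integer representation.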
tensor cores are
Tensor cores are specialized hardware units on NVIDIA GPUs that perform matrix multiply-accumulate on small tiles, typically in reduced or mixed precision
Adam has bias correction: divides by (1-β^t) in early steps
Adam bias correction divides the moment estimates by (1-β^t), which matters most in early steps, to counteract the bias toward zero caused by initializing the moving averages at zero
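The effect of the correction is easiest to see on the very first step. The sketch below is a single-parameter Adam update in plain Python, assuming the commonly used defaults (β1 = 0.9, β2 = 0.999); the function name `adam_step` and the example gradient are illustrative assumptions.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction: undo the shrinkage caused by starting m and v at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v, m_hat, v_hat


param, m, v = 1.0, 0.0, 0.0
param, m, v, m_hat, v_hat = adam_step(param, grad=2.0, m=m, v=v, t=1)

print(m, m_hat)  # raw average is about 0.2, corrected estimate is about 2.0
```

With a gradient of 2.0, the raw first moment after one step is only about 0.2, while the corrected value recovers the actual gradient. As t grows, β^t approaches zero and the correction fades out.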
gradient accumulation simulates larger batch sizes without more memory
Gradient accumulation splits a large effective batch into smaller micro-batches, accumulates their gradients over several forward and backward passes, and updates the weights once, so peak memory stays at the micro-batch level
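A small sketch can check that accumulated micro-batch gradients match the full-batch gradient. It assumes a one-parameter linear model with mean squared error and equal-sized micro-batches; the helper `mse_grad`, the data, and the learning rate are hypothetical.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error with respect to w for y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.0
micro_batch_size = 2
accum_steps = len(xs) // micro_batch_size

accumulated = 0.0
for step in range(accum_steps):
    start = step * micro_batch_size
    end = start + micro_batch_size
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated += mse_grad(w, xs[start:end], ys[start:end]) / accum_steps

full = mse_grad(w, xs, ys)   # gradient computed on the whole batch at once
w_new = w - 0.01 * accumulated  # single weight update after accumulation

print(accumulated, full)
```

Dividing by `accum_steps` is what keeps the update equivalent to one large batch; without it the accumulated gradient would be `accum_steps` times too large.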
CUDA
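CUDA is NVIDIA's parallel computing platform and programming model; it lets developers write kernels that run across thousands of GPU threads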
GPTQ quantization does
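GPTQ is a post-training quantization method that uses approximate second-order (Hessian) information to quantize weights layer by layer, typically to 3 or 4 bits, for model compression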