AWQ (activation-aware weight quantization) protects the small fraction of weights most crucial for model performance, unlike traditional quantization, which treats every weight alike
Image: O'Connor P, Neil D, Liu S, Delbruck T, Pfeiffer M, CC BY 3.0, via Wikimedia Commons
GPTQ vs AWQ: GPTQ uses Hessian-based quantization, AWQ preserves activation-important weights
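GPTQ quantizes weights using approximate Hessian (second-order) information to compensate for rounding error; AWQ uses activation statistics to find the most important weight channels and protects them during quantization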
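To make the AWQ side of this card concrete, here is a minimal NumPy sketch of activation-aware scaling. It is an illustration under stated assumptions, not the reference AWQ implementation: the helper `quantize_rows`, the 4-bit setting, the fixed exponent `alpha = 0.5`, and the toy data are all hypothetical choices made for the example (real implementations typically search for the exponent and quantize in groups).

```python
import numpy as np

def quantize_rows(w, bits=4):
    """Symmetric round-to-nearest quantization with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))   # activations: (tokens, in_features)
x[:, :2] *= 20.0                 # assumption: two input channels carry large activations
w = rng.normal(size=(32, 64))    # weights: (out_features, in_features)

# Per-input-channel scale derived from activation magnitude.
alpha = 0.5                      # assumption: fixed here, usually tuned
s = np.abs(x).mean(axis=0) ** alpha

y_ref = x @ w.T                                  # full-precision output
y_plain = x @ quantize_rows(w).T                 # quantize weights directly
y_scaled = (x / s) @ quantize_rows(w * s).T      # scale up important channels first

err_plain = float(np.mean((y_ref - y_plain) ** 2))
err_scaled = float(np.mean((y_ref - y_scaled) ** 2))
print(f"plain error:  {err_plain:.4f}")
print(f"scaled error: {err_scaled:.4f}")
```

Multiplying a weight column by `s` and dividing the matching activation channel by `s` leaves the full-precision output unchanged, so the scaling only changes how rounding error is distributed across channels.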
weight initialization matters: Xavier/He init keeps activation variance ≈ 1 across layers
Weight initialization stabilizes learning by keeping activation variance roughly constant from layer to layer, so signals neither vanish nor explode with depth
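The effect can be checked numerically. The sketch below pushes random inputs through a stack of ReLU layers and records the mean squared activation after each one, once with He initialization (standard deviation sqrt(2 / fan_in)) and once with a naive unit standard deviation. The function name, width, depth, and batch size are assumptions chosen for the example, and mean squared activation is used as a simple proxy for activation scale.

```python
import numpy as np

def forward_mean_squares(init_std_fn, width=1024, depth=10, batch=256, seed=0):
    """Return the mean squared activation after each ReLU layer."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, width))          # unit-variance inputs
    stats = []
    for _ in range(depth):
        w = rng.normal(scale=init_std_fn(width), size=(width, width))
        a = np.maximum(a @ w, 0.0)               # linear layer followed by ReLU
        stats.append(float(np.mean(a ** 2)))
    return stats

# He init: std = sqrt(2 / fan_in), designed for ReLU layers.
he = forward_mean_squares(lambda fan_in: np.sqrt(2.0 / fan_in))
# Naive init: std = 1 regardless of layer width.
naive = forward_mean_squares(lambda fan_in: 1.0)

print("He    first/last:", he[0], he[-1])
print("naive first/last:", naive[0], naive[-1])
```

With He initialization the activation scale stays in the same range across layers, while the naive choice grows by a large factor at every layer.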
what GPTQ quantization does
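Post-training quantization that uses approximate second-order (Hessian) information to compress a trained model's weights, quantizing layer by layer and adjusting the remaining weights to offset the rounding error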
quantization to INT8 doubles throughput
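Quantization to INT8 can roughly double throughput relative to FP16 on GPUs whose tensor cores offer about 2x the peak INT8 rate; the realized speedup depends on the hardware and on whether the workload is compute-bound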
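Throughput depends on the hardware, but the quantization step itself can be shown portably. The following is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy; the function names and the evenly spaced test input are assumptions for illustration, and production libraries usually add per-channel scales and calibration.

```python
import numpy as np

def quantize_int8(x):
    """Map float values to int8 with a single symmetric scale for the tensor."""
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0                               # all-zero input: any scale works
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return q.astype(np.float32) * scale

x = np.linspace(-1.0, 1.0, 1001, dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)

max_err = float(np.abs(x - x_hat).max())
print("bytes float32:", x.nbytes, "bytes int8:", q.nbytes)
print("max abs error:", max_err, "half step:", scale / 2)
```

Each value is stored in one byte instead of four, and the rounding error is bounded by half a quantization step.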
MoE models have more parameters but similar compute cost
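MoE models spread their parameters across many experts but route each token to only the top-k of them, so per-token compute scales with the active experts rather than with the total parameter count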
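A toy top-k router illustrates the gap between total and active parameters. This is a sketch under simplifying assumptions: each expert is a single square matrix, the gate is a plain linear layer, and the names `moe_forward`, `gate_w`, and `experts` are hypothetical. Real MoE layers use feed-forward experts and batched dispatch.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k):
    """Route each token to its k highest-scoring experts and mix their outputs."""
    logits = x @ gate_w                                # (tokens, n_experts)
    topk = np.argsort(logits, axis=1)[:, -k:]          # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        weights = np.exp(sel - sel.max())
        weights /= weights.sum()                       # softmax over selected experts only
        for wgt, e in zip(weights, topk[t]):
            out[t] += wgt * (x[t] @ experts[e])        # only k experts run per token
    return out, topk

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2
x = rng.normal(size=(4, d))
gate_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

out, topk = moe_forward(x, gate_w, experts, k)

total_expert_params = n_experts * d * d    # parameters stored
active_expert_params = k * d * d           # parameters used per token
print("total:", total_expert_params, "active per token:", active_expert_params)
```

With 8 experts and k = 2, each token touches a quarter of the expert parameters, which is why parameter count and per-token compute diverge.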
gradient accumulation simulates larger batch sizes without more memory
Gradient accumulation splits a large batch into smaller micro-batches and sums their gradients before a single weight update, so peak activation memory is set by the micro-batch rather than by the full batch
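The equivalence behind this card can be verified on a small linear regression. The sketch below compares the gradient of a mean squared error loss over a full batch with the average of gradients over equal-sized micro-batches. The helper `mse_grad` and the sizes are assumptions for the example, and the equality relies on the micro-batches being the same size.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of mean((x @ w - y) ** 2) with respect to w."""
    residual = x @ w - y
    return 2.0 * x.T @ residual / len(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))
y = rng.normal(size=32)
w = rng.normal(size=5)

# Gradient computed on the whole batch at once.
full_grad = mse_grad(w, x, y)

# Same gradient built from 4 micro-batches of 8 samples each.
micro = 8
steps = len(x) // micro
accum_grad = np.zeros_like(w)
for i in range(steps):
    xb = x[i * micro:(i + 1) * micro]
    yb = y[i * micro:(i + 1) * micro]
    accum_grad += mse_grad(w, xb, yb) / steps    # average over micro-batches

print("max difference:", float(np.abs(full_grad - accum_grad).max()))
```

Only one micro-batch needs to be held in memory at a time, while the weight update matches what the full batch would have produced.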