
GPTQ: post-training quantization that uses second-order (Hessian) information to compress a model
Image: Sven Behnke, CC BY-SA 4.0, via Wikimedia Commons
What AWQ does differently
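AWQ (Activation-aware Weight Quantization) protects the small fraction of weights that matter most for model output. It identifies them from activation statistics rather than weight magnitudes and scales those channels before quantization, whereas traditional round-to-nearest quantization treats all weights alike.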
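The idea of scaling activation-important channels before rounding can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference AWQ implementation: the helper names (`quantize_rtn`, `awq_style_quantize`), the fixed exponent `alpha=0.5`, and the synthetic data are all assumptions made for the example, and a real implementation would search for the scaling rather than fix it.

```python
import numpy as np

def quantize_rtn(w, n_bits=4):
    """Symmetric round-to-nearest quantization, one scale per output column."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def awq_style_quantize(w, x_calib, alpha=0.5, n_bits=4):
    """Scale input channels by activation magnitude, then quantize (hypothetical helper)."""
    act_mag = np.abs(x_calib).mean(axis=0)   # mean |activation| per input channel
    s = act_mag ** alpha                     # larger scale for more active channels
    s = s / s.mean()                         # keep scales centered around 1
    w_q = quantize_rtn(w * s[:, None], n_bits)
    return w_q, s

# Synthetic calibration data: a few input channels carry much larger activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))
x[:, :4] *= 20.0
w = rng.normal(size=(64, 32))

w_q, s = awq_style_quantize(w, x)

y_ref = x @ w                      # full-precision output
y_awq = (x / s) @ w_q              # activations are divided by the same scales
y_rtn = x @ quantize_rtn(w)        # plain round-to-nearest baseline

err_awq = np.linalg.norm(y_awq - y_ref) / np.linalg.norm(y_ref)
err_rtn = np.linalg.norm(y_rtn - y_ref) / np.linalg.norm(y_ref)
print(f"relative output error, scaled: {err_awq:.4f}, plain RTN: {err_rtn:.4f}")
```

Multiplying a weight row by `s` and dividing the matching activation channel by `s` leaves the full-precision output unchanged, so the scaling only changes where the rounding error lands.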
GPTQ vs AWQ: GPTQ uses Hessian-based quantization, AWQ preserves activation-important weights
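GPTQ quantizes weights layer by layer and uses approximate second-order (Hessian) information to adjust the not-yet-quantized weights so they compensate for rounding error. AWQ uses activation statistics to find the most important weight channels and scales them to reduce their quantization error.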
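Quantization to INT8 can double throughput
Quantizing to INT8 can roughly double peak matrix-multiply throughput relative to FP16 because tensor cores on recent NVIDIA GPUs execute INT8 operations at about twice the FP16 rate. End-to-end speedups are usually smaller when the workload is limited by memory bandwidth rather than compute.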
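A minimal sketch of what an INT8 matrix multiply computes, assuming symmetric per-tensor quantization (one of several common schemes). The function names are illustrative, and NumPy integer matmul is not fast, so this shows the numerics only, not the hardware speedup.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: map [-max|x|, max|x|] onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a_q, a_scale, b_q, b_scale):
    """Multiply INT8 operands with INT32 accumulation, then rescale to float."""
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)
    return acc.astype(np.float32) * np.float32(a_scale * b_scale)

rng = np.random.default_rng(0)
a = rng.normal(size=(64, 64)).astype(np.float32)
b = rng.normal(size=(64, 64)).astype(np.float32)

a_q, a_scale = quantize_int8(a)
b_q, b_scale = quantize_int8(b)

y_fp = a @ b
y_int8 = int8_matmul(a_q, a_scale, b_q, b_scale)
rel_err = np.linalg.norm(y_int8 - y_fp) / np.linalg.norm(y_fp)
print(f"relative error of INT8 matmul: {rel_err:.4f}")
```

Accumulating in INT32 avoids overflow, since a product of two INT8 values can already exceed the INT8 range.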
Vector quantization
Product quantization compresses vectors by splitting each one into subvectors and quantizing each subvector independently against its own small codebook, so a vector is stored as a short list of codebook indices.
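The split-and-quantize procedure can be sketched with a small hand-rolled k-means. The function names, the choice of 4 subvectors with 16 centroids each, and the random data are assumptions for illustration. Production libraries use more careful training.

```python
import numpy as np

def kmeans(x, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:          # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids

def pq_train(x, n_sub, k):
    """Learn one codebook per subvector slice."""
    sub_dim = x.shape[1] // n_sub
    return [kmeans(x[:, i * sub_dim:(i + 1) * sub_dim], k, seed=i)
            for i in range(n_sub)]

def pq_encode(x, codebooks):
    """Replace each subvector with the index of its nearest centroid."""
    sub_dim = codebooks[0].shape[1]
    codes = np.empty((len(x), len(codebooks)), dtype=np.uint8)
    for i, cb in enumerate(codebooks):
        sub = x[:, i * sub_dim:(i + 1) * sub_dim]
        dists = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        codes[:, i] = dists.argmin(axis=1)
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct approximate vectors by concatenating the chosen centroids."""
    return np.hstack([cb[codes[:, i]] for i, cb in enumerate(codebooks)])

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 32)).astype(np.float32)

codebooks = pq_train(vectors, n_sub=4, k=16)
codes = pq_encode(vectors, codebooks)
approx = pq_decode(codes, codebooks)
```

In this setup each 32-dimensional float32 vector (128 bytes) is stored as 4 one-byte codes, plus the shared codebooks.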
Shannon's source coding theorem: you can't compress below entropy
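Shannon's source coding theorem: no lossless code can use fewer bits per symbol, on average, than the entropy of the source, and codes exist that come arbitrarily close to that limit.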
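The bound can be checked numerically by comparing the entropy of a distribution with the average length of a Huffman code built for it. This is a small sketch with made-up probabilities. Huffman coding is used here only as a convenient example of an optimal prefix code.

```python
import heapq
import math

def entropy_bits(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_code_lengths(probs):
    """Return the Huffman codeword length for each symbol."""
    # Heap entries: (probability, unique tiebreaker, symbols in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:               # merged symbols move one level deeper
            lengths[sym] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def average_length(probs):
    lengths = huffman_code_lengths(probs)
    return sum(p * n for p, n in zip(probs, lengths))

dyadic = [0.5, 0.25, 0.125, 0.125]   # powers of 1/2: the code meets the bound
skewed = [0.9, 0.1]                  # the code cannot go below 1 bit per symbol

for probs in (dyadic, skewed):
    print(f"entropy = {entropy_bits(probs):.3f} bits, "
          f"Huffman average = {average_length(probs):.3f} bits")
```

For the dyadic distribution the average code length equals the entropy, while for the skewed one it stays above it.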
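ALiBi extrapolates to longer sequences better than learned position embeddings
ALiBi (Attention with Linear Biases) adds no position embeddings. Instead, it subtracts a head-specific penalty, proportional to the distance between query and key, from each attention score. Because the bias depends only on relative distance and needs no fixed-size embedding table, the model can be run on sequences longer than those it was trained on.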
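A minimal sketch of the bias ALiBi adds to attention scores, assuming a causal model and a power-of-two number of heads (the case where the slopes form a simple geometric sequence). The function names are illustrative.

```python
import numpy as np

def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ...; assumes n_heads is a power of two."""
    start = 2.0 ** (-8.0 / n_heads)
    return np.array([start ** (h + 1) for h in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    """Bias of shape (n_heads, seq_len, seq_len): -slope * (query_pos - key_pos)."""
    pos = np.arange(seq_len)
    distance = pos[None, :] - pos[:, None]   # key index minus query index
    distance = np.minimum(distance, 0)       # future keys are left to the causal mask
    return alibi_slopes(n_heads)[:, None, None] * distance[None, :, :]

bias = alibi_bias(n_heads=8, seq_len=5)
print(alibi_slopes(8))
print(bias[0])   # head with the steepest slope
```

The bias is added to the query-key scores before the softmax, and since it is computed from positions alone it can be generated for any sequence length at inference time.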