
GPTQ applies Hessian-based quantization, AWQ retains weights crucial for activations
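What AWQ does differently
AWQ (activation-aware weight quantization) uses activation statistics to identify the small fraction of weight channels that matter most for model output. Rather than keeping those weights in higher precision, it protects them with per-channel scaling before quantizing everything to the same low bit width.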
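The scaling idea can be illustrated with a toy example. This is only a sketch under simplifying assumptions, not the AWQ algorithm itself: the weights, activations, and the scale of 2.0 are hand-picked for illustration, whereas AWQ searches for scales using calibration data and quantizes in groups. The helper names `quantize` and `output` are hypothetical.

```python
def quantize(weights, n_bits=4):
    """Symmetric round-to-nearest quantization with one shared step size."""
    qmax = 2 ** (n_bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    return [round(w / step) * step for w in weights]


def output(weights, acts):
    """Dot product of one weight row with an activation vector."""
    return sum(w * x for w, x in zip(weights, acts))


# Illustrative values (assumptions): channel 1 sees a large activation,
# so an error on its weight is amplified in the output.
weights = [0.70, 0.14, -0.33, 0.02]
acts = [1.0, 10.0, 1.0, 1.0]
scales = [1.0, 2.0, 1.0, 1.0]  # hand-picked scale for the salient channel

exact = output(weights, acts)

# Plain round-to-nearest quantization of the weights.
plain_err = abs(output(quantize(weights), acts) - exact)

# Scale the salient weight up and its activation down by the same factor.
# The exact product is unchanged, but the weight is quantized more finely
# relative to its size.
scaled_w = [w * s for w, s in zip(weights, scales)]
scaled_x = [x / s for x, s in zip(acts, scales)]
scaled_err = abs(output(quantize(scaled_w), scaled_x) - exact)

print(f"plain error:  {plain_err:.2f}")
print(f"scaled error: {scaled_err:.2f}")
```

The benefit here depends on the scaled weight staying below the largest weight in the group, so the shared step size does not change. A scale that raises the maximum makes the step coarser for every other weight, which is one reason the scale has to be tuned rather than set arbitrarily large.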
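What GPTQ quantization does
GPTQ is a post-training quantization method that compresses a model layer by layer. It uses approximate second-order (Hessian) information, estimated from calibration data, to adjust the not-yet-quantized weights so they compensate for the error introduced by each quantized weight.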
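What grouped query attention (GQA) does
GQA lets each group of query heads share a single key/value head. This shrinks the KV cache and the memory bandwidth needed at inference, and sits between multi-head attention (one KV head per query head) and multi-query attention (one KV head for all query heads).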
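A small sketch of the head sharing and its effect on KV cache size. The configuration (32 query heads, 8 KV heads, 32 layers, head dimension 128, sequence length 4096) is a hypothetical example, and the contiguous grouping of query heads is an assumption about layout; the function names are illustrative only.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Map a query head index to the KV head it shares (contiguous groups)."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be divisible by n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def kv_cache_elements(n_layers: int, n_kv_heads: int, head_dim: int,
                      seq_len: int, batch: int = 1) -> int:
    """Count cached values; the factor 2 covers keys plus values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch


# Hypothetical configuration, chosen only for illustration.
mapping = [kv_head_for_query_head(h, n_q_heads=32, n_kv_heads=8) for h in range(32)]

mha = kv_cache_elements(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_elements(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
ratio = mha // gqa

print(mapping[:8])  # query heads 0-3 share KV head 0, heads 4-7 share KV head 1
print(ratio)        # the cache shrinks by n_q_heads / n_kv_heads
```

The number of query heads is unchanged, so the saving comes from the key and value projections and, above all, from the cache that has to be read at every decoding step.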
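How batch size affects generalization: larger batches find sharper minima
Larger batches give lower-noise gradient estimates and tend to converge to sharper minima, which are associated with worse generalization. The gradient noise of smaller batches acts as an implicit regularizer that tends to steer training toward flatter minima, which often generalize better.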
PCA vs t-SNE: PCA preserves global variance linearly, t-SNE preserves local structure nonlinearly
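PCA is a linear projection onto the directions of greatest variance, so it preserves global structure. t-SNE is a nonlinear embedding that preserves local neighborhoods; distances between well-separated clusters in a t-SNE plot are not reliable.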
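Why AdaGrad's learning rate decays to zero
AdaGrad divides the learning rate for each parameter by the square root of that parameter's accumulated squared gradients. Because this sum can only grow, the effective learning rate shrinks monotonically and can approach zero over long training runs, stalling learning.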
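The decay can be seen by tracking the effective step size for a single parameter. This is a minimal sketch, assuming the common form of the update in which the base rate is divided by sqrt(accumulated squared gradients) + eps; some implementations place eps inside the square root. The function name and the constant gradient of 1.0 are illustrative assumptions.

```python
import math


def adagrad_effective_rates(grads, lr=0.1, eps=1e-8):
    """Return the effective learning rate AdaGrad applies at each step."""
    accum = 0.0  # running sum of squared gradients; it never decreases
    rates = []
    for g in grads:
        accum += g * g
        rates.append(lr / (math.sqrt(accum) + eps))
    return rates


# With a constant gradient of 1.0 the sum after t steps is t,
# so the effective rate falls off as lr / sqrt(t).
rates = adagrad_effective_rates([1.0] * 100)

print(f"step 1:   {rates[0]:.4f}")
print(f"step 100: {rates[-1]:.4f}")
```

In this setting the decay is proportional to 1/sqrt(t), which is slow but never reverses. RMSProp and Adam replace the running sum with an exponential moving average for this reason.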