MoE models distribute parameters across many experts but activate only k of them per token, reducing compute cost per token
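A minimal sketch of top-k routing, assuming a softmax router and scalar toy experts. The names `top_k_route` and `moe_forward` are hypothetical and chosen for illustration; real MoE layers route batches of vectors through feed-forward experts.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Four hypothetical experts; each is just a scalar function here.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
logits = [0.1, 2.0, 1.5, -1.0]
routes = top_k_route(logits, k=2)
y = moe_forward(3.0, experts, logits, k=2)
```

Only two of the four experts are evaluated for this input, which is where the compute saving comes from: parameter count grows with the number of experts, while per-token work grows with k.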
Why load balancing loss is needed in MoE
Load balancing loss in MoE prevents expert collapse, where the router sends most tokens to a few experts, by penalizing uneven token distribution across experts
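One common form of this auxiliary loss, used in the Switch Transformer, multiplies the fraction of tokens dispatched to each expert by the mean router probability for that expert. The sketch below assumes top-1 routing and omits the weighting coefficient; the function name and the toy inputs are illustrative only.

```python
def load_balancing_loss(router_probs):
    """router_probs: one list of expert probabilities per token.

    Returns N * sum_i f_i * p_i, where f_i is the fraction of tokens whose
    top-1 expert is i and p_i is the mean router probability for expert i.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda i: probs[i])] += 1
    f = [c / num_tokens for c in counts]
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

balanced = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]])   # tokens split evenly
collapsed = load_balancing_loss([[0.9, 0.1], [0.8, 0.2]])  # all tokens pick expert 0
```

With two experts the balanced case evaluates to 1.0 and the collapsed case to 1.7, so adding this term to the training loss pushes the router away from collapse.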
Mixture of experts
Mixture of experts (MoE) uses multiple expert networks to divide a problem space into homogeneous regions
What AWQ does differently
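AWQ (Activation-aware Weight Quantization) uses activation statistics to find the small fraction of weights most crucial for model performance and protects them during quantization, unlike traditional quantization, which treats all weights alike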
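A toy sketch of the idea, under heavy simplification: weights on input channels with large activations are scaled up before round-to-nearest quantization, and the matching inputs are scaled down so the product is unchanged in exact arithmetic. The fixed `alpha=0.5`, the single weight row, the single calibration sample, and all function names are assumptions for illustration; the actual method searches for the scaling on calibration data.

```python
def quantize(weights, bits=3):
    """Symmetric round-to-nearest quantization with one shared step size."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    return [round(w / step) * step for w in weights]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def awq_style_output(weights, x, act_magnitude, alpha=0.5):
    """Scale weights up by s and inputs down by s, then quantize the weights."""
    s = [m ** alpha for m in act_magnitude]
    scaled_w = [w * si for w, si in zip(weights, s)]
    scaled_x = [xi / si for xi, si in zip(x, s)]
    return dot(quantize(scaled_w), scaled_x)

weights = [0.5, 1.2, -0.5, 0.3]
x = [10.0, 1.0, 1.0, 1.0]  # channel 0 carries much larger activations
exact = dot(weights, x)
plain_error = abs(dot(quantize(weights), x) - exact)
awq_error = abs(
    awq_style_output(weights, x, act_magnitude=[abs(v) for v in x]) - exact
)
```

In this toy case the rounding error on channel 0 is multiplied by its large activation under plain quantization, while the scaled version spends more of the quantization grid on that channel, so `awq_error` comes out smaller than `plain_error`.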
Adam combines momentum and RMSprop: adapts per-parameter learning rates
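Adam combines momentum (a running average of gradients) with RMSprop-style scaling (a running average of squared gradients), giving each parameter its own effective learning rate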
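The update can be sketched in a few lines of plain Python. This follows the standard Adam equations with bias correction; the function name `adam_step` and the list-based parameters are illustrative choices, not any library's API.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # momentum: average of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g    # RMSprop: average of squared gradients
        m_hat = m_i / (1 - beta1 ** t)             # bias correction for zero init
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

params = [1.0, 1.0]
grads = [2.0, 200.0]  # gradients differ by a factor of 100
stepped, m, v = adam_step(params, grads, [0.0, 0.0], [0.0, 0.0], t=1, lr=0.01)
```

On the first step both parameters move by roughly `lr` even though their gradients differ by 100x, which is the per-parameter adaptation in action.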
KV-cache reduces redundant computation in autoregressive generation
KV-cache stores the attention keys and values computed for previous tokens, so each generation step in an autoregressive model computes them only for the newest token
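A minimal single-head sketch, assuming the per-token query, key, and value vectors are already given (in a real model they come from learned projections). The `KVCache` class and the toy vectors are hypothetical.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention of one query over all given positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

class KVCache:
    """Keeps keys and values from earlier steps so they are never recomputed."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        self.keys.append(key)      # only the newest token's key/value is added
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Toy (query, key, value) triples for three generated tokens.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]
cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]

# Reference: rebuild the full key/value lists from scratch at every step.
full_outputs = [
    attend(tokens[t][0],
           [k for _, k, _ in tokens[: t + 1]],
           [v for _, _, v in tokens[: t + 1]])
    for t in range(len(tokens))
]
```

Both paths give the same outputs; the cached path simply avoids rebuilding keys and values for earlier tokens at each step, at the cost of memory that grows with sequence length.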
Tesla Model Y
Tesla Model Y was the world's best-selling electric vehicle in 2023