MoE models spread parameters across many experts but activate only k per token, keeping per-token compute low

Image: Unknown author, Public domain, via Wikimedia Commons

MoE models have more parameters than a dense model but similar per-token compute cost

An MoE layer spreads its parameters across many experts, but a router activates only k of them per token, so per-token compute depends on k rather than on the total number of experts
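The gap between total and active parameters can be checked with simple arithmetic. The sketch below is illustrative only: it assumes each expert is a two-matrix feed-forward block with biases ignored, plus a linear router, and the function name and layer sizes are hypothetical.

```python
def moe_ffn_params(d_model, d_ff, n_experts, k):
    """Return (total, active_per_token) weight counts for one MoE feed-forward layer.

    Assumption: each expert holds two matrices (d_model x d_ff and d_ff x d_model),
    and the router is a single d_model x n_experts matrix. Biases are ignored.
    """
    per_expert = 2 * d_model * d_ff
    router = d_model * n_experts
    total = n_experts * per_expert + router   # weights stored in memory
    active = k * per_expert + router          # weights actually used for one token
    return total, active


# Hypothetical sizes: 8 experts, 2 active per token.
total, active = moe_ffn_params(d_model=1024, d_ff=4096, n_experts=8, k=2)
print(f"total={total:,} active={active:,} ratio={total / active:.2f}")
```

With these assumed sizes the layer stores roughly four times as many weights as it uses for any single token.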

Related concepts

Why load balancing loss is needed in MoE

A load balancing loss prevents expert collapse, where the router sends most tokens to a few experts, by penalizing uneven token assignment during training
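One common form of this auxiliary loss, used with top-1 routing, multiplies the fraction of tokens each expert receives by the mean router probability for that expert, then sums over experts and scales by the expert count. The sketch below assumes that form; the function name and the toy router outputs are hypothetical, and real implementations usually multiply the result by a small coefficient.

```python
def load_balancing_loss(router_probs):
    """Auxiliary loss for top-1 routing: n_experts * sum_i f_i * p_i.

    router_probs: one list of per-expert probabilities for each token.
    f_i is the fraction of tokens whose top-1 expert is i,
    p_i is the mean router probability assigned to expert i.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])

    counts = [0] * n_experts
    for probs in router_probs:
        counts[probs.index(max(probs))] += 1
    f = [c / n_tokens for c in counts]

    p = [sum(probs[i] for probs in router_probs) / n_tokens
         for i in range(n_experts)]

    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Toy router outputs for two tokens and two experts (made-up values).
balanced = [[0.9, 0.1], [0.1, 0.9]]    # each expert gets one token
collapsed = [[0.9, 0.1], [0.8, 0.2]]   # expert 0 gets every token

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
```

The loss is smallest when tokens and probability mass are spread evenly, so the collapsed routing is penalized more than the balanced one.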

Mixture of experts

Mixture of experts (MoE) uses multiple expert networks to divide the problem space into homogeneous regions, each handled by a specialized expert

What AWQ does differently

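AWQ (activation-aware weight quantization) uses activation statistics to find the small set of weights most crucial for model performance and protects them during quantization, unlike traditional methods that treat all weights alike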
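The idea of protecting salient weights can be illustrated with a toy example: scaling up a weight channel that sees large activations before rounding, and dividing the activation by the same factor, shrinks that channel's rounding error. This is a minimal sketch, not the AWQ implementation: the function names, weights, activations, and the hand-picked scale of 2 are all assumptions, whereas AWQ derives its scales from activation statistics.

```python
def quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization with one shared step size."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / step))) * step for w in weights]


def output_error(weights, acts, scales):
    """Absolute error of the quantized dot product when channel i is scaled by scales[i]."""
    exact = sum(w * x for w, x in zip(weights, acts))
    scaled_q = quantize([w * s for w, s in zip(weights, scales)])
    # The activation is divided by the same scale, so the exact product is unchanged.
    approx = sum(wq * (x / s) for wq, x, s in zip(scaled_q, acts, scales))
    return abs(exact - approx)


# Made-up values: channel 0 sees large activations, so its weight is salient.
weights = [0.2, -0.45, 1.0, 0.6]
acts = [10.0, 1.0, 1.0, 1.0]

plain_err = output_error(weights, acts, [1.0, 1.0, 1.0, 1.0])
awq_err = output_error(weights, acts, [2.0, 1.0, 1.0, 1.0])  # hand-picked scale
```

In this toy case the scaled channel does not change the shared step size, so its rounding error is reduced while the other channels are unaffected.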

Adam combines momentum and RMSprop: adapts per-parameter learning rates

Adam combines momentum (a running average of gradients) with RMSprop-style scaling (a running average of squared gradients), giving each parameter its own adaptive step size
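A single Adam update can be written in a few lines. The sketch below follows the standard update rule with bias correction and the usual default hyperparameters; the function name `adam_step` and the toy parameters and gradients are hypothetical.

```python
import math


def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # momentum: running mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g    # RMSprop: running mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)             # bias correction for the zero initialization
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Two parameters whose gradients differ by a factor of 100 (made-up values).
params = [1.0, -100.0]
grads = [0.5, -50.0]
new_params, m, v = adam_step(params, grads, m=[0.0, 0.0], v=[0.0, 0.0], t=1, lr=0.1)
```

On the first step both parameters move by roughly `lr` despite the very different gradient magnitudes, which is the per-parameter adaptation at work.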

KV-cache reduces redundant computation in autoregressive generation

KV-cache stores the attention keys and values already computed for earlier tokens, so each new token in autoregressive generation only needs its own key, value, and query computed
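The caching pattern can be sketched with single-head attention over plain lists. This is a toy illustration under stated assumptions: the `KVCache` class and `attend` function are hypothetical names, and the query, key, and value vectors are made-up values standing in for already-projected token representations.

```python
import math


def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    peak = max(scores)                       # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * val[d] for w, val in zip(weights, values)) / total
            for d in range(dim)]


class KVCache:
    """Keeps keys and values of past tokens so they are never recomputed."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Only the newest token's key and value are appended each step.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


# Made-up (query, key, value) triples for three generated tokens.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]

# Reference: rebuild the full key/value lists from scratch at every step.
full_outputs = [
    attend(tokens[t][0],
           [tok[1] for tok in tokens[: t + 1]],
           [tok[2] for tok in tokens[: t + 1]])
    for t in range(len(tokens))
]
```

Both paths give the same outputs; the cached version simply avoids rebuilding the keys and values of earlier tokens at every step.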

Tesla Model Y

Tesla Model Y was the world's best-selling electric vehicle in 2023
