Load balancing loss in MoE prevents expert collapse by distributing workload evenly across experts

Image: erwinboogert, CC BY-SA 3.0, via Wikimedia Commons

load balancing loss is needed in MoE

An auxiliary load balancing loss penalizes a router that sends most tokens to a few experts, so every expert keeps receiving tokens and gradients instead of collapsing onto a small subset
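A minimal sketch of one common form of this loss, the auxiliary loss popularized by Switch Transformer: num_experts * sum_i(f_i * P_i), where f_i is the fraction of tokens whose top-1 expert is i and P_i is the mean router probability for expert i. The function names, the toy logits, and the `alpha` default are assumptions made for illustration, and real implementations work on batched tensors rather than Python lists.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(router_logits, alpha=1.0):
    """router_logits: one list of per-expert logits for each token."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens routed (top-1) to expert i.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / num_tokens for c in counts]

    # P[i]: mean router probability assigned to expert i.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))

# Hypothetical toy inputs: 4 tokens, 2 experts.
balanced_logits = [[5.0, 0.0], [0.0, 5.0], [5.0, 0.0], [0.0, 5.0]]
collapsed_logits = [[5.0, 0.0]] * 4

balanced_loss = load_balancing_loss(balanced_logits)
collapsed_loss = load_balancing_loss(collapsed_logits)
```

In this toy setup the balanced router scores lower than the collapsed one, which is the pressure that pushes training away from expert collapse.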

Related concepts

Load balancing (computing)

Load balancing distributes a set of tasks over a set of resources so that overall processing is more efficient and no single resource is overloaded

MoE models have more parameters but similar compute cost

MoE models spread their parameters across many experts but route each token to only k of them, so per-token compute stays close to that of a much smaller dense model
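The gap between total and active parameters can be checked with simple arithmetic. The sketch below assumes each expert is a two-matrix feed-forward block (biases ignored) plus one linear router; the function name and the sizes are hypothetical and not taken from any particular model.

```python
def moe_ffn_param_counts(d_model, d_ff, num_experts, k):
    # Assumed expert shape: two weight matrices, d_model x d_ff and d_ff x d_model.
    per_expert = 2 * d_model * d_ff
    # Assumed router: a single linear layer producing one logit per expert.
    router = d_model * num_experts
    total = num_experts * per_expert + router   # parameters stored
    active = k * per_expert + router            # parameters used per token
    return total, active

# Hypothetical sizes chosen only for illustration.
total_params, active_params = moe_ffn_param_counts(
    d_model=1024, d_ff=4096, num_experts=8, k=2
)
ratio = total_params / active_params
```

With 8 experts and k = 2, the layer stores roughly four times as many parameters as it touches for any single token.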

Mixture of experts

Mixture of experts (MoE) uses multiple expert networks, selected by a gating function, to divide a problem space into homogeneous regions

kernel fusion reduces memory bandwidth bottleneck

Kernel fusion reduces memory bandwidth bottleneck by combining multiple operations into a single kernel, minimizing data transfers
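The effect on memory traffic can be illustrated with a small simulation. This is not GPU code: it is a sketch that counts loads and stores for y = relu(a * x + b), computed either as three separate element-wise passes or as one fused pass. `CountingBuffer` and the helper names are hypothetical.

```python
class CountingBuffer:
    """Stands in for a global-memory array and counts accesses."""
    def __init__(self, data):
        self.data = list(data)
        self.reads = 0
        self.writes = 0

    def load(self, i):
        self.reads += 1
        return self.data[i]

    def store(self, i, value):
        self.writes += 1
        self.data[i] = value

def elementwise(src, fn):
    # One "kernel": read every input element, write every output element.
    dst = CountingBuffer([0.0] * len(src.data))
    for i in range(len(src.data)):
        dst.store(i, fn(src.load(i)))
    return dst

def traffic(*buffers):
    return sum(b.reads + b.writes for b in buffers)

a, b = 2.0, -1.0
values = [-1.0, 0.0, 0.25, 0.5, 1.0, 3.0]
n = len(values)

# Unfused: three kernels, each intermediate round-trips through memory.
x1 = CountingBuffer(values)
t1 = elementwise(x1, lambda v: a * v)
t2 = elementwise(t1, lambda v: v + b)
y_unfused = elementwise(t2, lambda v: max(0.0, v))
unfused_traffic = traffic(x1, t1, t2, y_unfused)

# Fused: one kernel, intermediates stay in local variables ("registers").
x2 = CountingBuffer(values)
y_fused = elementwise(x2, lambda v: max(0.0, a * v + b))
fused_traffic = traffic(x2, y_fused)
```

Both versions produce the same output, but in this model the unfused version moves 6 values per element and the fused version moves 2.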

gradient checkpointing trades compute for memory by recomputing activations

Gradient checkpointing trades extra computation for memory savings: it stores only a subset of activations during the forward pass and recomputes the rest during the backward pass
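A toy sketch of the trade-off, assuming a chain of identical scalar layers. The full version keeps every layer input; the checkpointed version keeps one input per segment and recomputes the others when the backward pass reaches that segment. The layer function, the segment size, and all names are assumptions chosen for illustration; frameworks apply the same idea to tensors and autograd graphs.

```python
import math

def layer(x):
    # Hypothetical toy layer.
    return math.tanh(x) + 0.5 * x

def layer_grad(x):
    # Derivative of the toy layer with respect to its input.
    return 1.0 - math.tanh(x) ** 2 + 0.5

def backward_full(x0, n):
    # Store the input of every layer, then backpropagate.
    inputs = [x0]
    for _ in range(n - 1):
        inputs.append(layer(inputs[-1]))
    grad = 1.0
    for x in reversed(inputs):
        grad *= layer_grad(x)
    return grad, len(inputs)

def backward_checkpointed(x0, n, segment):
    # Forward: keep only the input of the first layer in each segment.
    checkpoints = []
    x = x0
    for i in range(n):
        if i % segment == 0:
            checkpoints.append(x)
        x = layer(x)

    grad = 1.0
    recomputed = 0
    peak = len(checkpoints)
    # Backward: rebuild one segment's inputs at a time, last segment first.
    for s in reversed(range(len(checkpoints))):
        start = s * segment
        end = min(start + segment, n)
        inputs = [checkpoints[s]]
        for _ in range(end - start - 1):
            inputs.append(layer(inputs[-1]))
            recomputed += 1
        # inputs[0] is the checkpoint itself, so it is not counted twice.
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for x in reversed(inputs):
            grad *= layer_grad(x)
    return grad, peak, recomputed

grad_full, stored_full = backward_full(0.3, 16)
grad_ckpt, stored_peak, extra_forward = backward_checkpointed(0.3, 16, 4)
```

For 16 layers in segments of 4, the checkpointed pass holds fewer activations at its peak and pays for that with extra forward evaluations, while producing the same gradient.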

warp divergence kills performance

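Warp divergence occurs when threads in the same warp take different branch paths; the hardware runs each path in turn with the non-participating threads masked off, leading to idle lanes and reduced throughput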
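The cost can be approximated with a deliberately simplified model: a warp of 32 lanes executes every branch path that at least one lane takes, one path after another. The sketch below is a simulation in plain Python, not GPU code; the cycle counts and function names are hypothetical, and real hardware adds details such as predication and reconvergence that this model ignores.

```python
WARP_SIZE = 32

def warp_cycles(branch_of_lane, path_cost):
    # Every distinct path taken by some lane is executed once, serially.
    return sum(path_cost[path] for path in set(branch_of_lane))

def lane_utilization(branch_of_lane, path_cost):
    # Useful lane-cycles divided by lane-cycles the warp occupies.
    useful = sum(path_cost[path] for path in branch_of_lane)
    occupied = warp_cycles(branch_of_lane, path_cost) * len(branch_of_lane)
    return useful / occupied

# Assumed per-path costs in cycles.
path_cost = {"then": 10, "else": 10}

# Uniform: every lane takes the same branch.
uniform = ["then"] * WARP_SIZE
# Divergent: even and odd lanes take different branches.
divergent = ["then" if lane % 2 == 0 else "else" for lane in range(WARP_SIZE)]

uniform_cycles = warp_cycles(uniform, path_cost)
divergent_cycles = warp_cycles(divergent, path_cost)
uniform_util = lane_utilization(uniform, path_cost)
divergent_util = lane_utilization(divergent, path_cost)
```

In this model the divergent warp takes twice as many cycles as the uniform one and keeps only half of its lanes busy at any time.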
