Load balancing loss in MoE discourages expert collapse by pushing the router to spread tokens evenly across experts
Image: erwinboogert, CC BY-SA 3.0, via Wikimedia Commons
Load balancing (computing)
Load balancing distributes a set of tasks over a set of resources so that overall processing is more efficient
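For the MoE case, one common way to write a load balancing loss is the auxiliary term used in Switch Transformer-style routers: alpha * N * sum_i f_i * P_i, where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i. The sketch below assumes top-1 routing and that formulation; the function name, the default `alpha`, and the toy inputs are illustrative and not taken from the source.

```python
from collections import Counter

def load_balancing_loss(router_probs, alpha=0.01):
    """Auxiliary loss alpha * N * sum_i f_i * P_i over N experts.

    router_probs: one list of N softmax probabilities per token.
    f_i: fraction of tokens whose top-1 expert is i.
    P_i: mean router probability assigned to expert i.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # Top-1 assignment per token (assumption: ties go to the lowest index).
    top1 = Counter(max(range(num_experts), key=p.__getitem__) for p in router_probs)
    f = [top1[i] / num_tokens for i in range(num_experts)]
    P = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Tokens alternate between two experts: routing is balanced.
balanced = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
# Every token prefers expert 0: the router has collapsed.
collapsed = [[0.9, 0.1]] * 4

loss_balanced = load_balancing_loss(balanced)
loss_collapsed = load_balancing_loss(collapsed)
```

With these inputs the collapsed router scores higher than the balanced one, so minimizing the term nudges the router toward even utilization.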
MoE models have more parameters than dense models at similar per-token compute cost
MoE models spread parameters across many experts but route each token to only k of them, so per-token compute stays close to that of a much smaller dense model
Mixture of experts
Mixture of experts (MoE) uses multiple expert networks to divide a problem space into homogeneous regions
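A minimal sketch of the "many parameters, few active" idea, assuming top-k routing with a softmax over the selected logits only (one common choice, not the only one). The helper names `top_k_route` and `ffn_param_counts` and all of the numbers are hypothetical.

```python
import math

def top_k_route(logits, k):
    """Return (expert_index, weight) pairs for the k largest router logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - m) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def ffn_param_counts(num_experts, k, params_per_expert):
    """Expert parameters stored in the model vs. parameters used per token."""
    return num_experts * params_per_expert, k * params_per_expert

# One token's router logits over four experts; only two experts run.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)

# Eight experts stored, two active per token.
total_params, active_params = ffn_param_counts(
    num_experts=8, k=2, params_per_expert=1_000_000
)
```

Here the layer stores four times as many expert parameters as any single token touches, which is where the extra capacity at similar compute comes from.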
Kernel fusion reduces the memory bandwidth bottleneck
Kernel fusion combines multiple operations into a single kernel so that intermediate results stay in registers or on-chip memory instead of being written to and re-read from global memory
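Python cannot show a real GPU kernel, but a toy model can count the memory transfers that fusion removes. The sketch below assumes an elementwise chain (scale, add bias, ReLU) and treats every element load or store as one transfer; the `Memory` class and function names are hypothetical.

```python
class Memory:
    """Toy global memory that counts element-sized transfers."""

    def __init__(self):
        self.transfers = 0

    def load(self, buf, i):
        self.transfers += 1
        return buf[i]

    def store(self, buf, i, value):
        self.transfers += 1
        buf[i] = value

def elementwise_kernel(mem, fn, src, dst):
    """One 'kernel launch': load each element, apply fn, store the result."""
    for i in range(len(src)):
        mem.store(dst, i, fn(mem.load(src, i)))

def unfused(x, scale, bias):
    """Three kernels; each intermediate is written out and read back."""
    mem = Memory()
    t1, t2, out = [0.0] * len(x), [0.0] * len(x), [0.0] * len(x)
    elementwise_kernel(mem, lambda v: v * scale, x, t1)
    elementwise_kernel(mem, lambda v: v + bias, t1, t2)
    elementwise_kernel(mem, lambda v: max(v, 0.0), t2, out)
    return out, mem.transfers

def fused(x, scale, bias):
    """One kernel; intermediates live in a local value (a register on a GPU)."""
    mem = Memory()
    out = [0.0] * len(x)
    elementwise_kernel(mem, lambda v: max(v * scale + bias, 0.0), x, out)
    return out, mem.transfers

x = [-2.0, -1.0, 0.0, 1.0]
out_unfused, traffic_unfused = unfused(x, 2.0, 1.0)
out_fused, traffic_fused = fused(x, 2.0, 1.0)
```

Both versions produce the same output, but in this model the fused version moves one third as many elements through memory.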
Gradient checkpointing trades compute for memory
Gradient checkpointing stores only a subset of activations during the forward pass and recomputes the rest during the backward pass, trading extra computation time for memory savings
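The trade-off can be sketched on a chain of scalar layers with a hand-written backward pass. This is an illustration under simplifying assumptions (scalar `tanh` layers, a fixed segment length, gradient with respect to the input only), not how any particular framework implements it; all names are hypothetical.

```python
import math

def layer(w, x):
    return math.tanh(w * x)

def layer_grad(w, x):
    """d layer / d x; it needs the layer's input x (the saved activation)."""
    return w * (1.0 - math.tanh(w * x) ** 2)

def grad_store_all(weights, x0):
    """Baseline: keep every layer input for the backward pass."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    g = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        g *= layer_grad(w, x_in)
    return g, len(inputs)

def grad_checkpointed(weights, x0, segment):
    """Keep only every `segment`-th layer input; recompute the rest in backward."""
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)
    g, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        # Recompute this segment's inputs from its checkpoint.
        x, inputs = checkpoints[start], []
        for w in weights[start:end]:
            inputs.append(x)
            x = layer(w, x)
        # The segment's first input is the checkpoint itself, so subtract 1.
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for w, x_in in zip(reversed(weights[start:end]), reversed(inputs)):
            g *= layer_grad(w, x_in)
    return g, peak

weights = [0.9, 1.1, 0.8, 1.2] * 4  # 16 layers
g_full, stored_full = grad_store_all(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3, segment=4)
```

Both functions return the same gradient; the checkpointed version holds fewer activations at its peak and pays for it with a second forward pass over each segment.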
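Warp divergence kills performance
Warp divergence occurs when threads in the same warp take different branches: the paths execute one after another with the non-participating threads masked off, leading to idle cycles and reduced throughput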
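A rough way to see the cost is to simulate it. The sketch below assumes a simplified SIMT model with 32-thread warps in which a warp executes every branch taken by at least one of its threads, serially; real hardware scheduling is more involved, and the function name and branch costs are hypothetical.

```python
WARP_SIZE = 32

def warp_cycles(branch_of_thread, cost_of_branch, num_threads):
    """Count issue cycles under a simplified SIMT model.

    Each warp executes every distinct branch taken by at least one of its
    threads, one after another; threads not on that branch sit idle.
    """
    cycles = 0
    for warp_start in range(0, num_threads, WARP_SIZE):
        tids = range(warp_start, min(warp_start + WARP_SIZE, num_threads))
        branches_taken = {branch_of_thread(t) for t in tids}
        cycles += sum(cost_of_branch[b] for b in branches_taken)
    return cycles

cost = {"a": 10, "b": 10}

# Divergent: even and odd lanes of the same warp take different branches.
divergent = warp_cycles(lambda t: "a" if t % 2 == 0 else "b", cost, 128)

# Uniform: the branch depends on the warp index, so each warp takes one path.
uniform = warp_cycles(
    lambda t: "a" if (t // WARP_SIZE) % 2 == 0 else "b", cost, 128
)
```

The same number of threads takes each branch in both cases, yet the divergent layout costs twice as many cycles in this model because every warp has to run both paths.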