Structured pruning removes entire filters or attention heads, not individual weights
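
A minimal sketch of the idea, assuming filters are ranked by L2 norm (a common magnitude criterion, not something the card specifies). The function name `prune_filters` and the weight values are made up for illustration. The point is that whole rows are dropped, so the result is a smaller dense matrix rather than a same-sized matrix with scattered zeros.

```python
import math

def prune_filters(weights, keep):
    """Keep the `keep` filters (rows) with the largest L2 norm.

    `weights` is a list of filters; each filter is a flat list of floats.
    Returns (pruned_weights, kept_indices).
    """
    norms = [math.sqrt(sum(w * w for w in f)) for f in weights]
    # Rank filters by norm, largest first, then restore the original order.
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [weights[i] for i in kept], kept

# Four hypothetical filters with three weights each.
filters = [
    [0.9, -1.2, 0.4],
    [0.01, 0.02, -0.01],
    [-0.7, 0.8, 1.1],
    [0.05, -0.03, 0.02],
]
pruned, kept = prune_filters(filters, keep=2)
print(kept, len(pruned), len(pruned[0]))
```

In a real network, the next layer's input channels would have to be trimmed to match the removed filters; that step is left out here.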

Related concepts

Normalize features when they have different scales and you use distance-based methods (such as k-nearest neighbors or k-means), so that no single large-scale feature dominates the distance.
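
A small illustration, assuming z-score standardization (one of several possible scalings) and made-up income/age values. The helper names `standardize`, `euclidean` and `nearest` are hypothetical. Before scaling, income decides the nearest neighbor almost alone; after scaling, age counts too.

```python
import math

def standardize(rows):
    """Z-score each column: subtract the mean, divide by the standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(rows, i):
    """Index of the row closest to rows[i], excluding i itself."""
    others = (j for j in range(len(rows)) if j != i)
    return min(others, key=lambda j: euclidean(rows[i], rows[j]))

# Columns: income in dollars, age in years (made-up values).
data = [
    [30000.0, 25.0],
    [31000.0, 60.0],
    [35000.0, 26.0],
    [90000.0, 45.0],
]

raw_neighbor = nearest(data, 0)                  # driven by income alone
scaled_neighbor = nearest(standardize(data), 0)  # both features contribute
print(raw_neighbor, scaled_neighbor)
```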

Pre-LN

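Pre-LN: LayerNorm is applied before each sublayer (attention or feed-forward), inside the residual branch; Post-LN: LayerNorm is applied after the sublayer output is added back to the residual.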
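
The two orderings can be sketched on a single vector. This assumes a LayerNorm without learnable gain and bias, and `toy_sublayer` is a hypothetical stand-in for attention or a feed-forward layer; only the placement of the normalization is the point.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or a feed-forward layer.
    return [2.0 * v + 1.0 for v in x]

def pre_ln_block(x, sublayer):
    # Normalize the sublayer input; the residual path is left untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln_block(x, sublayer):
    # Add the residual first, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 4.0, 9.0]
pre = pre_ln_block(x, toy_sublayer)
post = post_ln_block(x, toy_sublayer)
print(pre)
print(post)
```

In the Pre-LN output the original `x` passes through unnormalized on the residual path, while the Post-LN output is itself normalized.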

Attention (machine learning)

FlashAttention speeds up attention by computing it tile by tile over the input, so the full N×N attention matrix is never materialized in memory.
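
The tiling relies on a running ("online") softmax. Below is a rough sketch for a single query in plain Python, with made-up vectors and hypothetical function names; it omits the usual 1/sqrt(d) scaling and says nothing about the GPU memory handling that the real kernel depends on. It only shows that visiting keys and values block by block gives the same answer as computing all scores at once.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_row(q, keys, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, values)) / total for d in range(dim)]

def tiled_attention_row(q, keys, values, block=2):
    """Same result, but only one block of scores exists at any time."""
    dim = len(values[0])
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the previous maximum.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

full = attention_row(q, keys, values)
tiled = tiled_attention_row(q, keys, values, block=2)
print(full, tiled)
```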

Bloom filter

A Bloom filter tests whether an element is possibly in a set or definitely not in it: false positives can occur, but false negatives cannot.
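
A toy version, assuming a plain list of booleans as the bit array and salted SHA-256 digests as the hash functions (both are illustrative choices, as are the class and method names). Anything that was added always answers "possibly present"; an item that was never added usually answers "no", but can occasionally collide.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive several bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
words = ["pruning", "attention", "layernorm"]
for w in words:
    bf.add(w)

print([bf.might_contain(w) for w in words])
```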

Soft targets carry more information than hard labels because they encode class similarities
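
A small sketch with made-up teacher logits. The temperature parameter is an assumption borrowed from common knowledge-distillation practice, not something the card states. The hard label only says "cat"; the soft targets also say that "dog" is a far more plausible confusion than "truck".

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher means softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "truck"]
teacher_logits = [5.0, 4.0, -2.0]  # hypothetical teacher outputs for a cat image

hard_label = [1.0, 0.0, 0.0]
soft_targets = softmax(teacher_logits, temperature=2.0)

for name, hard, soft in zip(classes, hard_label, soft_targets):
    print(f"{name}: hard={hard:.0f} soft={soft:.3f}")
```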

Attention Is All You Need

O(n) complexity for long sequences
