Structured pruning removes entire filters or attention heads, not individual weights
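As a rough illustration of dropping whole filters rather than single weights, the sketch below ranks filters by L1 norm and keeps the largest ones. The L1-norm criterion, the `prune_filters` helper, and the toy weights are all assumptions made for this example; the card itself does not name a ranking rule.

```python
def prune_filters(filters, keep_ratio):
    """Drop whole filters with the smallest L1 norm (one possible criterion)."""
    norms = [sum(abs(w) for w in f) for f in filters]
    n_keep = max(1, round(len(filters) * keep_ratio))
    # Rank filter indices by norm, keep the top n_keep, restore original order.
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [filters[i] for i in keep], keep

# Hypothetical layer with four flattened filters.
filters = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [0.5, 0.5, -0.5],
    [0.0, 0.03, 0.0],
]
pruned, kept = prune_filters(filters, keep_ratio=0.5)
print(kept)
```

Because entire rows disappear, the layer becomes genuinely smaller, which is what lets dense hardware benefit without sparse kernels.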
Normalize features when they have different scales and you use a distance-based method such as k-nearest neighbors or k-means
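To see why scale matters for distances, here is a small sketch using z-score standardization. The data (age in years, income in dollars) and the helper names `standardize` and `nearest` are made up for illustration; other scalers such as min-max would serve the same purpose.

```python
import math

def standardize(rows):
    """Rescale each column to zero mean and unit variance (z-score)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def nearest(rows, query_index):
    """Index of the row closest to rows[query_index] by Euclidean distance."""
    q = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(q, rows[i]))

# Hypothetical rows: (age in years, income in dollars).
rows = [[25, 50_000], [60, 51_000], [26, 54_000], [45, 90_000]]

raw_nearest = nearest(rows, 0)                  # income dominates the distance
scaled_nearest = nearest(standardize(rows), 0)  # both features contribute
print(raw_nearest, scaled_nearest)
```

On the raw data the 60-year-old is the nearest neighbor of the 25-year-old because income differences swamp age; after standardization the 26-year-old is nearest.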
Pre-LN
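Pre-LN: LayerNorm is applied to the sublayer input, before attention and inside the residual branch; Post-LN: LayerNorm is applied after attention and the residual addition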
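The difference in ordering can be sketched on a single vector. This is a simplified illustration: the learnable gain and bias of LayerNorm are omitted, and `toy_sublayer` is a hypothetical stand-in for attention, not a real attention implementation.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_block(x, sublayer):
    # Pre-LN: x + Sublayer(LayerNorm(x)); the residual path is never normalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln_block(x, sublayer):
    # Post-LN: LayerNorm(x + Sublayer(x)); normalization follows the residual add.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def toy_sublayer(v):
    # Hypothetical stand-in for attention: scales its input by 0.5.
    return [0.5 * t for t in v]

x = [1.0, 2.0, 3.0, 6.0]
pre_out = pre_ln_block(x, toy_sublayer)
post_out = post_ln_block(x, toy_sublayer)
print(pre_out, post_out)
```

In the Pre-LN output the original mean of `x` survives because the skip connection bypasses the normalization, whereas the Post-LN output is re-centered at zero.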
Attention (machine learning)
FlashAttention speeds up exact attention by computing it in tiles over the input, so the full N×N score matrix is never materialized in GPU high-bandwidth memory
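The tiling idea rests on a running ("online") softmax, which can be sketched for a single query in plain Python. This is only a toy under stated assumptions: the real kernel also tiles queries and runs in on-chip GPU memory, the 1/sqrt(d) scaling is left out, and the function names, block size, and data are invented for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    """Reference: compute all scores at once, then softmax and average."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]

def tiled_attention(q, keys, values, block=2):
    """Process keys/values block by block, keeping only running statistics."""
    m = -math.inf                  # running maximum score
    z = 0.0                        # running softmax denominator
    acc = [0.0] * len(values[0])   # running weighted sum of values
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        m_new = max(m, max(scores))
        rescale = math.exp(m - m_new)  # 0.0 on the first block
        z *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]

q = [0.2, -0.1]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block=2)
print(full, tiled)
```

Both functions return the same result up to floating-point error, but the tiled version never holds more than one block of scores at a time.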
Bloom filter
A Bloom filter tests set membership probabilistically: it may report false positives ("possibly in the set") but never false negatives ("definitely not in the set" is always correct)
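A minimal Bloom filter might look like the following sketch. The bit-array size, the number of hashes, and the choice of seeded SHA-256 as the hash family are assumptions picked for readability, not tuned values.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["pruning", "attention", "layernorm"]:
    bf.add(word)

print(bf.might_contain("attention"))
```

Every added item sets all of its bits, so a lookup for it can never fail; an item that was never added can still match if other items happen to have set the same bits.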
Soft targets carry more information than hard labels because they encode class similarities
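A small numeric sketch makes the contrast with hard labels concrete. The class names, the teacher logits, and the temperature of 2.0 are hypothetical values chosen for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["cat", "dog", "truck"]
teacher_logits = [5.0, 3.5, -2.0]  # hypothetical teacher outputs for a cat image

hard_target = [1.0, 0.0, 0.0]      # says only "cat"
soft_target = softmax(teacher_logits, temperature=2.0)

for name, p in zip(labels, soft_target):
    print(f"{name}: {p:.3f}")
```

The hard label treats "dog" and "truck" as equally wrong, while the soft target assigns "dog" far more probability than "truck", which is the class-similarity signal a student model can learn from.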
Attention Is All You Need
O(n) complexity for long sequences