
Soft targets carry more information than hard labels because they encode class similarities
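A minimal sketch of the difference, assuming made-up teacher logits for three classes and a temperature of 2.0 (both chosen only for illustration). The one-hot label keeps only the winning class, while the softened distribution also shows that "dog" is far closer to "cat" than "truck" is.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax over the last axis, with temperature scaling."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical teacher logits for the classes [cat, dog, truck].
teacher_logits = np.array([4.0, 3.0, -2.0])

# Hard label: one-hot vector for the top class only.
hard_label = np.eye(3)[np.argmax(teacher_logits)]

# Soft targets: a full distribution that keeps relative class similarity.
soft_targets = softmax(teacher_logits, temperature=2.0)

print("hard label:  ", hard_label)
print("soft targets:", soft_targets.round(3))
```

Raising the temperature flattens the distribution further, which exposes more of the similarity structure to a student model.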
cosine similarity is preferred over dot product when embedding magnitudes vary
Cosine similarity measures orientation, not magnitude, so it suits embeddings whose norms differ; for unit-normalized embeddings it equals the dot product
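A small sketch with two made-up vectors that point in the same direction but differ in length. The raw dot product grows with magnitude, cosine similarity does not, and after normalizing to unit length the two measures coincide.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative vectors: b has the same direction as a, ten times the length.
a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a

dot_raw = float(np.dot(a, b))      # depends on magnitude
cos_raw = cosine_similarity(a, b)  # depends on direction only

# After normalizing to unit length, the dot product is the cosine.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_unit = float(np.dot(a_unit, b_unit))

print(dot_raw, cos_raw, dot_unit)
```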
causal masking in the decoder
Causal masking prevents attention to future tokens in the decoder
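One common way to implement this, sketched here in NumPy with random scores standing in for real query-key products: set every score above the diagonal to negative infinity before the softmax, so those positions receive zero attention weight. The function name is illustrative, not taken from any library.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions in a (seq_len, seq_len) score matrix, then softmax each row."""
    seq_len = scores.shape[-1]
    # True above the diagonal: position i must not see positions j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so each row maximum is finite.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)  # exp(-inf) evaluates to 0
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # placeholder attention scores
weights = causal_attention_weights(scores)
print(weights.round(3))
```

Each row of `weights` sums to 1, and the first token can only attend to itself.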
mean pooling often outperforms [CLS] for sentence similarity tasks
Mean pooling averages all token embeddings, which often captures overall sentence meaning better than the [CLS] embedding alone
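A sketch of the two pooling strategies, assuming token embeddings of shape (batch, seq_len, hidden) and an attention mask with 1 for real tokens and 0 for padding. The function names and the toy values are hypothetical. The mask matters: padding vectors must not be averaged in.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(float)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)  # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

def cls_pool(token_embeddings):
    """Use only the first token's vector (the [CLS] position)."""
    return token_embeddings[:, 0, :]

# Toy batch: one sentence, three positions, the last one is padding.
tokens = np.array([[[1.0, 1.0], [3.0, 5.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

sentence_mean = mean_pool(tokens, mask)
sentence_cls = cls_pool(tokens)
print(sentence_mean, sentence_cls)
```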
the vocabulary size matters: larger vocab = shorter sequences but more parameters
A larger vocabulary shortens tokenized sequences but adds parameters, because the embedding and output layers grow with vocabulary size
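Both sides of the trade-off can be sketched with simple arithmetic. The vocabulary sizes and hidden width below are illustrative assumptions, and splitting on characters versus whitespace is only a crude stand-in for small and large vocabularies, not a real tokenizer.

```python
def embedding_parameters(vocab_size, d_model, tie_output=True):
    """Parameters in the token embedding, plus the output projection if it is not tied."""
    input_embedding = vocab_size * d_model
    output_projection = 0 if tie_output else vocab_size * d_model
    return input_embedding + output_projection

# Parameter side: hypothetical vocabularies at a hidden width of 768.
small_vocab = embedding_parameters(32_000, 768)
large_vocab = embedding_parameters(128_000, 768)

# Sequence-length side: finer-grained units mean more tokens for the same text.
text = "the cat sat on the mat"
char_tokens = list(text)     # tiny vocabulary, long sequence
word_tokens = text.split()   # larger vocabulary, short sequence

print(small_vocab, large_vocab)
print(len(char_tokens), len(word_tokens))
```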
batch size affects generalization: larger batches tend to find sharper minima
Larger batches give more accurate gradient estimates but tend to converge to sharper minima, which often generalize worse; the gradient noise of smaller batches is thought to favor flatter minima
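The gradient-estimate part of this claim can be checked on a toy problem. The sketch below uses synthetic linear-regression data (made up for illustration) and measures how much minibatch gradients scatter at different batch sizes. It does not measure sharpness or generalization, only the noise level of the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data, invented for this demonstration.
n, d = 10_000, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + rng.normal(scale=0.5, size=n)

w = np.zeros(d)  # evaluate all gradients at the same arbitrary point

def minibatch_gradient(batch_size):
    """Gradient of mean squared error on one random minibatch."""
    idx = rng.choice(n, size=batch_size, replace=False)
    residual = X[idx] @ w - y[idx]
    return 2.0 * X[idx].T @ residual / batch_size

def gradient_noise(batch_size, trials=200):
    """Average per-coordinate standard deviation across repeated minibatches."""
    grads = np.stack([minibatch_gradient(batch_size) for _ in range(trials)])
    return float(grads.std(axis=0).mean())

noise = {b: gradient_noise(b) for b in (8, 64, 512)}
for b, value in noise.items():
    print(f"batch size {b:>3}: gradient noise {value:.3f}")
```

The noise shrinks as the batch grows, which is the sense in which large batches give more accurate estimates and small batches inject more randomness into training.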
autoencoders learn the data manifold
Autoencoders learn a compact representation of the data manifold by forcing information through a bottleneck layer
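A minimal sketch of the bottleneck idea, under two loud assumptions: the data is synthetic (3-D points lying near a 1-D line), and the autoencoder is linear with tied weights, solved in closed form with an SVD rather than trained by gradient descent. A practical autoencoder would usually be nonlinear and trained iteratively.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 3-D points close to a 1-D line, a very simple "manifold".
t = rng.normal(size=(500, 1))
direction = np.array([[1.0, 2.0, -1.0]])
X = t @ direction + rng.normal(scale=0.01, size=(500, 3))

mean = X.mean(axis=0)
Xc = X - mean

# For a linear autoencoder, the best k-unit bottleneck spans the top-k
# principal directions, so the weights can be read from the SVD.
k = 1
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
W = vt[:k].T  # shape (3, k): encoder weights, transposed for the decoder

def encode(x):
    """Project 3-D inputs down to the k-dimensional bottleneck."""
    return (x - mean) @ W

def decode(z):
    """Map bottleneck codes back to 3-D."""
    return z @ W.T + mean

codes = encode(X)
reconstruction = decode(codes)
mse = float(np.mean((X - reconstruction) ** 2))
print(codes.shape, mse)
```

A single bottleneck unit reconstructs the data almost perfectly here because the points really do lie near a one-dimensional structure.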