A larger vocabulary shortens token sequences but enlarges the embedding and output layers, adding parameters
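A minimal sketch of that trade-off, assuming a toy greedy longest-match tokenizer (the function `greedy_tokenize`, the example text, the added subwords, and `d_model = 4` are all hypothetical and chosen only for illustration). Adding subword entries to the vocabulary shortens the token sequence while the embedding table grows:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization; assumes every character of text is in vocab."""
    tokens, i = [], 0
    max_len = max(len(entry) for entry in vocab)
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens

text = "lower lowest"
small_vocab = set(text)                            # characters only
large_vocab = small_vocab | {"low", "er", "est"}   # plus a few subwords

small_tokens = greedy_tokenize(text, small_vocab)
large_tokens = greedy_tokenize(text, large_vocab)

d_model = 4  # hypothetical embedding width
small_params = len(small_vocab) * d_model  # embedding rows * width
large_params = len(large_vocab) * d_model
```

Here the larger vocabulary produces fewer tokens for the same text, at the cost of more embedding parameters.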
What weight tying does in language models: shares the input embedding and output projection matrices
Weight tying reduces the parameter count by using one matrix for both the input embedding and the output projection
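As a rough illustration, the sketch below uses a single matrix `E` for both the embedding lookup and the output logits. The sizes, the random initialization, and the helper names `embed` and `output_logits` are assumptions made for this example, and the "hidden state" is just an embedding row standing in for a real model's output:

```python
import random

random.seed(0)
vocab_size, d_model = 5, 3  # hypothetical toy sizes

# One shared matrix E of shape (vocab_size, d_model).
E = [[random.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(token_id):
    """Input embedding: look up row token_id of E."""
    return E[token_id]

def output_logits(hidden):
    """Tied output projection: one dot product per vocabulary row of the same E."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in E]

hidden = embed(2)                # stand-in for the final hidden state
logits = output_logits(hidden)   # one logit per vocabulary entry

tied_params = vocab_size * d_model        # E only
untied_params = 2 * vocab_size * d_model  # separate embedding and projection
```

With tying, the vocabulary-dependent weights are stored once instead of twice, so the saving grows with both vocabulary size and model width.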
Attention Is All You Need
O(n) complexity for long sequences
How batch size affects generalization: larger batches tend to find sharper minima
Larger batches give more accurate gradient estimates but tend to converge to sharper minima, which often generalize worse than the flatter minima reached with smaller, noisier batches
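The gradient-estimate part of this can be sketched with simulated data. The example below assumes per-example gradients are Gaussian around a true value (`true_grad` and `noise_std` are hypothetical), and measures how much the mini-batch average fluctuates for two batch sizes. It says nothing about minima directly; it only shows that smaller batches inject more noise into each step:

```python
import random
import statistics

random.seed(0)

def minibatch_gradient(batch_size, true_grad=1.0, noise_std=2.0):
    """Average of batch_size simulated per-example gradients."""
    samples = [random.gauss(true_grad, noise_std) for _ in range(batch_size)]
    return sum(samples) / batch_size

def gradient_noise(batch_size, trials=2000):
    """Standard deviation of the mini-batch gradient estimate across trials."""
    estimates = [minibatch_gradient(batch_size) for _ in range(trials)]
    return statistics.pstdev(estimates)

# Noise is expected to shrink roughly as noise_std / sqrt(batch_size).
noise_small_batch = gradient_noise(batch_size=4)
noise_large_batch = gradient_noise(batch_size=64)
```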
What WordPiece tokenization does: similar to BPE but uses likelihood instead of frequency
WordPiece tokenization splits words into subwords, choosing merges that most increase training-data likelihood rather than the most frequent pair
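One commonly described form of the WordPiece criterion scores a candidate pair as its count divided by the product of its parts' counts, while BPE simply takes the most frequent pair. The sketch below assumes that scoring rule and uses made-up counts (`pair_counts` and `unit_counts` are hypothetical) to show the two criteria picking different merges:

```python
# Hypothetical corpus statistics, for illustration only.
pair_counts = {("u", "n"): 20, ("q", "u"): 5}
unit_counts = {"u": 40, "n": 50, "q": 5}

def bpe_choice(pair_counts):
    """BPE-style: merge the most frequent adjacent pair."""
    return max(pair_counts, key=pair_counts.get)

def wordpiece_score(pair, pair_counts, unit_counts):
    """Likelihood-style score: count(pair) / (count(first) * count(second))."""
    first, second = pair
    return pair_counts[pair] / (unit_counts[first] * unit_counts[second])

def wordpiece_choice(pair_counts, unit_counts):
    """WordPiece-style: merge the pair with the highest score."""
    return max(pair_counts, key=lambda p: wordpiece_score(p, pair_counts, unit_counts))

bpe_merge = bpe_choice(pair_counts)
wordpiece_merge = wordpiece_choice(pair_counts, unit_counts)
```

In this toy data the frequent pair wins under BPE, while the score favors the pair whose parts rarely occur apart.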
What 384-dim all-MiniLM-L6-v2 optimizes: fast sentence similarity with 6 layers
all-MiniLM-L6-v2 is a 6-layer model that produces 384-dimensional sentence embeddings, optimized for fast sentence similarity
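Sentence similarity with embeddings like these is typically computed as cosine similarity between vectors. The sketch below shows only that comparison step; the 4-dimensional vectors are hypothetical stand-ins for real 384-dimensional embeddings, and no model is loaded:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings; a real model would produce 384 values each.
sentence_a = [1.0, 0.0, 1.0, 0.0]
sentence_b = [1.0, 0.0, 1.0, 0.0]   # same direction as sentence_a
sentence_c = [0.0, 1.0, 0.0, 1.0]   # orthogonal to sentence_a

same = cosine_similarity(sentence_a, sentence_b)
different = cosine_similarity(sentence_a, sentence_c)
```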
Why soft targets carry more information than hard labels: they encode class similarities
Soft targets carry more information than hard labels because the probabilities assigned to the non-target classes encode class similarities
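A small sketch of the idea, assuming hypothetical teacher logits for three classes and a temperature-scaled softmax as used in knowledge distillation. The hard label says only "cat", while the soft targets also show that "dog" is a far more plausible confusion than "truck":

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "truck"]
teacher_logits = [5.0, 3.0, -2.0]   # hypothetical teacher outputs for a cat image

hard_label = [1.0, 0.0, 0.0]                                  # one-hot: no similarity information
soft_targets = softmax(teacher_logits, temperature=1.0)       # dog gets more mass than truck
softer_targets = softmax(teacher_logits, temperature=4.0)     # flatter, similarities more visible
```

Raising the temperature spreads probability onto the non-target classes, which makes their relative ordering easier for a student model to learn from.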