
"trained on word co-occurrence with skip-gram window"
Image: Vadim Zhuravlev, Public domain, via Wikimedia Commons
all-MiniLM-L6-v2 is a 6-layer model that produces 384-dimensional embeddings optimized for fast sentence similarity
Vocabulary size matters: a larger vocabulary yields shorter token sequences but adds parameters to the embedding table
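The parameter side of this trade-off can be sketched with simple arithmetic. This is a minimal illustration, assuming a single embedding table of shape vocabulary size by hidden dimension; the function name and the vocabulary sizes below are hypothetical, and the sketch does not model the sequence-length side.

```python
def embedding_parameters(vocab_size: int, hidden_dim: int) -> int:
    """Parameters in one embedding table: one hidden_dim vector per vocabulary entry."""
    return vocab_size * hidden_dim


# Hypothetical vocabulary sizes, all paired with a 768-dimensional hidden size.
for vocab_size in (30_000, 50_000, 250_000):
    params = embedding_parameters(vocab_size, 768)
    print(f"vocab={vocab_size:>7,}  embedding parameters={params:,}")
```

The count grows linearly with the vocabulary, so any gain from shorter sequences has to be weighed against a larger embedding table.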
[CLS] pooling uses the first token's embedding as the sentence representation
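As a rough sketch of what this means in practice, the snippet below takes a toy matrix of per-token embeddings and keeps only the row at position 0. The token list, the 3-dimensional vectors, and the function name `cls_pool` are made up for illustration; a real encoder would produce the matrix.

```python
def cls_pool(token_embeddings):
    """Return the embedding at position 0, where BERT-style inputs place [CLS]."""
    if not token_embeddings:
        raise ValueError("expected at least one token embedding")
    return list(token_embeddings[0])


# Toy example: four tokens, each with a made-up 3-dimensional embedding.
tokens = ["[CLS]", "hello", "world", "[SEP]"]
token_embeddings = [
    [0.1, 0.2, 0.3],  # [CLS]
    [0.5, 0.1, 0.0],  # hello
    [0.2, 0.9, 0.4],  # world
    [0.0, 0.0, 0.1],  # [SEP]
]

sentence_vector = cls_pool(token_embeddings)
print(sentence_vector)
```

All other token rows are discarded, so the quality of the sentence vector depends entirely on what the model has learned to store at the [CLS] position.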
WordPiece tokenization is similar to BPE, but it chooses the merge that most increases training-data likelihood rather than the most frequent pair
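The difference in merge criteria can be illustrated with toy counts. This sketch assumes a commonly cited formulation of the WordPiece score, pair count divided by the product of the two parts' counts; the candidate pairs, their counts, and the function names are hypothetical and not taken from any real tokenizer.

```python
def bpe_score(pair_count, left_count, right_count):
    """BPE-style criterion: prefer the most frequent adjacent pair."""
    return pair_count


def wordpiece_score(pair_count, left_count, right_count):
    """Likelihood-style criterion: pair frequency relative to its parts' frequencies."""
    return pair_count / (left_count * right_count)


# Hypothetical counts: (pair count, left part count, right part count).
candidates = {
    ("t", "##h"): (50, 200, 100),
    ("q", "##u"): (10, 10, 12),
}

bpe_choice = max(candidates, key=lambda pair: bpe_score(*candidates[pair]))
wordpiece_choice = max(candidates, key=lambda pair: wordpiece_score(*candidates[pair]))
print("BPE-style merge:", bpe_choice)
print("WordPiece-style merge:", wordpiece_choice)
```

With these counts the frequency criterion picks the common pair, while the likelihood-style score favors the rarer pair whose parts almost always occur together.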
BERT-base's 768-dimensional embeddings capture bidirectional context learned through masked language modeling
Word error rate (WER) measures the word-level edit distance between predicted and reference transcriptions, divided by the number of words in the reference
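A minimal sketch of the computation, assuming whitespace tokenization and the usual normalization by the number of reference words; the function name and the example sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.3f}")
```

Here the hypothesis has one substituted word and one dropped word against a six-word reference. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.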