300-dim word2vec encodes: word co-occurrence patterns, learned by skip-gram training over a sliding context window
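
A minimal sketch of how a skip-gram window could turn a sentence into (center, context) training pairs. The function name `skipgram_pairs`, the toy sentence, and the window size are illustrative assumptions, not word2vec's actual code.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sentence; a real corpus would have millions of tokens.
pairs = skipgram_pairs(["the", "cat", "sat", "down"], window=1)
```

A wider window produces more pairs per word and tends to capture broader topical similarity, while a narrow window stays closer to syntactic neighbors.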

Related concepts

384-dim all-MiniLM-L6-v2 optimizes: fast sentence similarity with 6 layers

all-MiniLM-L6-v2 is a 6-layer sentence-embedding model that outputs 384-dimensional vectors and is optimized for fast sentence similarity

the vocabulary size matters: larger vocab = shorter sequences but more parameters

A larger vocabulary shortens tokenized sequences but adds parameters, because the embedding table grows in proportion to vocabulary size
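
The parameter side of this trade-off can be checked with simple arithmetic. This is a rough sketch: the vocabulary sizes below are illustrative round numbers, and `embedding_params` is a hypothetical helper that counts only the input embedding table.

```python
def embedding_params(vocab_size, dim):
    """Number of weights in a vocab_size x dim embedding table."""
    return vocab_size * dim


# Illustrative sizes (assumptions), both with 768-dimensional embeddings.
small_vocab = embedding_params(30_000, 768)    # 23,040,000 weights
large_vocab = embedding_params(250_000, 768)   # 192,000,000 weights
extra = large_vocab - small_vocab              # 168,960,000 additional weights
```

If the model also has an untied output projection over the vocabulary, that layer grows by the same factor.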

[CLS] pooling does: uses the first token's embedding as the sentence representation

[CLS] pooling uses the embedding of the first token, [CLS], as the representation of the whole sentence
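
A small sketch of [CLS] pooling on plain Python lists, assuming the encoder has already produced one vector per token with [CLS] in position 0. Mean pooling is included only as a contrast; the function names and the toy 2-dimensional vectors are assumptions made for illustration.

```python
def cls_pool(token_embeddings):
    """Sentence vector = embedding of the first token ([CLS])."""
    return token_embeddings[0]


def mean_pool(token_embeddings):
    """Sentence vector = element-wise average over all token embeddings."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / n for d in range(dim)]


# Toy encoder output: 3 tokens, 2 dimensions each, [CLS] first.
tokens = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0]]
sentence_cls = cls_pool(tokens)    # [1.0, 0.0]
sentence_mean = mean_pool(tokens)  # [1.0, 2.0]
```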

WordPiece tokenization does: similar to BPE but uses likelihood instead of frequency

WordPiece tokenization builds a subword vocabulary much like BPE, but chooses each merge by how much it increases the likelihood of the training data rather than by raw pair frequency
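
The difference in merge selection can be sketched with toy counts. The WordPiece score used here (pair count divided by the product of the two parts' counts) is one commonly described formulation and should be read as an assumption, not a reference implementation; all counts and function names are made up for illustration.

```python
# Hypothetical counts from a toy corpus.
pair_counts = {("u", "n"): 20, ("q", "u"): 5}
symbol_counts = {"u": 40, "n": 30, "q": 5}


def bpe_choice(pair_counts):
    """BPE-style selection: merge the most frequent adjacent pair."""
    return max(pair_counts, key=pair_counts.get)


def wordpiece_score(pair, pair_counts, symbol_counts):
    """Likelihood-style score: count(ab) / (count(a) * count(b))."""
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])


def wordpiece_choice(pair_counts, symbol_counts):
    """WordPiece-style selection: merge the pair with the highest score."""
    return max(
        pair_counts,
        key=lambda p: wordpiece_score(p, pair_counts, symbol_counts),
    )


bpe_pick = bpe_choice(pair_counts)                     # ("u", "n")
wp_pick = wordpiece_choice(pair_counts, symbol_counts)  # ("q", "u")
```

In this toy case the frequency rule prefers the common pair "un", while the likelihood-style score prefers "qu" because "q" almost never appears without "u".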

768-dim BERT embeddings capture: bidirectional context from masked language modeling

768-dim BERT embeddings capture bidirectional context from masked language modeling

word error rate (WER) measures: edit distance between predicted and reference transcriptions

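Word Error Rate (WER) measures the word-level edit distance (substitutions, deletions, and insertions) between the predicted and reference transcriptions, divided by the number of words in the reference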
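
A minimal WER sketch using the standard dynamic-programming edit distance over words. It assumes whitespace tokenization and no text normalization (casing, punctuation), which real evaluation pipelines usually add; the function name `wer` and the example sentences are illustrative.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the): 2 errors over 6 words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.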
