768-dim BERT embeddings capture bidirectional context from masked language modeling
Masked language modeling trains BERT to predict a hidden token from both its left and right context, so each 768-dim token vector (BERT-base) encodes bidirectional context
Weight tying in language models: shares the embedding and output projection matrices
Tying reduces the parameter count by using one matrix for both the input embedding and the output projection
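To make the saving concrete, here is a rough parameter count. It assumes a BERT-base-sized vocabulary of 30,522 tokens and a 768-dim hidden size purely for illustration; the function name is hypothetical and bias terms are ignored.

```python
def embedding_and_output_params(vocab_size, d_model, tied):
    """Count weights in the input embedding plus the output projection."""
    embedding = vocab_size * d_model   # input embedding matrix: vocab x d_model
    output = vocab_size * d_model      # output projection: d_model x vocab
    # With tying, the output projection reuses the embedding matrix (transposed),
    # so it contributes no additional parameters.
    return embedding if tied else embedding + output


vocab_size, d_model = 30522, 768  # assumed sizes, for illustration only
untied = embedding_and_output_params(vocab_size, d_model, tied=False)
tied = embedding_and_output_params(vocab_size, d_model, tied=True)
print(f"untied: {untied:,}  tied: {tied:,}  saved: {untied - tied:,}")
```

Under these assumed sizes, tying removes one full vocab-by-hidden matrix, which halves the weights spent on the embedding and output layers.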
cosine similarity is preferred over dot product for comparing embeddings
Cosine similarity measures orientation, not magnitude, so vector length does not distort the score; for embeddings already normalized to unit length, it equals the dot product
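A small sketch of the difference, using made-up 2-dim vectors rather than real embeddings. The helper names are illustrative. It shows that the dot product rewards length while cosine does not, and that the two agree once vectors are L2-normalized.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Dot product divided by the product of the two vector lengths.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a = [3.0, 4.0]
b = [30.0, 40.0]   # same direction as a, ten times the length
c = [4.0, 3.0]     # different direction from a

# Raw dot product is dominated by the length of b.
print(dot(a, b), dot(a, c))
# Cosine ignores length: a and b point the same way, so they score highest.
print(cosine(a, b), cosine(a, c))
# After normalization, the dot product and cosine give the same score.
print(dot(l2_normalize(a), l2_normalize(c)))
```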
384-dim all-MiniLM-L6-v2: optimized for fast sentence similarity with 6 layers
all-MiniLM-L6-v2 is a compact 6-layer model that produces 384-dim sentence embeddings, optimized for fast sentence similarity
ALiBi allows length extrapolation better than learned position embeddings
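ALiBi adds no position embeddings; instead it subtracts a penalty proportional to the query-key distance from each attention score, with a fixed slope per head, so the same rule applies to sequences longer than those seen in training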
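A minimal sketch of the bias ALiBi adds to attention scores, assuming causal attention and a power-of-two head count (the slope formula below is the geometric sequence used for that case). The function names are illustrative, not taken from any library.

```python
def alibi_slopes(n_heads):
    """Per-head slopes as a geometric sequence; for 8 heads: 1/2, 1/4, ..., 1/256."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(seq_len, slope):
    """Bias for one head, added to the attention scores before softmax."""
    bias = []
    for i in range(seq_len):              # query position
        row = []
        for j in range(seq_len):          # key position
            if j <= i:
                # Penalty grows linearly with distance; zero for the current token.
                row.append(slope * (j - i))
            else:
                # Future keys are masked out in causal attention.
                row.append(float("-inf"))
        bias.append(row)
    return bias


slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
for row in bias:
    print(row)
```

Because the bias depends only on the distance `i - j`, `alibi_bias` can be called with any `seq_len`, including lengths longer than those used in training, without any new parameters.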
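3072-dim OpenAI text-embedding-3-large: used for semantic search and RAG
Used for semantic search and retrieval-augmented generation (RAG), where retrieved passages supply a language model with relevant context. Its default output is 3072-dim; 1536-dim is the default for text-embedding-3-small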
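The retrieval step shared by semantic search and RAG can be sketched as a nearest-neighbor lookup over embeddings. The 3-dim vectors and document ids below are hypothetical stand-ins for real model output, and no embedding API is called.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, doc_vecs, k=2):
    """Return the k document ids most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy vectors; a real system would store embedding-model output.
docs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-rate-limits": [0.0, 0.1, 0.9],
}
query = [0.8, 0.3, 0.0]  # pretend embedding of "how do I get my money back?"

hits = top_k(query, docs, k=2)
print(hits)
```

In a RAG setup, the text behind the returned ids would then be placed in the language model's prompt as context.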
mean pooling often outperforms [CLS] for sentence similarity tasks
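Mean pooling averages every token embedding, so it reflects the whole sentence; the [CLS] vector is a single token's output and is not trained for sentence similarity unless the model is fine-tuned for it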
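A minimal sketch of masked mean pooling next to [CLS] pooling, using toy 2-dim token vectors in place of real model output. It assumes at least one unmasked token per sentence; the function name is illustrative.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + x for t, x in zip(total, vec)]
            count += 1
    return [t / count for t in total]


# Toy sequence: [CLS], two word tokens, then one padding token.
tokens = [[1.0, 0.0], [2.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

sentence_vec = mean_pool(tokens, mask)  # uses all three real tokens
cls_vec = tokens[0]                     # [CLS] pooling keeps only the first vector
print(sentence_vec, cls_vec)
```

The attention mask matters here: averaging without it would let the padding vector pull the sentence embedding away from the real tokens.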