CLIP embeds images and text into a shared space using contrastive learning
CLIP trains an image encoder and a text encoder together with a contrastive objective: matching image-caption pairs are pulled toward high similarity, and mismatched pairs are pushed apart. Because both encoders map into the same embedding space, an image and a piece of text can be compared directly, which supports cross-modal retrieval, ranking, and text conditioning for image generation.
Example
In cross-modal retrieval, CLIP can match an image of a dog with the text "a dog," demonstrating its effectiveness in bridging visual and textual data.
Understanding CLIP's shared embedding space is crucial for developing advanced cross-modal applications.
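The dog example above can be sketched with toy vectors. This is a minimal illustration, not CLIP's actual implementation: the 3-dimensional embeddings, the function names, and the fixed temperature of 0.07 are assumptions made for the example (real CLIP embeddings come from trained encoders, and the temperature is a learned parameter).

```python
import math

def normalize(v):
    # Scale a vector to unit L2 length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs):
    # Entry [i][j] is the cosine similarity between image i and text j.
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy_rows(logits, targets):
    # Mean cross-entropy over rows, using a numerically stable log-sum-exp.
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[target]
    return total / len(logits)

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    # Symmetric contrastive loss: pair i is (image i, text i).
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    targets = list(range(len(logits)))
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_rows(logits, targets)
                  + cross_entropy_rows(logits_t, targets))

# Hypothetical toy embeddings standing in for encoder outputs.
image_embs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2]]  # "dog photo", "cat photo"
text_embs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1]]   # "a dog", "a cat"

sims = similarity_matrix(image_embs, text_embs)
# Retrieval: for each image, pick the index of the most similar text.
best_text_for_image = [max(range(len(row)), key=row.__getitem__) for row in sims]

aligned_loss = clip_style_loss(image_embs, text_embs)
shuffled_loss = clip_style_loss(image_embs, text_embs[::-1])
```

With these vectors the dog image retrieves "a dog" and the cat image retrieves "a cat", and the loss is lower for the correctly paired batch than for the batch with captions swapped.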
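Cosine similarity vs. dot product for normalized embeddings
Cosine similarity measures orientation, not magnitude. For L2-normalized embeddings it equals the dot product, so the two rank results identically. The distinction matters when embeddings are not normalized: there, cosine similarity keeps vector length from affecting the score.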
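A small worked example makes the relationship concrete. The vectors and helper names below are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product divided by the product of the two L2 norms.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length
c = [4.0, 3.0]  # different direction, same length as a

raw_dot_ab = dot(a, b)            # grows with the length of b
raw_dot_ac = dot(a, c)
cos_ab = cosine_similarity(a, b)  # unaffected by the length of b
cos_ac = cosine_similarity(a, c)

# After L2 normalization, the plain dot product is the cosine similarity.
unit_dot_ac = dot(l2_normalize(a), l2_normalize(c))
```

Here the raw dot product of `a` and `b` is 50 while their cosine similarity is 1, because `b` only differs from `a` in length. Once vectors are unit-normalized, `unit_dot_ac` and `cos_ac` agree.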
ALiBi allows length extrapolation better than learned position embeddings
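ALiBi adds no position embeddings to the input. Instead, it adds a fixed, head-specific penalty to each attention score that grows linearly with the distance between query and key. Because the bias depends only on relative distance and there is no learned table tied to a maximum length, the model can be run on sequences longer than those seen in training.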
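The bias itself is simple to construct. The sketch below assumes a causal (left-to-right) model and a power-of-two number of heads, for which the ALiBi paper uses a geometric sequence of slopes; the function names are illustrative and not taken from any library.

```python
def alibi_slopes(num_heads):
    # Geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads
    # (the scheme described for power-of-two head counts).
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    # bias[i][j] = -slope * (i - j) for keys at or before the query position;
    # future keys are masked out with -inf (causal attention).
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])

# The same function serves any length; nothing is learned or stored per position.
long_bias_row = alibi_bias(16, slopes[0])[15]
```

The bias matrix is added to the query-key scores before the softmax. Nearby keys receive a small penalty and distant keys a larger one, and each head uses a different slope.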
Stable Diffusion
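Stable Diffusion generates images from text descriptions by iteratively denoising a latent representation, conditioned on a text encoder's embedding of the prompt (a CLIP text encoder in the original release).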
[CLS] pooling
[CLS] pooling uses the embedding of the first token, [CLS], as the sentence representation.
mean pooling often outperforms [CLS] for sentence similarity tasks
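Mean pooling averages the embeddings of all non-padding tokens, so every token contributes to the sentence vector. The [CLS] embedding, by contrast, is a single vector that only summarizes the sentence well if the model was trained to use it that way.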
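The two pooling strategies differ only in which token vectors they use. The sketch below works on a hand-written 2-dimensional example; the token embeddings and the attention mask are made up, standing in for the last hidden states of an encoder.

```python
def cls_pool(token_embs):
    # Sentence vector = embedding of the first token ([CLS]).
    return list(token_embs[0])

def mean_pool(token_embs, attention_mask):
    # Sentence vector = average over tokens whose mask is 1 (padding excluded).
    dim = len(token_embs[0])
    kept = [emb for emb, m in zip(token_embs, attention_mask) if m == 1]
    return [sum(emb[d] for emb in kept) / len(kept) for d in range(dim)]

# Hypothetical per-token outputs for one padded sentence.
token_embs = [
    [0.0, 1.0],  # [CLS]
    [2.0, 0.0],  # "good"
    [4.0, 2.0],  # "dog"
    [9.0, 9.0],  # [PAD], must not influence the mean
]
attention_mask = [1, 1, 1, 0]

cls_vec = cls_pool(token_embs)
mean_vec = mean_pool(token_embs, attention_mask)
```

Applying the attention mask matters in practice: averaging over padding positions would pull the sentence vector toward whatever values the padding tokens happen to hold.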
768-dim BERT embeddings
BERT-base produces a 768-dimensional embedding per token. Each embedding captures bidirectional context, learned through masked language modeling.