CLIP embeds images and text into a shared space using contrastive learning

Contrastive Language–Image Pre-training

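CLIP (Contrastive Language–Image Pre-training) trains an image encoder and a text encoder jointly on image-caption pairs. The contrastive objective pulls the embeddings of matching pairs together and pushes mismatched pairs apart, so both modalities land in one shared embedding space. Because similarity in that space is meaningful across modalities, CLIP embeddings can be used for cross-modal retrieval, for ranking candidates, and for conditioning text-to-image generation.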
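
A minimal sketch of a CLIP-style symmetric contrastive loss may make the objective concrete. It assumes a batch where image k and text k form the matching pair, uses plain Python lists instead of real encoder outputs, and fixes the temperature at 0.07 (CLIP learns this value during training); the function names are illustrative, not part of any library.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    # Numerically stable -log(softmax(logits)[target]).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    # Assumption: image_embs[k] and text_embs[k] are a matching pair.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    # Image-to-text: each row should pick out its own column.
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image: each column should pick out its own row.
    loss_t2i = sum(cross_entropy_row([logits[r][k] for r in range(n)], k)
                   for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy check: aligned pairs give a much lower loss than swapped pairs.
aligned = clip_style_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = clip_style_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

The loss is low when each image is most similar to its own caption and high when the pairing is wrong, which is the pressure that shapes the shared space.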

Example

In cross-modal retrieval, CLIP embeds an image of a dog and the text "a dog" close together in the shared space, so ranking candidate captions by similarity to the image returns "a dog" first.
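
The retrieval step in this example could be sketched roughly as follows. The three-dimensional vectors are hand-made stand-ins, not real CLIP outputs, and `rank_captions` is a hypothetical helper name; a real system would obtain the embeddings from CLIP's image and text encoders.

```python
import math

def normalize(v):
    # Unit-normalize so the dot product below is a cosine similarity.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_captions(image_emb, caption_embs):
    # Score every caption against the image and sort best-first.
    img = normalize(image_emb)
    scored = []
    for caption, emb in caption_embs.items():
        txt = normalize(emb)
        score = sum(a * b for a, b in zip(img, txt))
        scored.append((score, caption))
    return sorted(scored, reverse=True)

# Hypothetical embeddings (assumed values for illustration only).
dog_image = [0.9, 0.1, 0.2]
captions = {
    "a dog": [0.8, 0.2, 0.1],
    "a cat": [0.1, 0.9, 0.2],
    "a car": [0.1, 0.2, 0.9],
}

ranking = rank_captions(dog_image, captions)
best_score, best_caption = ranking[0]
```

With these toy values the caption "a dog" ranks first because its vector points in nearly the same direction as the image vector.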

The shared embedding space is what makes these cross-modal applications possible: images and text can be compared directly, without a separate model for each pairing.

Related concepts

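Cosine similarity and dot product are equivalent for normalized embeddings

Cosine similarity measures orientation, not magnitude. For unit-normalized embeddings it equals the dot product, so the cheaper dot product gives the same ranking; the two differ only when vector lengths vary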
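
A small sketch of how cosine similarity and the dot product relate, using made-up two-dimensional vectors: the raw dot product grows with vector length, cosine similarity does not, and once both vectors are unit-normalized the two measures coincide.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    # Dot product divided by both lengths: depends only on direction.
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the magnitude

raw_dot = dot(a, b)                         # affected by magnitude
cos_ab = cosine(a, b)                       # ignores magnitude
unit_dot = dot(normalize(a), normalize(b))  # matches cos_ab up to rounding
```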

ALiBi allows length extrapolation better than learned position embeddings

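ALiBi adds a linear bias to each attention score, proportional to the distance between query and key, instead of adding position embeddings to the input. Because the bias depends only on relative distance, the model handles sequences longer than those seen in training better than one with learned position embeddings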
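
The bias ALiBi adds to attention scores can be sketched as below. This assumes a causal (left-to-right) model and a power-of-two head count, for which the per-head slopes form the geometric sequence starting at 2^(-8/n); the function names are illustrative.

```python
def alibi_slopes(num_heads):
    # Assumes num_heads is a power of two: slopes are 2^(-8/n), 2^(-16/n), ...
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    # bias[i][j] = -slope * (i - j) for keys at or before the query position;
    # future positions are masked out with -inf (causal attention assumed).
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])  # added to the attention scores before softmax
```

Nothing in the bias depends on a maximum sequence length, so the same function works for any `seq_len` at inference time.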

Stable Diffusion

Stable Diffusion generates images from text descriptions

[CLS] pooling

[CLS] pooling uses the first token's embedding as the sentence representation

Mean pooling often outperforms [CLS] pooling for sentence similarity tasks

Mean pooling averages the embeddings of all tokens, which often captures overall sentence meaning better than the single [CLS] token embedding
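
The two pooling strategies can be compared with a short sketch. The token embeddings and attention mask are invented for illustration (real ones would come from a model such as BERT), and the last token is treated as padding that mean pooling should skip.

```python
def cls_pool(token_embs):
    # Use the first token's embedding (the [CLS] position) as the sentence vector.
    return list(token_embs[0])

def mean_pool(token_embs, attention_mask):
    # Average only the tokens whose mask is 1, so padding does not skew the mean.
    kept = [emb for emb, m in zip(token_embs, attention_mask) if m == 1]
    dim = len(token_embs[0])
    return [sum(emb[d] for emb in kept) / len(kept) for d in range(dim)]

# Hypothetical 2-dim token embeddings: [CLS], two word tokens, one padding token.
tokens = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

cls_vec = cls_pool(tokens)
mean_vec = mean_pool(tokens, mask)
```

[CLS] pooling reads a single position, while mean pooling blends information from every real token.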

768-dim BERT embeddings

768-dim BERT embeddings (the hidden size of BERT-base) capture bidirectional context from masked language modeling
