CLIP (Contrastive Language-Image Pre-training)

CLIP (Contrastive Language-Image Pre-training) is a multimodal model that maps text and images into a shared embedding space. It pairs a Transformer text encoder, similar in architecture to GPT-2, with an image encoder such as a Vision Transformer, and trains the two jointly with a contrastive objective so that matching image-text pairs receive similar embeddings. These image and text embeddings later served as a foundation for the DALL-E 2 text-to-image system.
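
The "contrastive" part of the name can be illustrated with a small, dependency-free sketch. It assumes the image and text embeddings have already been computed (random placeholder vectors stand in for encoder outputs here) and uses a fixed temperature of 0.07, whereas CLIP learns this value during training. The function names `l2_normalize`, `log_softmax`, and `contrastive_loss` are hypothetical, chosen for illustration only, and this is not the original training code.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an n x n image-text similarity matrix.

    Pair i (image i, text i) is treated as the only correct match for row i
    and for column i. The temperature is an assumed constant in this sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: each row should favour its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column should favour its diagonal entry.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


random.seed(0)
# Placeholder embeddings: 4 pairs, 8 dimensions, texts are noisy copies of images.
image_embs = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
text_embs = [[v + 0.01 * random.gauss(0, 1) for v in row] for row in image_embs]

aligned = contrastive_loss(image_embs, text_embs)
# Rotate the texts by one position so every image is paired with the wrong text.
mismatched = contrastive_loss(image_embs, text_embs[1:] + text_embs[:1])
print(f"aligned loss: {aligned:.4f}, mismatched loss: {mismatched:.4f}")
```

The loss is low when each image embedding is closest to its own text embedding and high when the pairs are scrambled, which is the signal that pulls the two encoders toward a shared space.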

Updated 2026-05-15

Tags

D2L

Dive into Deep Learning @ D2L
