Introducing X-CLIP: Microsoft Extends CLIP to Video


Last year, OpenAI's CLIP made waves in the AI community with its ability to bridge language and images, trained on a dataset of 400 million (image, text) pairs. The model, composed of a vision encoder and a text encoder, showed impressive zero-shot capabilities and helped fuel the rise of AI art: Stable Diffusion by Stability AI, for example, uses CLIP's text encoder to condition image generation.

Now Microsoft has introduced X-CLIP, available in Hugging Face Transformers. X-CLIP is a minimal extension of CLIP for general video-language understanding: starting from CLIP's pre-trained weights, the authors fine-tuned the model for video comprehension tasks. The result sets a new state of the art, reaching 87.1% top-1 accuracy on Kinetics-400, the widely used video classification benchmark from DeepMind, while remaining strong in zero-shot and few-shot settings.

The work will be presented as an oral paper at ECCV 2022. X-CLIP marks a significant step for video-language models, opening new possibilities for understanding and interpreting multimedia content.
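Because the model ships with Hugging Face Transformers, zero-shot video classification can be tried in a few lines. The sketch below assumes the microsoft/xclip-base-patch32 checkpoint (which samples 8 frames per clip) and uses random pixels as a stand-in for a real video; swap in frames extracted from an actual clip to get meaningful scores.

```python
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

# Assumed checkpoint: the base patch-32 X-CLIP model trained on Kinetics-400.
checkpoint = "microsoft/xclip-base-patch32"
processor = XCLIPProcessor.from_pretrained(checkpoint)
model = XCLIPModel.from_pretrained(checkpoint)

# Stand-in for a real video: 8 RGB frames of size 224x224.
# In practice these would be frames sampled from an actual clip.
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))

candidate_labels = ["playing guitar", "riding a bike", "cooking"]
inputs = processor(text=candidate_labels, videos=video,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Video-text similarity scores, turned into probabilities over the labels.
probs = outputs.logits_per_video.softmax(dim=1)
for label, p in zip(candidate_labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The design mirrors CLIP itself: the video and each candidate text are embedded into a shared space, and the softmax over their similarity scores gives a zero-shot label distribution without any task-specific training.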