Meta's DINOv2 Joins Transformers
Meta, the company behind Facebook and Instagram, has introduced DINOv2, a computer vision model trained on 142 million images and now integrated into the 🤗 Transformers library. Built on the Vision Transformer (ViT) architecture pioneered by Google researchers, DINOv2 is one of the most robust vision backbones available today. Notably, its training is entirely self-supervised, requiring no human-labeled data.

DINOv2 marks a step toward universal vision models whose features transfer across tasks out of the box: its frozen features can be fed directly into simple linear classifiers, with no fine-tuning of the backbone required. The release includes four checkpoints: a ViT-giant (a 40-layer Transformer) plus small, base, and large versions distilled from the frozen giant. To ease adoption, a tutorial has been published detailing how to train a linear classifier for semantic segmentation on top of DINOv2's frozen features. With DINOv2's arrival in Transformers, Meta takes a meaningful step toward making state-of-the-art vision backbones broadly accessible.
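To make the frozen-feature workflow concrete, here is a minimal sketch using the `facebook/dinov2-base` checkpoint from the Hub. The image URL and the 10-class linear head are illustrative assumptions, not part of the release; in practice only the linear layer would be trained while the backbone stays frozen.

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Illustrative example image (a standard COCO sample)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Any released checkpoint works here: facebook/dinov2-small,
# facebook/dinov2-base, facebook/dinov2-large, facebook/dinov2-giant
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():  # backbone stays frozen; no gradients needed
    outputs = model(**inputs)

# last_hidden_state has shape (batch, 1 + num_patches, hidden_size);
# index 0 is the [CLS] token, a global image representation
cls_features = outputs.last_hidden_state[:, 0]

# Hypothetical 10-class task: a single linear layer on frozen features
num_classes = 10
classifier = torch.nn.Linear(model.config.hidden_size, num_classes)
logits = classifier(cls_features)
print(logits.shape)  # torch.Size([1, 10])
```

For dense tasks like the semantic segmentation tutorial mentioned above, the per-patch tokens (`outputs.last_hidden_state[:, 1:]`) would be used instead of the `[CLS]` token.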