Chinese-CLIP-ViT-Large-Patch14-336px
Introduction
This document details the large version of the Chinese CLIP model, which utilizes ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder. The model is trained on a substantial dataset comprising approximately 200 million Chinese image-text pairs. Further technical insights are available in the technical report and the official GitHub repository.
Architecture
The model employs a dual-encoder architecture with a vision transformer (ViT-L/14@336px) for image processing and a RoBERTa-wwm-base model for text processing. These components facilitate the extraction of meaningful features from both modalities, enabling effective cross-modal understanding and retrieval tasks.
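As a minimal sketch of how the two encoders can be queried independently through the transformers API, the snippet below extracts an image embedding and a set of text embeddings; the image path and captions are placeholder assumptions.

```python
import torch
from PIL import Image
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")

# Image branch: ViT-L/14@336px yields one embedding per image.
image = Image.open("example.jpg")  # placeholder path, use your own image
image_inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_embeds = model.get_image_features(**image_inputs)

# Text branch: RoBERTa-wwm-base yields one embedding per caption.
texts = ["一只猫的照片", "一只狗的照片"]  # placeholder captions
text_inputs = processor(text=texts, padding=True, return_tensors="pt")
with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)

# Both branches project into the same shared embedding space.
print(image_embeds.shape, text_embeds.shape)
```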
Training
Chinese CLIP was trained on a massive dataset of 200 million Chinese image-text pairs using a contrastive learning approach. This method aligns image and text embeddings in a shared latent space, optimizing the model's ability to perform zero-shot image classification and retrieval tasks.
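The objective can be summarized in a few lines of PyTorch. The following is an illustrative sketch of a CLIP-style symmetric contrastive loss, not the authors' training code; the temperature value is an assumed placeholder.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products become cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with text j.
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Matching pairs lie on the diagonal; optimize in both directions.
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.t(), targets)
    return (loss_image_to_text + loss_text_to_image) / 2
```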
Guide: Running Locally
To run the model locally, follow these steps:
- Install Packages: Ensure you have transformers, torch, and PIL (Pillow) installed. Use pip to install them if necessary.
- Load Model and Processor:
```python
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")
```
- Prepare Data: Load your image and text data. Images can be loaded using PIL, and text is prepared as a list of strings.
- Compute Features and Similarity: Process and normalize the image and text features, then calculate similarity scores, as shown in the sketch after this list.
- Cloud GPUs: For enhanced performance, consider using cloud GPU services like AWS, Google Cloud, or Azure.
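Putting steps 3 and 4 together, the following is a minimal end-to-end sketch using the transformers API; the image path and candidate captions are placeholder assumptions, and the similarity scores come from the model's image-text logits.

```python
import torch
from PIL import Image
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14-336px")

# Prepare data: a local image and a list of candidate captions (placeholders).
image = Image.open("example.jpg")
texts = ["一只猫的照片", "一只狗的照片", "一辆自行车的照片"]

with torch.no_grad():
    # Compute and normalize image features.
    image_inputs = processor(images=image, return_tensors="pt")
    image_features = model.get_image_features(**image_inputs)
    image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)

    # Compute and normalize text features.
    text_inputs = processor(text=texts, padding=True, return_tensors="pt")
    text_features = model.get_text_features(**text_inputs)
    text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)

    # Image-text similarity scores as probabilities over the candidate captions.
    inputs = processor(text=texts, images=image, padding=True, return_tensors="pt")
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)

print(probs)  # one probability per candidate caption
```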
License
The Chinese CLIP model and associated code are available under the terms specified in the GitHub repository. Ensure compliance with the licensing terms when using or modifying the model.