Chinese-CLIP-ViT-Base-Patch16
OFA-Sys

Introduction
Chinese CLIP is a base-size model that uses ViT-B/16 as its image encoder and RoBERTa-wwm-base as its text encoder. It was trained on a large-scale dataset of approximately 200 million Chinese image-text pairs. More information is available in the technical report and the GitHub repository.
Architecture
The architecture consists of two components: a Vision Transformer (ViT-B/16) that encodes images and a RoBERTa-wwm-base model that encodes text. The two encoders map their inputs into a shared embedding space, which lets Chinese CLIP compute embeddings for images and texts and score their similarity.
Training
Training used the large-scale dataset described above, with a focus on contrastive learning to align visual and textual representations. The model has been evaluated on benchmarks including the MUGE, Flickr30K-CN, and COCO-CN retrieval tasks, demonstrating strong performance in both zero-shot and fine-tuned settings.
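Contrastive training pulls matched image-text pairs together in the embedding space and pushes mismatched pairs apart. Below is a minimal PyTorch sketch of a CLIP-style contrastive objective, not the authors' exact training code; the actual recipe (temperature handling, pretraining stages, and so on) is described in the technical report.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    # Unit-normalize so dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity matrix for a batch: entry (i, j) scores image i against text j
    logits = image_emb @ text_emb.t() / temperature
    # Matched image-text pairs sit on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image-to-text and text-to-image directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```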
Guide: Running Locally
- Install Required Libraries:

```bash
pip install transformers pillow requests
```
- Load Model and Processor:

```python
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
```
- Process an Image and Texts: load an image from a URL and compute its embedding:

```python
from PIL import Image
import requests

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
```
- Compute and Normalize Features: encode the candidate texts, then L2-normalize both feature sets for image-text similarity scoring (see the sketch below).
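A sketch of this step, continuing from the variables defined above and using the standard Transformers API for this model (`get_text_features`); taking a softmax over the cosine similarities is one common way to turn scores into per-text probabilities:

```python
# Encode the candidate texts
text_inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**text_inputs)

# L2-normalize both feature sets so dot products equal cosine similarities
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)

# Similarity of the image to each text, expressed as probabilities
similarity = (image_features @ text_features.T).softmax(dim=-1)
print(similarity)
```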
- Recommendation: use cloud GPUs from providers such as AWS, Google Cloud, or Azure for efficient processing, especially with large datasets.
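If a GPU is available, moving the model and inputs to it is straightforward; a minimal sketch continuing from the snippets above:

```python
import torch

# Use a GPU when available; fall back to CPU otherwise
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Inputs produced by the processor must live on the same device as the model
inputs = processor(images=image, return_tensors="pt").to(device)
image_features = model.get_image_features(**inputs)
```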
License
The model and associated code are released under an open-source license, allowing for both personal and commercial use, subject to the terms specified in the repository's license file.