chinese-clip-vit-large-patch14
OFA-Sys
Introduction
This model is the large version of Chinese CLIP. It uses ViT-L/14 as the image encoder and RoBERTa-wwm-base as the text encoder, and it is trained on a large-scale dataset of approximately 200 million Chinese image-text pairs. More details can be found in the technical report and the GitHub repository.
Architecture
Chinese CLIP is a dual-encoder architecture: a vision transformer (ViT-L/14) processes images and a RoBERTa-wwm-base model processes text. Each encoder produces an embedding, and the similarity between an image embedding and a text embedding indicates how well the image and the text match.
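To make the similarity computation concrete, here is a minimal sketch in plain Python. It assumes the similarity is cosine similarity between L2-normalized embeddings, which matches the normalization step in the usage code later in this document. The toy two-dimensional vectors and the helper names `l2_normalize` and `similarity_matrix` are hypothetical stand-ins, not part of the model's API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def similarity_matrix(image_embs, text_embs):
    """Return one row per image and one column per text, holding cosine similarities."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts] for img in images]


# Toy embeddings standing in for encoder outputs (hypothetical values).
image_embs = [[1.0, 0.0], [0.0, 2.0]]
text_embs = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

sims = similarity_matrix(image_embs, text_embs)
for row in sims:
    print([round(s, 3) for s in row])
```

Because both sides are normalized, the vector lengths do not affect the score; only the direction of each embedding matters.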
Training
The model was trained on a dataset of around 200 million Chinese image-text pairs. This contrastive image-text training gives the model zero-shot capabilities: it can make reasonably accurate predictions on tasks it was not specifically trained for. The model was evaluated on several text-to-image and image-to-text retrieval benchmarks, such as MUGE, Flickr30K-CN, and COCO-CN.
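Retrieval benchmarks of this kind are commonly scored with Recall@K, the fraction of queries whose correct match appears among the top K results. The text above does not name the metric, so treat the following as an illustrative sketch rather than the authors' evaluation code. The function `recall_at_k` and the score values are hypothetical.

```python
def recall_at_k(similarities, ground_truth, k):
    """Fraction of queries whose correct candidate is among the k highest-scoring ones.

    similarities: one row of scores per query, one column per candidate.
    ground_truth: index of the correct candidate for each query.
    """
    hits = 0
    for scores, target in zip(similarities, ground_truth):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(ground_truth)


# Hypothetical scores: 3 text queries against 4 candidate images.
similarities = [
    [0.9, 0.1, 0.3, 0.2],  # correct image (index 0) is ranked first
    [0.2, 0.4, 0.8, 0.1],  # correct image (index 1) is ranked second
    [0.5, 0.6, 0.1, 0.7],  # correct image (index 2) is ranked last
]
ground_truth = [0, 1, 2]

r1 = recall_at_k(similarities, ground_truth, k=1)
r2 = recall_at_k(similarities, ground_truth, k=2)
print(r1, r2)
```

The same function covers image-to-text retrieval if the rows hold image queries and the columns hold candidate texts.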
Guide: Running Locally
To run the Chinese CLIP model locally, follow these steps:
- Install Required Libraries: the snippets below return PyTorch tensors, so PyTorch is needed alongside Transformers:

    pip install transformers torch pillow requests
- Load the Model: Use the following Python snippet to load the model and processor:

    from PIL import Image
    import requests
    from transformers import ChineseCLIPProcessor, ChineseCLIPModel

    model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14")
    processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14")
- Process Images and Text: Download an image and prepare the candidate texts. Compute the image and text features, then the image-text similarities:

    url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
    image = Image.open(requests.get(url, stream=True).raw)
    # Candidate labels: Squirtle, Bulbasaur, Charmander, Pikachu
    texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

    # Compute image features and L2-normalize them
    inputs = processor(images=image, return_tensors="pt")
    image_features = model.get_image_features(**inputs)
    image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)

    # Compute text features and L2-normalize them
    inputs = processor(text=texts, padding=True, return_tensors="pt")
    text_features = model.get_text_features(**inputs)
    text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)

    # Compute image-text similarity scores in a single forward pass
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # one row per image, one column per text
    probs = logits_per_image.softmax(dim=1)  # probabilities over the candidate texts
- Hardware Recommendations: A GPU substantially speeds up inference, especially when processing large datasets. If no local GPU is available, consider cloud GPU instances such as those from AWS, Google Cloud, or Azure.
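The `probs` tensor from the processing step holds one probability per candidate text, in the same order as `texts`. A minimal sketch of turning those probabilities into a ranked list of labels follows. The helper `rank_labels` and the probability values are hypothetical; with the real model you would pass something like `probs[0].detach().tolist()` instead of `example_probs`.

```python
def rank_labels(probabilities, labels):
    """Pair each label with its probability and sort from most to least likely."""
    if len(probabilities) != len(labels):
        raise ValueError("expected one probability per label")
    return sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)


texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
# Hypothetical probabilities for a single image, not real model output.
example_probs = [0.01, 0.02, 0.03, 0.94]

ranking = rank_labels(example_probs, texts)
best_label, best_prob = ranking[0]
print(best_label, best_prob)
```

Keeping the full ranking, rather than only the top label, makes it easy to inspect close calls between candidates.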
License
Please refer to the GitHub repository for licensing information.