chinese-clip-vit-base-patch16

OFA-Sys

Introduction

Chinese CLIP is a base-version model using ViT-B/16 for image encoding and RoBERTa-wwm-base for text encoding. It was trained on a large-scale dataset of approximately 200 million Chinese image-text pairs. More information is available in the technical report and the GitHub repository.

Architecture

The architecture consists of two main components: a Vision Transformer (ViT-B/16) for processing images and a RoBERTa-wwm-base model for processing text. This combination allows Chinese CLIP to effectively compute embeddings and similarities between image and text data.
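The dual-encoder idea can be sketched abstractly: each tower maps its modality into a shared embedding space, and similarity is a dot product of unit-normalized vectors. Below is a minimal sketch in which random linear projections stand in for the real ViT and RoBERTa towers; the 768-dim inputs and 512-dim shared space match the base model's sizes, but everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two towers: each projects its modality into a
# shared 512-dim embedding space (random weights, illustration only).
W_image = rng.standard_normal((768, 512))  # ViT-B/16 hidden size -> shared space
W_text = rng.standard_normal((768, 512))   # RoBERTa-wwm-base hidden size -> shared space

def embed(features, W):
    # Project, then L2-normalize so dot products are cosine similarities
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

image_emb = embed(rng.standard_normal((1, 768)), W_image)   # one image
text_emb = embed(rng.standard_normal((4, 768)), W_text)     # four candidate captions

# Cosine similarity between the image and each candidate caption
sims = image_emb @ text_emb.T  # shape (1, 4)
```

In the real model the projections are learned jointly, so matching image-text pairs end up close in the shared space.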

Training

Training of the Chinese CLIP model involved a large-scale dataset with a focus on contrastive learning to align visual and textual representations. The model is evaluated on various benchmarks, including MUGE, Flickr30K-CN, and COCO-CN retrieval tasks, demonstrating strong performance in both zero-shot and fine-tuned settings.
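The contrastive objective used by CLIP-style models treats the matching image-text pairs in a batch as positives and all other pairings as negatives. A minimal NumPy sketch of this symmetric loss (the temperature value and embedding sizes here are illustrative, not the model's actual hyperparameters):

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    # Pairwise similarity matrix over the batch; matching pairs on the diagonal
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetric loss: image->text and text->image directions
    return (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)) / 2

# Toy usage with random embeddings standing in for encoder outputs
rng = np.random.default_rng(0)
loss = contrastive_loss(rng.standard_normal((4, 256)), rng.standard_normal((4, 256)))
```

Minimizing this loss pulls each image toward its own caption and pushes it away from the other captions in the batch.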

Guide: Running Locally

  1. Install Required Libraries:

    pip install transformers pillow requests
    
  2. Load Model and Processor:

    from transformers import ChineseCLIPProcessor, ChineseCLIPModel
    model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
    processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
    
  3. Process an Image and Texts: Load an image via URL, define candidate captions, and compute the image embedding:

    from PIL import Image
    import requests
    
    url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
    image = Image.open(requests.get(url, stream=True).raw)
    texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
    
    inputs = processor(images=image, return_tensors="pt")
    image_features = model.get_image_features(**inputs)
    
  4. Compute and Normalize Features: Normalize the computed features for image-text similarity scoring.
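Step 4 can be sketched as follows. Random tensors stand in for the outputs of `get_image_features` and `get_text_features` so the snippet is self-contained; in practice you would normalize the real feature tensors from step 3 the same way.

```python
import torch

# Stand-ins for encoder outputs (hypothetical shapes: 1 image,
# 4 candidate texts, 512-dim embeddings as in the base model)
image_features = torch.randn(1, 512)
text_features = torch.randn(4, 512)

# L2-normalize so the dot product is cosine similarity
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)

# Similarity logits, then a softmax over the candidate captions
logits = image_features @ text_features.T  # shape (1, 4)
probs = logits.softmax(dim=-1)
```

The caption with the highest probability is the model's best match for the image.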

  5. Recommendation: Use cloud GPUs from providers like AWS, Google Cloud, or Azure for efficient processing, especially for large datasets.

License

The model and associated code are released under an open-source license, allowing for both personal and commercial use, subject to the terms specified in the repository's license file.
