Taiyi-CLIP-Roberta-102M-Chinese
IDEA-CCNL
Introduction
Taiyi-CLIP-Roberta-102M-Chinese is the first open-source Chinese CLIP model, built for vision-language representation learning. It uses a RoBERTa-base text encoder and is pre-trained on 123 million image-text pairs.
Architecture
The model uses chinese-roberta-wwm as the language encoder and the ViT-B-32 architecture from CLIP as the vision encoder. The vision encoder is frozen during pre-training, so only the language encoder is trained, which makes pre-training faster and more stable. The pre-training data are the Noah-Wukong (100M pairs) and Zero (23M pairs) datasets.
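The frozen-vision-encoder setup can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' training code: the two `nn.Linear` layers are hypothetical stand-ins for ViT-B-32 and chinese-roberta-wwm, the random tensors are toy data, and the symmetric contrastive loss with a scale of 1/0.07 is assumed from the standard CLIP recipe, since the text does not spell out the objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the real towers (ViT-B-32 and chinese-roberta-wwm).
vision_encoder = nn.Linear(16, 8)  # plays the role of the frozen image tower
text_encoder = nn.Linear(12, 8)    # plays the role of the trainable text tower

# Freeze the vision tower: no gradients, and eval mode.
for p in vision_encoder.parameters():
    p.requires_grad = False
vision_encoder.eval()

# Only the text tower's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(text_encoder.parameters(), lr=0.1)

def clip_loss(image_emb, text_emb, logit_scale=1 / 0.07):
    """Symmetric cross-entropy over the image-text similarity matrix (assumed objective)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()
    targets = torch.arange(logits.size(0))  # pair i matches caption i
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

images = torch.randn(4, 16)  # toy batch of 4 "images"
texts = torch.randn(4, 12)   # the 4 matching "captions"

vision_before = vision_encoder.weight.clone()
text_before = text_encoder.weight.clone()

# One training step: image features are computed without building a graph.
with torch.no_grad():
    image_emb = vision_encoder(images)
loss = clip_loss(image_emb, text_encoder(texts))
optimizer.zero_grad()
loss.backward()
optimizer.step()

vision_unchanged = torch.equal(vision_encoder.weight, vision_before)
text_changed = not torch.equal(text_encoder.weight, text_before)
print(vision_unchanged, text_changed)
```

Because the image tower never receives gradients, its features could also be computed once and cached, which is one plausible reason the frozen setup trains faster.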
Training
The model was trained for 24 epochs, which took 7 days on 32 A100 GPUs. It is described as the first open-source Chinese CLIP model available in the Hugging Face community.
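For a sense of scale, the quoted figures can be turned into a rough throughput estimate. This is back-of-envelope arithmetic only; it assumes each epoch is one full pass over all 123M pairs and that the 7 days are wall-clock time with all 32 GPUs busy throughout, neither of which the text confirms.

```python
# Figures quoted in the text.
pairs_per_epoch = 100_000_000 + 23_000_000  # Noah-Wukong + Zero
epochs = 24
days = 7
gpus = 32

# Assumptions: one full pass over the data per epoch, 7 days of wall-clock time.
samples_seen = pairs_per_epoch * epochs
gpu_hours = days * 24 * gpus
samples_per_second = samples_seen / (days * 24 * 3600)
samples_per_gpu_second = samples_per_second / gpus

print(f"samples seen:       {samples_seen:,}")
print(f"GPU-hours:          {gpu_hours:,}")
print(f"cluster throughput: {samples_per_second:,.0f} pairs/s")
print(f"per-GPU throughput: {samples_per_gpu_second:,.1f} pairs/s")
```

Under these assumptions the run processes about 2.95 billion pairs in 5,376 GPU-hours, or roughly 4,881 pairs per second across the cluster (about 152.5 per GPU).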
Guide: Running Locally
- Install Dependencies:
    pip install torch transformers pillow requests
- Load and Use the Model:
    from PIL import Image
    import requests
    import torch
    from transformers import BertTokenizer, BertForSequenceClassification, CLIPProcessor, CLIPModel

    # Candidate captions: "a cat", "a dog"
    query_texts = ["一只猫", "一只狗"]

    # Taiyi Chinese text encoder and its tokenizer
    text_tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese")
    text_encoder = BertForSequenceClassification.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese").eval()
    text = text_tokenizer(query_texts, return_tensors='pt', padding=True)['input_ids']

    # OpenAI CLIP ViT-B-32 image encoder and its preprocessor
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    image = processor(images=Image.open(requests.get(url, stream=True).raw), return_tensors="pt")

    with torch.no_grad():
        image_features = clip_model.get_image_features(**image)
        # The classification head's logits serve as the text embedding
        text_features = text_encoder(text).logits

        # L2-normalize so the dot product is a cosine similarity
        image_features = image_features / image_features.norm(dim=1, keepdim=True)
        text_features = text_features / text_features.norm(dim=1, keepdim=True)

        # Scale the similarities and convert them to probabilities over the captions
        logit_scale = clip_model.logit_scale.exp()
        logits_per_image = logit_scale * image_features @ text_features.t()
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()

    print(probs)
- Cloud GPUs: For faster processing, consider running the model on a GPU instance from a cloud provider such as AWS, Google Cloud, or Azure.
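The scoring step at the end of the usage code (normalize, scaled dot product, softmax) can be written out in plain Python to show what the printed probabilities mean. This is an illustrative sketch: the 3-dimensional vectors are made up, the helper names are hypothetical, and the scale of 100.0 merely stands in for `clip_model.logit_scale.exp()`.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def label_probabilities(image_vec, text_vecs, logit_scale):
    """Probability of each caption given one image embedding."""
    image_unit = l2_normalize(image_vec)
    scores = []
    for text_vec in text_vecs:
        text_unit = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        scores.append(logit_scale * cosine)
    return softmax(scores)

# Made-up embeddings for illustration only; real ones come from the two encoders.
image_vec = [0.9, 0.1, 0.2]
text_vecs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.3]]

# Assumed scale; in the usage code it is clip_model.logit_scale.exp().
probs = label_probabilities(image_vec, text_vecs, logit_scale=100.0)
print(probs)
```

A large scale sharpens the softmax, so even a modest gap in cosine similarity pushes nearly all of the probability onto the best-matching caption.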
License
The Taiyi-CLIP-Roberta-102M-Chinese model is licensed under the Apache-2.0 License, allowing for open use and distribution under specified conditions.