Taiyi-CLIP-Roberta-102M-Chinese

IDEA-CCNL

Introduction

Taiyi-CLIP-Roberta-102M-Chinese is the first open-source Chinese CLIP model released in the Hugging Face community. It pairs a RoBERTa-base Chinese text encoder with CLIP's image encoder to produce aligned visual-language representations, and is pre-trained on 123 million Chinese image-text pairs.

Architecture

The model uses chinese-roberta-wwm as the language encoder and the ViT-B/32 architecture from CLIP as the vision encoder. During pre-training the vision encoder is frozen and only the language encoder is fine-tuned, which speeds up pre-training and keeps it stable. Pre-training uses the Noah-Wukong (100M) and Zero (23M) image-text datasets.
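
As a minimal illustration of this freezing strategy (not the actual Fengshenbang training code), the sketch below keeps the CLIP vision tower frozen and trains only a Chinese RoBERTa text tower with an in-batch contrastive loss. The chinese-roberta-wwm-ext checkpoint name, the 512-dimensional projection via a classification head, and the batch inputs are assumptions made for the example, chosen to be consistent with how the released model exposes text features via `.logits` in the guide below.

    import torch
    import torch.nn.functional as F
    from transformers import BertForSequenceClassification, CLIPModel

    # Frozen CLIP vision tower: its weights receive no gradient updates.
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    for p in clip_model.parameters():
        p.requires_grad = False

    # Trainable Chinese RoBERTa text tower; the classification head serves as a
    # projection into CLIP's 512-d image feature space (assumption, matching how
    # the released model exposes text features through .logits).
    text_encoder = BertForSequenceClassification.from_pretrained(
        "hfl/chinese-roberta-wwm-ext", num_labels=512
    )

    def contrastive_loss(input_ids, pixel_values, temperature=0.07):
        # Image features come from the frozen encoder, so no_grad is safe here.
        with torch.no_grad():
            img = clip_model.get_image_features(pixel_values=pixel_values)
        txt = text_encoder(input_ids).logits
        img = F.normalize(img, dim=-1)
        txt = F.normalize(txt, dim=-1)
        # In-batch InfoNCE: matching image-text pairs sit on the diagonal.
        logits = txt @ img.t() / temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2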

Training

The model was trained for 24 epochs, which took about 7 days on 32 A100 GPUs.

Guide: Running Locally

  1. Install Dependencies:

    pip install torch transformers pillow requests
    
  2. Load and Use the Model (a text-to-image retrieval variant is sketched after this guide):

    from PIL import Image
    import requests
    import torch
    from transformers import BertTokenizer, BertForSequenceClassification, CLIPProcessor, CLIPModel
    
    # Chinese captions to score against the image; replace with any queries.
    query_texts = ["一只猫", "一只狗"]
    
    # Taiyi Chinese text encoder; its classification head outputs features in the
    # same 512-d space as the CLIP image features.
    text_tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese")
    text_encoder = BertForSequenceClassification.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese").eval()
    text = text_tokenizer(query_texts, return_tensors='pt', padding=True)['input_ids']
    
    # Original (frozen) CLIP image encoder and its preprocessor.
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    image = processor(images=Image.open(requests.get(url, stream=True).raw), return_tensors="pt")
    
    with torch.no_grad():
        image_features = clip_model.get_image_features(**image)
        text_features = text_encoder(text).logits
        # Normalize both feature sets so the dot product is a cosine similarity.
        image_features = image_features / image_features.norm(dim=1, keepdim=True)
        text_features = text_features / text_features.norm(dim=1, keepdim=True)
        # logit_scale is CLIP's learned temperature.
        logit_scale = clip_model.logit_scale.exp()
        logits_per_image = logit_scale * image_features @ text_features.t()
        # Probability of each caption matching the image.
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()
        print(probs)
    
  3. Cloud GPUs: For faster inference and larger batches, consider GPU instances from cloud providers such as AWS, Google Cloud, or Azure.
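
The code in step 2 ranks several Chinese captions against one image. For the reverse direction, ranking several candidate images against a single Chinese query, the same objects from step 2 can be reused. The sketch below is an illustrative variant rather than part of the original model card; the image URL list is a placeholder to be replaced with your own images.

    # Reuses text_tokenizer, text_encoder, clip_model, and processor from step 2.
    image_urls = [
        "http://images.cocodataset.org/val2017/000000039769.jpg",
        # Add further image URLs of your own here.
    ]
    images = [Image.open(requests.get(u, stream=True).raw) for u in image_urls]
    inputs = processor(images=images, return_tensors="pt")
    query = text_tokenizer(["一只猫"], return_tensors="pt")["input_ids"]
    
    with torch.no_grad():
        image_features = clip_model.get_image_features(**inputs)
        text_features = text_encoder(query).logits
        image_features = image_features / image_features.norm(dim=1, keepdim=True)
        text_features = text_features / text_features.norm(dim=1, keepdim=True)
        # One row per query text, one column per candidate image.
        logits_per_text = clip_model.logit_scale.exp() * text_features @ image_features.t()
        print(logits_per_text.softmax(dim=-1).cpu().numpy())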

License

The Taiyi-CLIP-Roberta-102M-Chinese model is licensed under the Apache-2.0 License, allowing for open use and distribution under specified conditions.
