chinese-clip-vit-base-patch16

OFA-Sys

Introduction

Chinese CLIP is a base-version model using ViT-B/16 for image encoding and RoBERTa-wwm-base for text encoding. It was trained on a large-scale dataset of approximately 200 million Chinese image-text pairs. More information is available in the technical report and the GitHub repository.

Architecture

The architecture consists of two main components: a Vision Transformer (ViT-B/16) for processing images and a RoBERTa-wwm-base model for processing text. This combination allows Chinese CLIP to effectively compute embeddings and similarities between image and text data.
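The dual-encoder idea can be sketched abstractly: each tower maps its modality into a shared embedding space, and similarity is a dot product of unit-normalized vectors. Below is a minimal sketch in which random linear projections stand in for the real ViT and RoBERTa towers; the 768-dim inputs and 512-dim shared space match the base model's sizes, but everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two towers: each projects its modality into a
# shared 512-dim embedding space (random weights, illustration only).
W_image = rng.standard_normal((768, 512))  # ViT-B/16 hidden size -> shared space
W_text = rng.standard_normal((768, 512))   # RoBERTa-wwm-base hidden size -> shared space

def embed(features, W):
    # Project, then L2-normalize so dot products are cosine similarities
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

image_emb = embed(rng.standard_normal((1, 768)), W_image)   # one image
text_emb = embed(rng.standard_normal((4, 768)), W_text)     # four candidate captions

# Cosine similarity between the image and each candidate caption
sims = image_emb @ text_emb.T  # shape (1, 4)
```

In the real model the projections are learned jointly, so matching image-text pairs end up close in the shared space.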

Training

Training of the Chinese CLIP model involved a large-scale dataset with a focus on contrastive learning to align visual and textual representations. The model is evaluated on various benchmarks, including MUGE, Flickr30K-CN, and COCO-CN retrieval tasks, demonstrating strong performance in both zero-shot and fine-tuned settings.
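The contrastive objective used by CLIP-style models treats the matching image-text pairs in a batch as positives and all other pairings as negatives. A minimal NumPy sketch of this symmetric loss (the temperature value and embedding sizes here are illustrative, not the model's actual hyperparameters):

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    # Pairwise similarity matrix over the batch; matching pairs on the diagonal
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetric loss: image->text and text->image directions
    return (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)) / 2

# Toy usage with random embeddings standing in for encoder outputs
rng = np.random.default_rng(0)
loss = contrastive_loss(rng.standard_normal((4, 256)), rng.standard_normal((4, 256)))
```

Minimizing this loss pulls each image toward its own caption and pushes it away from the other captions in the batch.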

Guide: Running Locally

  1. Install Required Libraries:

    pip install transformers pillow requests
    
  2. Load Model and Processor:

    from transformers import ChineseCLIPProcessor, ChineseCLIPModel
    model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
    processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
    
  3. Process an Image and Texts: Load an image via URL, define candidate captions, and compute the image embedding:

    from PIL import Image
    import requests
    
    url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
    image = Image.open(requests.get(url, stream=True).raw)
    texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
    
    inputs = processor(images=image, return_tensors="pt")
    image_features = model.get_image_features(**inputs)
    
  4. Compute and Normalize Features: Normalize the computed features for image-text similarity scoring.
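Step 4 can be sketched as follows. Random tensors stand in for the outputs of `get_image_features` and `get_text_features` so the snippet is self-contained; in practice you would normalize the real feature tensors from step 3 the same way.

```python
import torch

# Stand-ins for encoder outputs (hypothetical shapes: 1 image,
# 4 candidate texts, 512-dim embeddings as in the base model)
image_features = torch.randn(1, 512)
text_features = torch.randn(4, 512)

# L2-normalize so the dot product is cosine similarity
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)

# Similarity logits, then a softmax over the candidate captions
logits = image_features @ text_features.T  # shape (1, 4)
probs = logits.softmax(dim=-1)
```

The caption with the highest probability is the model's best match for the image.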

  5. Recommendation: Use cloud GPUs from providers like AWS, Google Cloud, or Azure for efficient processing, especially for large datasets.

License

The model and associated code are released under an open-source license, allowing for both personal and commercial use, subject to the terms specified in the repository's license file.
