Chinese CLIP ViT-Large-Patch14

OFA-Sys

Introduction

This model is the large version of Chinese CLIP, with ViT-L/14 as the image encoder and RoBERTa-wwm-base as the text encoder. It was trained on a large-scale dataset of approximately 200 million Chinese image-text pairs. More details are available in the technical report and the GitHub repository.

Architecture

Chinese CLIP is a dual-encoder model: a vision transformer (ViT-L/14) encodes images and a RoBERTa-wwm-base model encodes text. The two encoders map their inputs into a shared embedding space, where image-text similarity is computed between the resulting embeddings.
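
The minimal sketch below (not the library's internal implementation) illustrates that similarity step: both embeddings are L2-normalized so their dot product is a cosine similarity. The random tensors and the 768-dimensional projection size are stand-ins for real encoder outputs.

    import torch

    # Stand-in embeddings; in practice these come from get_image_features and
    # get_text_features (a 768-dimensional projection space is assumed here)
    image_embeds = torch.randn(1, 768)
    text_embeds = torch.randn(4, 768)

    # L2-normalize so the dot product equals cosine similarity
    image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)

    # One similarity score per (image, text) pair, shape (1, 4)
    similarity = image_embeds @ text_embeds.T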

Training

The model was trained on roughly 200 million Chinese image-text pairs. This large-scale pre-training gives it zero-shot capabilities, meaning it can perform reasonably well on tasks it was not explicitly fine-tuned for. Performance was evaluated on several text-to-image and image-to-text retrieval benchmarks, including MUGE, Flickr30K-CN, and COCO-CN.
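
As a rough illustration of how such retrieval benchmarks are scored (a generic sketch, not the official evaluation code), Recall@K counts how often the ground-truth match appears among the top-K most similar candidates; the similarity matrix below is a random placeholder.

    import torch

    # Placeholder similarity matrix: row i holds the scores between query i
    # and all candidates, with the ground-truth candidate assumed at column i
    sim = torch.randn(100, 100)

    def recall_at_k(sim, k):
        # Top-k candidate indices for every query
        topk = sim.topk(k, dim=-1).indices
        targets = torch.arange(sim.size(0)).unsqueeze(-1)
        # Fraction of queries whose ground-truth candidate is in the top k
        return (topk == targets).any(dim=-1).float().mean().item()

    print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))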

Guide: Running Locally

To run the Chinese CLIP model locally, follow these steps:

  1. Install Required Libraries:

    pip install torch transformers pillow requests
    
  2. Load the Model: Use the following Python snippet to load the model and processor:

    from PIL import Image
    import requests
    from transformers import ChineseCLIPProcessor, ChineseCLIPModel
    
    model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14")
    processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14")
    
  3. Process Images and Text: Download an image and prepare the candidate texts. Compute image and text features, then their similarities and softmax probabilities (a short follow-up for mapping the probabilities back to the captions appears after this list):

    url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
    image = Image.open(requests.get(url, stream=True).raw)
    texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
    
    inputs = processor(images=image, return_tensors="pt")
    image_features = model.get_image_features(**inputs)
    image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
    
    inputs = processor(text=texts, padding=True, return_tensors="pt")
    text_features = model.get_text_features(**inputs)
    text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)
    
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)
    
  4. Hardware Recommendations: For optimal performance, especially when processing large datasets, consider a cloud GPU such as those available from AWS, Google Cloud, or Azure (a sketch for moving the model and inputs onto a GPU follows this list).
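
As a follow-up to step 3 (a small sketch, assuming texts and probs from that step are still in scope), the probabilities can be mapped back to the candidate captions to read off the best match:

    # Print each candidate caption with its probability
    for text, prob in zip(texts, probs[0].tolist()):
        print(f"{text}: {prob:.4f}")

    # Index of the highest-probability caption
    best = probs[0].argmax().item()
    print("Best match:", texts[best])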
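
For step 4, if a CUDA GPU is available, inference can be moved onto it; this is a minimal sketch assuming the model, processor, image, and texts from the previous steps are still in scope:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    # Re-run the joint forward pass with inputs placed on the same device
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=1)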

License

Please refer to the GitHub repository for licensing information.
