Chinese-CLIP-ViT-Huge-Patch14

OFA-Sys

Introduction

Chinese CLIP with ViT-H/14 as the image encoder and RoBERTa-wwm-large as the text encoder is a large-scale vision-language model designed for Chinese image-text representation and retrieval. It is pre-trained on approximately 200 million Chinese image-text pairs. More details can be found in the technical report and the GitHub repository.

Architecture

The model uses a Vision Transformer (ViT-H/14) for image encoding and RoBERTa-wwm-large for text encoding. This architecture supports tasks such as zero-shot image classification and image-text retrieval by computing embeddings and similarity scores for images and text.
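
A minimal sketch of this dual-encoder scoring, using random stand-in embeddings in plain PyTorch; the embedding dimension and batch sizes below are illustrative assumptions, not the model's actual configuration:

    import torch

    # Stand-in embeddings for the two encoders' outputs; the real model
    # projects both modalities into a shared embedding space.
    image_embeds = torch.randn(2, 768)   # 2 images (dimension is illustrative)
    text_embeds = torch.randn(5, 768)    # 5 candidate labels or captions

    # L2-normalize so the dot product equals cosine similarity
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

    similarity = image_embeds @ text_embeds.T   # one row per image, one column per text
    probs = similarity.softmax(dim=-1)          # zero-shot classification over the texts
    print(probs.shape)                          # torch.Size([2, 5])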

Training

Training details, including data handling and model optimization, can be found in the GitHub repository. The model is pre-trained on a large dataset of Chinese image-text pairs, facilitating strong performance in tasks involving Chinese language and visual data.

Guide: Running Locally

  1. Install Dependencies: Ensure you have Python, PyTorch, and the transformers library installed.
  2. Load the Model:
    from transformers import ChineseCLIPProcessor, ChineseCLIPModel
    
    model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")
    processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")
    
  3. Process Images and Text: Use the processor to convert images and text into tensors (see the sketch after this list).
  4. Compute Features: Calculate image and text features, normalize them, and compute similarity scores, as shown below.
  5. Utilize Cloud GPUs: For large-scale or intensive tasks, consider using cloud services such as AWS, Azure, or Google Cloud for GPU resources.
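
A minimal end-to-end sketch for steps 3 and 4 is shown below. The image path `example.jpg` and the Chinese candidate labels are illustrative assumptions; swap in your own data.

    # pip install torch transformers pillow   (dependencies from step 1)
    import torch
    from PIL import Image
    from transformers import ChineseCLIPModel, ChineseCLIPProcessor

    model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")
    processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-huge-patch14")

    image = Image.open("example.jpg")          # assumed local image file
    texts = ["皮卡丘", "杰尼龟", "妙蛙种子"]     # illustrative candidate labels

    # Step 3: convert the image and candidate texts into tensors
    inputs = processor(text=texts, images=image, padding=True, return_tensors="pt")

    # Step 4: compute image and text features, normalize them, and score similarity
    with torch.no_grad():
        image_features = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_features = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)

    similarity = image_features @ text_features.T   # cosine similarity scores
    probs = similarity.softmax(dim=-1)              # zero-shot label probabilities
    print(probs)

Calling `model(**inputs)` directly also returns `logits_per_image`, which gives the same image-to-text similarity scores in a single forward pass.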

License

For information on licensing, please refer to the official GitHub repository or the Hugging Face model card for Chinese CLIP.
