Japanese CLIP ViT-B-16

rinna

Introduction

The Japanese CLIP ViT-B-16 model, developed by rinna Co., Ltd., is a Contrastive Language-Image Pre-Training (CLIP) model that maps Japanese text and images into a shared embedding space. It uses a ViT-B/16 Transformer as the image encoder and a 12-layer BERT as the text encoder.

Architecture

The model's architecture consists of:

  • Image Encoder: ViT-B/16 Transformer, initialized from the AugReg vit-base-patch16-224 model.
  • Text Encoder: 12-layer BERT model.
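
Both encoders project into a shared embedding space, which is what allows images and Japanese text to be compared directly. As a quick sketch (reusing the japanese_clip calls shown in the guide below, and assuming the package and model weights are available), the following prints the shapes of one image embedding and one text embedding:

    import torch
    from PIL import Image
    import japanese_clip as ja_clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", device=device)
    tokenizer = ja_clip.load_tokenizer()

    # A blank RGB image is enough to inspect output shapes.
    image = preprocess(Image.new("RGB", (224, 224))).unsqueeze(0).to(device)
    encodings = ja_clip.tokenize(texts=["犬"], max_seq_len=77, device=device, tokenizer=tokenizer)

    with torch.no_grad():
        image_emb = model.get_image_features(image)
        text_emb = model.get_text_features(**encodings)

    # The two embeddings share the same dimensionality, so cosine similarity is well defined.
    print(image_emb.shape, text_emb.shape)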

Training

The model was trained on the CC12M dataset with its captions translated into Japanese, so that images and their Japanese captions are aligned in the shared embedding space.
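
rinna's exact training code is not reproduced here, but contrastive pre-training of this kind typically optimizes a symmetric CLIP-style loss over batches of paired image and text embeddings. The sketch below illustrates that objective; the function name and the temperature value of 0.07 are assumptions made for this example.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_features, text_features, temperature=0.07):
        # Normalize so that dot products become cosine similarities.
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        # logits[i][j] scores image i against caption j across the whole batch.
        logits = image_features @ text_features.T / temperature

        # Matching image-caption pairs sit on the diagonal; each image is contrasted
        # against every caption in the batch, and vice versa.
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_image_to_text = F.cross_entropy(logits, targets)
        loss_text_to_image = F.cross_entropy(logits.T, targets)
        return (loss_image_to_text + loss_text_to_image) / 2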

Guide: Running Locally

  1. Install Package

    $ pip install git+https://github.com/rinnakk/japanese-clip.git
    
  2. Run the Model

    import io
    import requests
    from PIL import Image
    import torch
    import japanese_clip as ja_clip

    # Use a GPU if one is available; the model also runs on CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Download the model weights (cached under cache_dir) and the matching image preprocessor.
    model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", cache_dir="/tmp/japanese_clip", device=device)
    tokenizer = ja_clip.load_tokenizer()

    # Fetch an example photo and prepare it as a batch of one image tensor.
    response = requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260')
    img = Image.open(io.BytesIO(response.content))
    image = preprocess(img).unsqueeze(0).to(device)

    # Tokenize the Japanese candidate labels: "dog", "cat", "elephant".
    encodings = ja_clip.tokenize(
        texts=["犬", "猫", "象"],
        max_seq_len=77,
        device=device,
        tokenizer=tokenizer,
    )

    with torch.no_grad():
        # Embed the image and the candidate texts, then turn their similarities into probabilities.
        image_features = model.get_image_features(image)
        text_features = model.get_text_features(**encodings)
        text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    print("Label probs:", text_probs)
    
  3. Cloud GPUs
    Consider using cloud services like AWS, GCP, or Azure for access to powerful GPUs if needed.

License

The model is distributed under the Apache License 2.0.
