Japanese CLIP ViT-B-16

rinna

Introduction

The Japanese CLIP ViT-B-16 model, developed by rinna Co., Ltd., is a Contrastive Language-Image Pre-Training (CLIP) model that maps Japanese text and images into a shared embedding space. It uses a ViT-B/16 Transformer as the image encoder and a 12-layer BERT as the text encoder.

Architecture

The model's architecture consists of:

  • Image Encoder: ViT-B/16 Transformer, initialized from the AugReg vit-base-patch16-224 model.
  • Text Encoder: 12-layer BERT model.
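
Both encoders project into a shared embedding space, which is what allows images and Japanese text to be compared directly. As a quick sketch (reusing the japanese_clip calls shown in the guide below, and assuming the package and model weights are available), the following prints the shapes of one image embedding and one text embedding:

    import torch
    from PIL import Image
    import japanese_clip as ja_clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", device=device)
    tokenizer = ja_clip.load_tokenizer()

    # A blank RGB image is enough to inspect output shapes.
    image = preprocess(Image.new("RGB", (224, 224))).unsqueeze(0).to(device)
    encodings = ja_clip.tokenize(texts=["犬"], max_seq_len=77, device=device, tokenizer=tokenizer)

    with torch.no_grad():
        image_emb = model.get_image_features(image)
        text_emb = model.get_text_features(**encodings)

    # The two embeddings share the same dimensionality, so cosine similarity is well defined.
    print(image_emb.shape, text_emb.shape)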

Training

The model was trained on the CC12M dataset with its captions translated into Japanese, so that images and their Japanese captions are aligned in the shared embedding space.
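
rinna's exact training code is not reproduced here, but contrastive pre-training of this kind typically optimizes a symmetric CLIP-style loss over batches of paired image and text embeddings. The sketch below illustrates that objective; the function name and the temperature value of 0.07 are assumptions made for this example.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_features, text_features, temperature=0.07):
        # Normalize so that dot products become cosine similarities.
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        # logits[i][j] scores image i against caption j across the whole batch.
        logits = image_features @ text_features.T / temperature

        # Matching image-caption pairs sit on the diagonal; each image is contrasted
        # against every caption in the batch, and vice versa.
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_image_to_text = F.cross_entropy(logits, targets)
        loss_text_to_image = F.cross_entropy(logits.T, targets)
        return (loss_image_to_text + loss_text_to_image) / 2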

Guide: Running Locally

  1. Install Package

    $ pip install git+https://github.com/rinnakk/japanese-clip.git
    
  2. Run the Model

    import io
    import requests
    from PIL import Image
    import torch
    import japanese_clip as ja_clip

    # Use a GPU if one is available; the model also runs on CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Download the model weights (cached under cache_dir) and the matching image preprocessor.
    model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", cache_dir="/tmp/japanese_clip", device=device)
    tokenizer = ja_clip.load_tokenizer()

    # Fetch an example photo and prepare it as a batch of one image tensor.
    response = requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260')
    img = Image.open(io.BytesIO(response.content))
    image = preprocess(img).unsqueeze(0).to(device)

    # Tokenize the Japanese candidate labels: "dog", "cat", "elephant".
    encodings = ja_clip.tokenize(
        texts=["犬", "猫", "象"],
        max_seq_len=77,
        device=device,
        tokenizer=tokenizer,
    )

    with torch.no_grad():
        # Embed the image and the candidate texts, then turn their similarities into probabilities.
        image_features = model.get_image_features(image)
        text_features = model.get_text_features(**encodings)
        text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    print("Label probs:", text_probs)
    
  3. Cloud GPUs
    Consider using cloud services like AWS, GCP, or Azure for access to powerful GPUs if needed.

License

The model is distributed under the Apache License 2.0.
