rinna/japanese-clip-vit-b-16
Introduction
The Japanese CLIP ViT-B-16 model, developed by rinna Co., Ltd., is a Contrastive Language-Image Pre-Training (CLIP) model for Japanese text and images. It pairs a ViT-B/16 Transformer image encoder with a 12-layer BERT text encoder.
Architecture
The model's architecture consists of:
- Image Encoder: ViT-B/16 Transformer, initialized from the AugReg vit-base-patch16-224 model.
- Text Encoder: 12-layer BERT model.
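For intuition about what the "B/16" in the image encoder's name implies, the sketch below works out the patch grid for a 224x224 input. The input resolution and hidden size are assumptions based on the standard ViT-Base/16 configuration (and the AugReg vit-base-patch16-224 initialization), not values stated elsewhere in this card.

# Rough arithmetic for a ViT-B/16 image encoder
# (assumed 224x224 input, 16x16 pixel patches, 768-dim hidden size as in standard ViT-Base).
image_size = 224                                   # assumed input resolution
patch_size = 16                                    # the "/16" in ViT-B/16
patches_per_side = image_size // patch_size        # 14
num_patches = patches_per_side ** 2                # 196 patch tokens
seq_len = num_patches + 1                          # +1 for the [CLS] token
print(f"{num_patches} patches -> sequence length {seq_len}")  # 196 -> 197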
Training
The model was trained on the CC12M dataset with its captions translated into Japanese, giving it Japanese image-text pairs for contrastive pre-training. A generic sketch of this kind of objective follows below.
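The snippet below sketches the standard CLIP-style symmetric contrastive objective on a batch of paired image/text embeddings. It is a generic illustration of contrastive language-image pre-training, not rinna's actual training code; the batch size, embedding dimension, and temperature are made up for the example.

import torch
import torch.nn.functional as F

# Toy batch: N image embeddings paired with N text embeddings (sizes are illustrative).
N, D = 8, 512
image_emb = F.normalize(torch.randn(N, D), dim=-1)   # unit-normalized image features
text_emb = F.normalize(torch.randn(N, D), dim=-1)    # unit-normalized text features

temperature = 0.07                                    # assumed value; CLIP-style models learn this
logits = image_emb @ text_emb.T / temperature         # N x N similarity matrix
targets = torch.arange(N)                             # matching pairs lie on the diagonal

# Symmetric cross-entropy: image->text and text->image directions averaged.
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())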
Guide: Running Locally
- Install Package
$ pip install git+https://github.com/rinnakk/japanese-clip.git
- Run the Model
import io
import requests
from PIL import Image
import torch
import japanese_clip as ja_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model, its image preprocessing pipeline, and the Japanese tokenizer.
model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", cache_dir="/tmp/japanese_clip", device=device)
tokenizer = ja_clip.load_tokenizer()

# Download a sample image and prepare the image and candidate labels.
img = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = preprocess(img).unsqueeze(0).to(device)
encodings = ja_clip.tokenize(
    texts=["犬", "猫", "象"],  # dog, cat, elephant
    max_seq_len=77,
    device=device,
    tokenizer=tokenizer,
)

# Encode both modalities and compare them in the shared embedding space.
with torch.no_grad():
    image_features = model.get_image_features(image)
    text_features = model.get_text_features(**encodings)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
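If you prefer labels phrased as short sentences rather than single words, the same API can be reused with a simple prompt template. This continues from the snippet above (it assumes model, tokenizer, preprocess, device, and image are already defined); the "〜の写真" template is just an illustrative choice, not a template recommended by the model card.

# Zero-shot classification with sentence-style prompts (continues from the snippet above).
labels = ["犬", "猫", "象"]                        # dog, cat, elephant
prompts = [f"{label}の写真" for label in labels]    # "a photo of a ..." style template (illustrative)

encodings = ja_clip.tokenize(
    texts=prompts,
    max_seq_len=77,
    device=device,
    tokenizer=tokenizer,
)

with torch.no_grad():
    image_features = model.get_image_features(image)
    text_features = model.get_text_features(**encodings)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(label, round(p, 3))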
- Cloud GPUs
Consider using cloud services like AWS, GCP, or Azure for access to powerful GPUs if needed.
License
The model is distributed under the Apache 2.0 License. For more details, see the Apache License, Version 2.0.