ViT-SO400M-14-SigLIP-384

Introduction
The ViT-SO400M-14-SigLIP-384 model is a SigLIP (Sigmoid loss for Language-Image Pre-training) model designed for zero-shot image classification tasks. It has been converted to PyTorch from the original JAX checkpoints in Big Vision, making it compatible with both OpenCLIP (image + text) and timm (image only) libraries.
Architecture
- Model Type: Contrastive Image-Text, Zero-Shot Image Classification.
- Dataset: Trained on WebLI.
- Libraries: OpenCLIP and timm.
- Paper: Sigmoid Loss for Language Image Pre-Training (arXiv:2303.15343).
Training
The model is pre-trained with a sigmoid loss on image-text pairs. Unlike the softmax-based contrastive loss used in CLIP, the sigmoid loss treats every image-text pair as an independent binary classification problem, removing the need for batch-wide normalization while still learning well-aligned image and text features for zero-shot classification. A sketch of the loss follows.
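As a concrete illustration, here is a minimal sketch of the pairwise sigmoid loss from the paper, written in plain PyTorch. The function name and the `logit_scale`/`logit_bias` parameters are illustrative, not this repository's API:

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_features, text_features, logit_scale, logit_bias):
    # L2-normalize embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Pairwise logits for every image-text combination in the batch.
    logits = image_features @ text_features.T * logit_scale + logit_bias
    # Matching pairs (the diagonal) are positives (+1); all others are negatives (-1).
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # Each pair is an independent binary problem: no batch-wide softmax needed.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```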
Guide: Running Locally
To run the model locally, you can follow these steps:
- Install Required Libraries:

```bash
pip install torch torchvision timm open_clip_torch
```
- Load Model with OpenCLIP:

```python
import torch
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-SO400M-14-SigLIP-384')
tokenizer = get_tokenizer('hf-hub:timm/ViT-SO400M-14-SigLIP-384')

image = Image.open(urlopen('IMAGE_URL'))
image = preprocess(image).unsqueeze(0)

labels_list = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(labels_list, context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
```
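To turn these features into zero-shot label probabilities, normalize them and apply SigLIP's sigmoid scoring. This sketch assumes the `logit_scale` and `logit_bias` attributes that OpenCLIP exposes on SigLIP models:

```python
import torch.nn.functional as F

# Cosine-normalize both embeddings, then score each image-text pair with
# SigLIP's scaled, biased sigmoid rather than a softmax over the labels.
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)
text_probs = torch.sigmoid(
    image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias
)
print(list(zip(labels_list, text_probs[0].tolist())))
```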
- Running with timm (image encoder only):

```python
import timm
import torch
from urllib.request import urlopen
from PIL import Image

image = Image.open(urlopen('IMAGE_URL'))

# num_classes=0 removes the classifier head, leaving a feature extractor
model = timm.create_model('vit_so400m_patch14_siglip_384', pretrained=True, num_classes=0)
model = model.eval()

# Model-specific transforms (resize, interpolation, normalization)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

with torch.no_grad():
    output = model(transforms(image).unsqueeze(0))
```
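With `num_classes=0`, the forward pass returns a pooled image embedding rather than classification logits. If you need unpooled patch tokens instead, timm's standard `forward_features` method provides them (a quick sketch; the 1152 width is the SO400M default as far as we know):

```python
print(output.shape)  # pooled embedding, e.g. torch.Size([1, 1152]) for this SO400M model

# Unpooled patch tokens, useful for dense/spatial downstream tasks:
with torch.no_grad():
    tokens = model.forward_features(transforms(image).unsqueeze(0))
```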
- Use a Cloud GPU: For better performance, consider GPU instances from cloud providers such as AWS EC2, Google Cloud, or Azure; moving the model and inputs onto the GPU is sketched below.
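If a GPU is available, place the model and inputs on it before running inference. This is standard PyTorch device handling, shown here for the timm model above:

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

with torch.no_grad():
    output = model(transforms(image).unsqueeze(0).to(device))
```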
License
The ViT-SO400M-14-SigLIP-384 model is licensed under the Apache-2.0 License.