ViT-SO400M-14-SigLIP-384

Introduction

The ViT-SO400M-14-SigLIP-384 model is a SigLIP (Sigmoid Loss for Language-Image Pre-training) model for zero-shot image classification. The weights were converted to PyTorch from the original JAX checkpoints in Big Vision, making the model usable with both the OpenCLIP (image + text) and timm (image only) libraries.

Architecture

  • Model Type: Contrastive Image-Text, Zero-Shot Image Classification.
  • Dataset: Trained on WebLI.
  • Libraries: OpenCLIP and timm.
  • Papers: "Sigmoid Loss for Language Image Pre-Training" (arXiv:2303.15343).

Training

The model is pre-trained with a sigmoid loss for language-image pre-training. Unlike the softmax-based contrastive loss used in CLIP, the sigmoid loss scores each image-text pair independently as a binary match/no-match decision, so it needs no normalization across the whole batch while still learning to align image and text features, which underpins its strong zero-shot classification performance.
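
As a rough sketch of the idea (illustrative only, not the actual Big Vision training code; the function name siglip_loss and its argument names are hypothetical), the loss treats every image-text pair in a batch as an independent binary classification problem:

    import torch
    import torch.nn.functional as F
    
    def siglip_loss(image_features, text_features, logit_scale, logit_bias):
        # Cosine similarity between every image and every text in the batch,
        # scaled and shifted by learned scalar parameters.
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)
        logits = image_features @ text_features.T * logit_scale + logit_bias
    
        # Labels: +1 on the diagonal (matching pairs), -1 everywhere else.
        n = logits.size(0)
        labels = 2 * torch.eye(n, device=logits.device) - 1
    
        # Independent binary loss per pair; no batch-wide softmax normalization.
        return -F.logsigmoid(labels * logits).sum() / n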

Guide: Running Locally

To run the model locally, you can follow these steps:

  1. Install Required Libraries:

    pip install torch torchvision timm open_clip_torch
    
  2. Load Model with OpenCLIP:

    import torch
    from urllib.request import urlopen
    from PIL import Image
    from open_clip import create_model_from_pretrained, get_tokenizer
    
    # Load the model and its matching preprocessing from the Hugging Face Hub
    model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-SO400M-14-SigLIP-384')
    tokenizer = get_tokenizer('hf-hub:timm/ViT-SO400M-14-SigLIP-384')
    
    image = Image.open(urlopen('IMAGE_URL'))
    image = preprocess(image).unsqueeze(0)  # add a batch dimension
    
    labels_list = ["a dog", "a cat", "a donut", "a beignet"]
    text = tokenizer(labels_list, context_length=model.context_length)
    
    with torch.no_grad(), torch.cuda.amp.autocast():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
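        # Turn the embeddings into per-label probabilities. SigLIP applies a
        # sigmoid to the scaled, shifted cosine similarities; logit_scale and
        # logit_bias are learned parameters exposed on OpenCLIP's SigLIP models.
        image_features = torch.nn.functional.normalize(image_features, dim=-1)
        text_features = torch.nn.functional.normalize(text_features, dim=-1)
        text_probs = torch.sigmoid(image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias)
    
    print(list(zip(labels_list, text_probs[0].tolist())))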
    
  3. Run with timm (Image Only):

    import timm
    from urllib.request import urlopen
    from PIL import Image
    
    image = Image.open(urlopen('IMAGE_URL'))
    
    # num_classes=0 removes the classifier head, so the model returns embeddings
    model = timm.create_model('vit_so400m_patch14_siglip_384', pretrained=True, num_classes=0)
    model = model.eval()
    
    # Model-specific transforms (resize, crop, normalization)
    data_config = timm.data.resolve_model_data_config(model)
    transforms = timm.data.create_transform(**data_config, is_training=False)
    
    output = model(transforms(image).unsqueeze(0))  # (batch_size, num_features) tensor
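    # Equivalently, the pooled embedding can be obtained in two explicit steps
    # using timm's standard forward_features/forward_head split (this path does
    # not require num_classes=0):
    output = model.forward_features(transforms(image).unsqueeze(0))  # unpooled tokens
    output = model.forward_head(output, pre_logits=True)  # pooled (batch_size, num_features)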
    
  4. Use a Cloud GPU: For better performance, consider using cloud services like AWS EC2, Google Cloud, or Azure for GPU resources.

License

The ViT-SO400M-14-SigLIP-384 model is licensed under the Apache-2.0 License.
