siglip-so400m-patch14-384

google

Introduction

SigLIP is a CLIP-style image-text model developed by Google and pre-trained on the WebLI dataset. It replaces CLIP's softmax contrastive loss with a sigmoid loss for language-image pre-training, which operates directly on image-text pairs and scales well across batch sizes. This checkpoint additionally uses a shape-optimized vision backbone (SoViT-400m) and is suitable for tasks such as zero-shot image classification and image-text retrieval.

Architecture

This checkpoint uses the SoViT-400m vision backbone, a shape-optimized design derived from scaling laws for compute-optimal model shapes, processing images at 384x384 resolution with 14x14 patches. The overall setup keeps CLIP's dual-encoder design (an image encoder and a text encoder) but replaces the softmax contrastive loss with a sigmoid loss, which needs no batch-wide normalization over all pairs and behaves well at both small and large batch sizes.
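
As a rough illustration of how this differs from CLIP's contrastive loss, the sketch below treats every image-text pair in a batch as an independent binary classification problem (positive on the diagonal, negative elsewhere), following the pseudocode in the SigLIP paper; the function name and the learnable scalars t (temperature) and b (bias) are illustrative, not the library's API.

    import torch
    import torch.nn.functional as F

    def sigmoid_loss(img_emb, txt_emb, t, b):
        # img_emb, txt_emb: (n, d) L2-normalized image/text embeddings.
        # t, b: learnable scalar temperature and bias.
        n = img_emb.shape[0]
        logits = img_emb @ txt_emb.t() * t + b                # (n, n) pairwise similarities
        labels = 2 * torch.eye(n, device=logits.device) - 1   # +1 for matched pairs, -1 otherwise
        # Each pair is scored independently with a sigmoid, so no batch-wide softmax is needed.
        return -F.logsigmoid(labels * logits).sum() / n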

Training

  • Training Data: Pre-trained on the WebLI dataset, providing a robust foundation for image-text tasks.
  • Preprocessing: Images are resized to 384x384 and normalized with a mean and standard deviation of 0.5 across the RGB channels; texts are tokenized and padded to a fixed length of 64 tokens (a hand-rolled equivalent is sketched after this list).
  • Compute: Trained on 16 TPU-v4 chips over three days.
  • Evaluation: Outperforms CLIP in certain benchmarks, as indicated in the evaluation table in the source documentation.
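
For reference, here is a minimal sketch of the documented preprocessing done by hand; in practice, the processor used in the guide below applies the same steps automatically. The image path and example text are placeholders, and torchvision is assumed to be available.

    import torch
    from PIL import Image
    from torchvision import transforms
    from transformers import AutoTokenizer

    # Image preprocessing: resize to 384x384, scale to [0, 1], normalize with mean=std=0.5 per channel.
    preprocess = transforms.Compose([
        transforms.Resize((384, 384)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])
    image = Image.open("example.jpg").convert("RGB")      # placeholder local image
    pixel_values = preprocess(image).unsqueeze(0)         # shape (1, 3, 384, 384)

    # Text preprocessing: tokenize and pad to a fixed length of 64 tokens.
    tokenizer = AutoTokenizer.from_pretrained("google/siglip-so400m-patch14-384")
    text_inputs = tokenizer(["a photo of a cat"], padding="max_length",
                            max_length=64, return_tensors="pt")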

Guide: Running Locally

To run SIGLIP locally:

  1. Install Dependencies: Ensure you have Python and the transformers library installed.
  2. Load Model and Processor:
    from transformers import AutoProcessor, AutoModel

    # Load the pretrained checkpoint and its matching processor.
    model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
    processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")
    
  3. Prepare Input: Provide an image (for example via a URL) and the candidate text labels.
  4. Inference: Run the model on the processed inputs and apply a sigmoid to the logits to obtain per-label probabilities (see the example after this list).
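
A minimal end-to-end example adapted from the model card's usage snippet; the image URL (a COCO sample) and the candidate labels are placeholders you can replace with your own data, and torch, Pillow, and requests are assumed to be installed alongside transformers.

    import requests
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModel

    model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
    processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

    # Placeholder image and candidate labels.
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    texts = ["a photo of 2 cats", "a photo of 2 dogs"]

    # The processor resizes/normalizes the image and pads the text to 64 tokens.
    inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # SigLIP scores each label independently with a sigmoid rather than a softmax.
    probs = torch.sigmoid(outputs.logits_per_image)
    print({label: round(p.item(), 3) for label, p in zip(texts, probs[0])})

Because each image-text pair is scored independently, the resulting probabilities do not sum to 1 across labels.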

For larger workloads, cloud GPUs on platforms such as Google Cloud or AWS can significantly speed up inference and training.

License

SIGLIP is released under the Apache 2.0 License, allowing for both commercial and non-commercial use with attribution.
