Introduction

CLIP-RSICD-V2 is a fine-tuned version of OpenAI's CLIP model, optimized for zero-shot image classification, text-to-image retrieval, and image-to-image retrieval on remote sensing imagery. It was developed by the Flax community (flax-community) and released in July 2021.

Architecture

The model uses a ViT-B/32 Transformer as its image encoder and a masked self-attention Transformer as its text encoder. Both encoders are trained to maximize the similarity of matching image-text pairs through a contrastive loss. Checkpoints are available along with reported zero-shot classification metrics.
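
As an illustration, the sketch below shows how the two encoders map a caption and an image into the shared embedding space and score them by cosine similarity. It uses the standard get_text_features and get_image_features helpers of the transformers CLIPModel, and reuses the example image from the guide further down; ranking many images by this score against a single query text is the basis of text-to-image retrieval.

    import torch
    import requests
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("flax-community/clip-rsicd-v2")
    processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd-v2")

    url = "https://raw.githubusercontent.com/arampacha/CLIP-rsicd/master/data/stadium_1.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    text_inputs = processor(text=["a photo of a stadium"], return_tensors="pt", padding=True)
    image_inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        # each encoder projects its input into the shared embedding space
        text_emb = model.get_text_features(**text_inputs)
        image_emb = model.get_image_features(**image_inputs)

    # cosine similarity between the normalized text and image embeddings
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    print((text_emb @ image_emb.T).item())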

Training

Fine-tuning was conducted with a batch size of 1024, the Adafactor optimizer, and a linear learning rate schedule peaking at 1e-4. Training ran on a TPU v3-8, and detailed training logs are accessible on WandB. The training script is available on GitHub for those interested in reproducing the process.
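
The published training script is written in Flax/JAX; purely as an illustration, the sketch below mirrors the reported optimizer settings (Adafactor with a fixed learning rate and a linear schedule peaking at 1e-4) using the PyTorch utilities in transformers. The warmup and total step counts are hypothetical placeholders, not the values of the original run.

    from transformers import Adafactor, CLIPModel, get_linear_schedule_with_warmup

    # fine-tuning starts from the base ViT-B/32 CLIP checkpoint
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

    # Adafactor with a fixed (non-relative) learning rate
    optimizer = Adafactor(
        model.parameters(),
        lr=1e-4,
        scale_parameter=False,
        relative_step=False,
        warmup_init=False,
    )

    # linear warmup to the 1e-4 peak, then linear decay to zero
    num_training_steps = 10_000  # hypothetical; depends on dataset size and epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=500, num_training_steps=num_training_steps
    )

    # inside the training loop (batch size 1024 in the reported run):
    #   loss = model(**batch, return_loss=True).loss   # CLIP contrastive loss
    #   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()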

Guide: Running Locally

  1. Install dependencies: Make sure you have the transformers library installed.
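     A typical install, assuming the PyTorch backend used by the snippets below (Pillow and requests are needed for the image-loading step):
    pip install transformers torch pillow requests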
  2. Load the model and processor:
    from transformers import CLIPProcessor, CLIPModel
    # download the fine-tuned checkpoint and its paired processor from the Hugging Face Hub
    model = CLIPModel.from_pretrained("flax-community/clip-rsicd-v2")
    processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd-v2")
    
  3. Prepare your image and labels:
    from PIL import Image
    import requests
    # example image from the CLIP-rsicd repository
    url = "https://raw.githubusercontent.com/arampacha/CLIP-rsicd/master/data/stadium_1.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    # candidate class names, wrapped in CLIP's "a photo of a ..." prompt template
    labels = ["residential area", "playground", "stadium", "forest", "airport"]
    inputs = processor(text=[f"a photo of a {l}" for l in labels], images=image, return_tensors="pt", padding=True)
    
  4. Run inference:
    # forward pass: logits_per_image contains the image-text similarity scores
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    # softmax over the candidate labels turns the scores into probabilities
    probs = logits_per_image.softmax(dim=1)
    for l, p in zip(labels, probs[0]):
        print(f"{l:<16} {p:.4f}")
    
  5. Cloud GPUs: For intensive tasks, consider using cloud services with GPU support such as Google Colab or AWS.
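
Beyond zero-shot classification, the same embeddings support image-to-image retrieval. A minimal sketch, reusing the model and processor loaded in step 2; the query and candidate file names are placeholders for your own images:

    import torch
    from PIL import Image

    # hypothetical query image and candidate collection; replace with real files
    query = Image.open("query.jpg")
    candidates = [Image.open(f) for f in ["scene_a.jpg", "scene_b.jpg", "scene_c.jpg"]]

    with torch.no_grad():
        q = model.get_image_features(**processor(images=query, return_tensors="pt"))
        c = model.get_image_features(**processor(images=candidates, return_tensors="pt"))

    # rank candidates by cosine similarity to the query
    q = q / q.norm(dim=-1, keepdim=True)
    c = c / c.norm(dim=-1, keepdim=True)
    scores = (q @ c.T).squeeze(0)
    print(scores.argsort(descending=True))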

License

For license information, refer to the Hugging Face model page and the associated repositories, which typically include licensing details in their README or LICENSE file.
