clip-rsicd-v2
Introduction
CLIP-RSICD-V2 is a fine-tuned version of OpenAI's CLIP model, optimized for zero-shot image classification, text-to-image retrieval, and image-to-image retrieval on remote sensing images. It was developed by the flax-community team and released in July 2021.
Architecture
The model uses a ViT-B/32 Vision Transformer as its image encoder and a masked self-attention Transformer as its text encoder. During fine-tuning, both encoders are trained to maximize the similarity of matching (image, text) pairs via a contrastive loss. Checkpoints are published along with their zero-shot classification metrics.
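To make the objective concrete, here is a minimal sketch of the symmetric, CLIP-style contrastive loss. The function name, temperature value, and PyTorch implementation are illustrative assumptions, not the project's actual training code (see the script linked in the Training section):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so dot products become cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # logits[i, j] = similarity between image i and caption j.
    logits = image_embeds @ text_embeds.t() / temperature
    # Matching pairs lie on the diagonal of the batch.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: images -> texts and texts -> images.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```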
Training
Fine-tuning was performed with a batch size of 1024, the Adafactor optimizer, and a linear learning rate schedule peaking at 1e-4, running on a TPU v3-8. Detailed training logs are available on WandB, and the training script is published on GitHub for anyone who wants to reproduce the run.
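For reference, the reported settings map naturally onto Optax, the optimizer library commonly paired with Flax. The sketch below is an assumption about how such a configuration could look; the warmup and total step counts are placeholders, since the card only specifies the peak rate and the schedule shape:

```python
import optax

TOTAL_STEPS = 10_000   # placeholder: not specified on the model card
WARMUP_STEPS = 1_000   # placeholder: not specified on the model card

# Linear warmup to the reported peak of 1e-4, then linear decay to zero.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, 1e-4, WARMUP_STEPS),
        optax.linear_schedule(1e-4, 0.0, TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)
optimizer = optax.adafactor(learning_rate=schedule)
```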
Guide: Running Locally
- Install dependencies: make sure you have the `transformers` library installed, along with `torch`, `Pillow`, and `requests` for the examples below.
- Load the model and processor:

```python
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flax-community/clip-rsicd-v2")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd-v2")
```
- Prepare your image and candidate labels:

```python
from PIL import Image
import requests

# Download an example remote sensing image.
url = "https://raw.githubusercontent.com/arampacha/CLIP-rsicd/master/data/stadium_1.jpg"
image = Image.open(requests.get(url, stream=True).raw)

labels = ["residential area", "playground", "stadium", "forest", "airport"]
inputs = processor(
    text=[f"a photo of a {l}" for l in labels],
    images=image,
    return_tensors="pt",
    padding=True,
)
```
- Run inference:

```python
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # normalize into probabilities
for l, p in zip(labels, probs[0]):
    print(f"{l:<16} {p:.4f}")
```
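- Retrieval (optional): the same checkpoint supports the retrieval tasks mentioned in the introduction. Below is a minimal, illustrative sketch of text-to-image retrieval using the model's embedding helpers; `images` is a hypothetical list of PIL images standing in for your own collection.

```python
import torch

# `images` is a hypothetical list of PIL images to search over.
image_inputs = processor(images=images, return_tensors="pt")
text_inputs = processor(text=["a photo of a stadium"], return_tensors="pt", padding=True)

with torch.no_grad():
    image_embeds = model.get_image_features(**image_inputs)
    text_embeds = model.get_text_features(**text_inputs)

# Cosine similarity between the query and every candidate image.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = (text_embeds @ image_embeds.T).squeeze(0)
best = scores.argmax().item()
print(f"best match: image {best} (score {scores[best]:.4f})")
```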
- Cloud GPUs: For intensive tasks, consider using cloud services with GPU support such as Google Colab or AWS.
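On a GPU-backed instance, inference works the same apart from device placement; a minimal adjustment to the steps above:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)
```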
License
For license information, refer to the Hugging Face model page and the associated repositories; licensing details are typically recorded in their README or LICENSE files.