microsoft/swinv2-large-patch4-window12to24-192to384-22kto1k-ft
Introduction
The Swin Transformer V2 is a vision transformer model developed by Microsoft, pre-trained on the ImageNet-21k dataset and fine-tuned on ImageNet-1k at a resolution of 384x384. It introduces improvements for more efficient and scalable image processing, designed to be a general-purpose backbone for tasks such as image classification and dense recognition.
Architecture
The Swin Transformer V2 model constructs hierarchical feature maps by merging image patches in deeper layers. Unlike earlier vision transformers, which compute self-attention globally, it computes self-attention within local windows, achieving linear computational complexity with respect to input image size. Key advancements include:
- Residual post-norm combined with scaled cosine attention to improve training stability.
- Log-spaced continuous position bias for transferring pre-trained models to high-resolution tasks.
- SimMIM self-supervised pre-training to minimize the need for large labeled datasets.
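Two of the advancements above can be sketched in a few lines of NumPy. This is an illustrative simplification (single attention head, fixed temperature `tau` instead of the model's learnable per-head scalar), not the model's actual implementation:

```python
import numpy as np

def scaled_cosine_attention(q, k, v, tau=0.1):
    """Swin V2-style attention: cosine similarity divided by a temperature.

    Normalizing q and k bounds each attention logit to [-1/tau, 1/tau],
    which is what stabilizes training at large model sizes. tau is
    learnable in the real model; here it is a fixed constant.
    """
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = qn @ kn.T / tau
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over keys
    return w @ v

def log_spaced_coords(delta):
    """Log-spaced relative coordinates: sign(d) * log(1 + |d|).

    Compressing large offsets logarithmically is what lets a position
    bias learned at window size 12 extrapolate to window size 24.
    """
    return np.sign(delta) * np.log1p(np.abs(delta))

# Toy example: 4 tokens with 8-dimensional features.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 4, 8))
out = scaled_cosine_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Because the logits are bounded before the softmax, no single token pair can dominate the attention map, which is the failure mode Swin V2 observed when scaling the original dot-product attention.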
Training
The model was initially trained on the large-scale ImageNet-21k dataset and then fine-tuned on the ImageNet-1k dataset. This training approach ensures that the model is versatile and robust for various image classification tasks.
Guide: Running Locally
To use the Swin Transformer V2 model for image classification, follow these steps:
- Install the Transformers library: ensure the transformers library is installed in your Python environment. You can install it with:
pip install transformers
- Load the model and processor:
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("microsoft/swinv2-large-patch4-window12to24-192to384-22kto1k-ft")
model = AutoModelForImageClassification.from_pretrained("microsoft/swinv2-large-patch4-window12to24-192to384-22kto1k-ft")
- Process the image and make predictions:
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
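If you want ranked class probabilities rather than a single argmax, apply a softmax to the logits. A minimal NumPy sketch on a dummy logits vector (standing in for `outputs.logits[0]`, which has 1,000 ImageNet-1k entries):

```python
import numpy as np

# Dummy logits standing in for outputs.logits[0] (1000 ImageNet-1k classes).
rng = np.random.default_rng(0)
logits = rng.normal(size=1000)

# Numerically stable softmax: subtract the max before exponentiating.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Indices of the five most probable classes, most probable first; with the
# real model, map each index through model.config.id2label for a label name.
top5 = np.argsort(probs)[::-1][:5]
print(top5)
```

With the real model output, replace the dummy `logits` with `outputs.logits[0].detach().numpy()`.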
For accelerated computing, consider using cloud services with GPU support such as AWS, Google Cloud, or Azure.
License
The Swin Transformer V2 model is available under the Apache 2.0 License. This permits use, distribution, and modification under specific terms and conditions.