microsoft/swinv2-tiny-patch4-window8-256

Introduction

The Swin Transformer V2 (tiny-sized model) is pre-trained on the ImageNet-1k dataset at a resolution of 256x256 for image classification. Introduced in the paper "Swin Transformer V2: Scaling Up Capacity and Resolution" by Liu et al., it serves as a general-purpose backbone for both image classification and dense recognition tasks such as object detection and semantic segmentation.

Architecture

Swin Transformer V2 constructs hierarchical feature maps by merging image patches in deeper layers, and it computes self-attention only within each local window, so its computational complexity is linear in the input image size. Compared to the original Swin Transformer, V2 introduces three key improvements:

  1. A residual-post-norm method combined with cosine attention, which improves training stability.
  2. A log-spaced continuous position bias, which lets models pre-trained at low resolution transfer effectively to tasks with high-resolution inputs.
  3. SimMIM, a self-supervised pre-training method that reduces the need for vast amounts of labeled images.
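
To see the hierarchical feature maps concretely, the sketch below (an illustration added here, not part of the original card) runs the backbone through the transformers Swinv2Model class with output_hidden_states=True and prints the shape at each stage; each patch-merging step quarters the number of tokens while the channel width doubles:

    import torch
    from transformers import AutoImageProcessor, Swinv2Model
    from PIL import Image
    import requests
    
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    processor = AutoImageProcessor.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")
    model = Swinv2Model.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")
    
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    
    # At 256x256 input with 4x4 patches, the token grid starts at 64x64 = 4096 tokens
    # with width 96, then shrinks 4096 -> 1024 -> 256 -> 64 as the width doubles.
    for i, h in enumerate(outputs.hidden_states):
        print(f"hidden state {i}: {tuple(h.shape)}")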

[Figure: Swin Transformer architecture]

Training

Swin Transformer V2 uses SimMIM, a self-supervised masked-image-modeling pre-training method, which allows it to perform well with fewer labeled images during training. The model is pre-trained on low-resolution images and can then be transferred effectively to downstream tasks with high-resolution inputs.
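
As a brief, hedged sketch of such a transfer (the 10-class label count below is a placeholder, not part of the original card), the pre-trained backbone can be reused for a new classification task by replacing the ImageNet-1k head:

    from transformers import AutoModelForImageClassification
    
    # Hypothetical downstream task with 10 classes: the pre-trained 1000-way
    # ImageNet head is dropped and a freshly initialized head is attached.
    model = AutoModelForImageClassification.from_pretrained(
        "microsoft/swinv2-tiny-patch4-window8-256",
        num_labels=10,                 # placeholder label count for the new task
        ignore_mismatched_sizes=True,  # allow replacing the classification head
    )
    # The backbone weights transfer as-is; only the new head starts from scratch.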

Guide: Running Locally

To use the Swin Transformer V2 model for image classification on a local machine, follow these steps:

  1. Install the required libraries (the code below also uses PyTorch, Pillow, and requests):

    pip install transformers torch pillow requests
    
  2. Use the following Python code to classify an image (an optional top-5 extension follows this list):

    from transformers import AutoImageProcessor, AutoModelForImageClassification
    from PIL import Image
    import requests
    
    # Load a sample image (two cats on a couch) from the COCO validation set
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    # Load the image processor and the ImageNet-1k classification model
    processor = AutoImageProcessor.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")
    model = AutoModelForImageClassification.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")
    
    # Preprocess the image (resize to 256x256, normalize) and run a forward pass
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits
    
    # The predicted class is the index with the highest logit over the 1,000 ImageNet classes
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])
    
  3. For optimal performance, consider running on a GPU, for example via cloud platforms such as AWS, Google Cloud, or Azure (see the device-placement sketch after this list).
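
As an optional extension to step 2, the top-5 predictions can be inspected instead of only the argmax. This is a minimal sketch using plain PyTorch (already required by the model) and the logits and model variables from the code above:

    import torch
    
    # Convert logits to probabilities and take the five most likely ImageNet classes
    probs = logits.softmax(-1)
    top5 = torch.topk(probs, k=5)
    for p, idx in zip(top5.values[0], top5.indices[0]):
        print(f"{model.config.id2label[idx.item()]}: {p.item():.3f}")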
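
For step 3, moving the model and inputs onto a GPU follows the standard PyTorch device-placement pattern. A minimal sketch, assuming a CUDA-capable device is available:

    import torch
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    
    # Move the preprocessed tensors to the same device before the forward pass
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)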

License

This model is released under the Apache 2.0 license, which allows for both personal and commercial use with proper attribution.
