swin-base-patch4-window7-224-in22k

microsoft

Introduction

The Swin Transformer model is a base-sized Vision Transformer pre-trained on the ImageNet-21k dataset, which consists of 14 million images across 21,841 classes. The model was introduced in the paper "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" by Liu et al. and is designed to serve as a general-purpose backbone for image classification and dense recognition tasks.

Architecture

The Swin Transformer builds hierarchical feature maps by merging image patches in deeper layers. Because self-attention is computed within local, non-overlapping windows (shifted between consecutive layers to allow cross-window connections), its computational complexity is linear in input image size. This contrasts with earlier vision Transformers, which compute global self-attention, incur quadratic complexity, and produce a single low-resolution feature map.
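The linear-versus-quadratic claim can be illustrated with simple token-count arithmetic. The sketch below is illustrative only (it counts pairwise token interactions, not official FLOP figures); the 56×56 token grid follows from a 224×224 input with 4×4 patches, and `m=7` is the window size in the model name.

```python
def global_attn_cost(h, w):
    """Pairwise token interactions for global self-attention over h*w tokens."""
    n = h * w
    return n * n  # every token attends to every other token

def window_attn_cost(h, w, m=7):
    """Pairwise interactions when attention is restricted to m x m windows."""
    n = h * w
    return (m * m) * n  # each token attends only within its window

# 224x224 input with 4x4 patches -> 56x56 tokens at the first stage
g = global_attn_cost(56, 56)
win = window_attn_cost(56, 56, m=7)
print(g // win)  # -> 64: window attention is ~64x cheaper at this resolution
```

Doubling the image side quadruples both costs for window attention (linear in token count) but multiplies the global cost by sixteen, which is why windowed attention scales to dense prediction tasks.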

Training

The model is pre-trained on the extensive ImageNet-21k dataset, allowing it to recognize a wide variety of image classes. It forms part of a series of models aimed at enhancing image classification performance through hierarchical vision structures.

Guide: Running Locally

To use this model for image classification, follow these steps:

  1. Install Dependencies: Ensure you have the Hugging Face transformers library installed, along with PyTorch, Pillow, and requests (used below to fetch a sample image). Use pip install transformers torch pillow requests if needed.

  2. Load the Model and Processor:

    from transformers import AutoImageProcessor, SwinForImageClassification
    from PIL import Image
    import requests
    
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    processor = AutoImageProcessor.from_pretrained("microsoft/swin-base-patch4-window7-224-in22k")
    model = SwinForImageClassification.from_pretrained("microsoft/swin-base-patch4-window7-224-in22k")
    
  3. Prepare the Input and Run Inference:

    import torch

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    logits = outputs.logits
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])
    
  4. Cloud GPUs: For enhanced performance, especially with large models, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
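Beyond the single top prediction in the steps above, the raw logits can be converted to class probabilities and ranked. A minimal plain-Python sketch (assuming `logits` has been flattened to a list of floats, e.g. via `logits[0].tolist()`; the `softmax` and `top_k` helpers here are illustrative, not part of the transformers API):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(scores, k=5):
    """Indices of the k highest-scoring classes, best first."""
    probs = softmax(scores)
    return sorted(range(len(probs)), key=lambda i: -probs[i])[:k]

# Toy scores standing in for model logits
scores = [2.0, 0.5, 3.1, -1.0, 0.0]
print(top_k(scores, k=3))  # -> [2, 0, 1]
```

In practice the indices returned by `top_k` would be mapped through `model.config.id2label` to human-readable class names.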

License

The Swin Transformer model is released under the Apache 2.0 license, which allows for both commercial and non-commercial use with proper attribution.
