MambaVision-S-1K

nvidia

Introduction

MambaVision is a hybrid computer-vision backbone that combines Mamba and Transformer blocks. It adapts the Mamba formulation for efficient visual feature modeling and adds Vision Transformer self-attention to better capture long-range spatial dependencies. Its authors report a state-of-the-art (SOTA) trade-off between Top-1 accuracy and throughput.

Architecture

MambaVision uses a hierarchical architecture that draws on the strengths of both Mamba and Transformer models. Self-attention blocks are placed in the final layers, where they capture long-range spatial dependencies. The model can be loaded in two ways, as an image classifier or as a feature extractor, depending on what the downstream task needs.
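
As a rough illustration of "self-attention in the final layers", the sketch below lists the block type at each position of a single stage. The helper `stage_layout`, the depth of 8, and the even split between block types are assumptions chosen for illustration; they are not taken from the description above.

```python
def stage_layout(depth, attention_blocks):
    """List the block type at each position of one stage.

    Hypothetical helper: Mamba-style mixer blocks come first, and the
    last `attention_blocks` positions use self-attention.
    """
    if not 0 <= attention_blocks <= depth:
        raise ValueError("attention_blocks must be between 0 and depth")
    mixers = depth - attention_blocks
    return ["mamba"] * mixers + ["attention"] * attention_blocks


# Illustrative stage of depth 8 with the final half given to self-attention.
layout = stage_layout(depth=8, attention_blocks=4)
print(layout)
```

Placing attention last means the global mixing step operates on features that the cheaper sequence-mixing blocks have already refined.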

Training

The MambaVision-S-1K checkpoint is trained on the ImageNet-1K dataset (ILSVRC/imagenet-1k). The architecture accepts flexible input resolutions. The way Vision Transformer blocks are integrated with Mamba was chosen through comprehensive ablation studies, with the aim of balancing accuracy and computational efficiency.
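
To make the flexible-resolution point concrete, the sketch below computes the spatial size of each stage's feature map for a given input size. The strides (4, 8, 16, 32) are an assumption based on the layout common to four-stage hierarchical backbones, and `stage_resolutions` is a hypothetical helper; neither is stated above.

```python
def stage_resolutions(height, width, strides=(4, 8, 16, 32)):
    """Spatial size of each stage's feature map for a given input size.

    The default strides are an assumption (typical for four-stage
    hierarchical backbones), not values from the model description.
    """
    sizes = []
    for s in strides:
        # Ceiling division: assumes padded downsampling, so an input that is
        # not a multiple of the stride still yields a final partial cell.
        sizes.append((-(-height // s), -(-width // s)))
    return sizes


for side in (224, 512):
    print(side, stage_resolutions(side, side))
```

Under these assumptions a larger input changes only the spatial size at each stage, not the number of stages.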

Guide: Running Locally

To use MambaVision for image classification or feature extraction, follow these steps:

  1. Install the required package:

    pip install mambavision
    
  2. Set up the model for image classification:

    from transformers import AutoModelForImageClassification
    model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-S-1K", trust_remote_code=True)
    
  3. Prepare the image and perform inference:

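    from PIL import Image
    from timm.data.transforms_factory import create_transform
    import requests
    import torch
    
    # The inputs below are CUDA tensors, so move the model to the GPU and use eval mode.
    model.cuda().eval()
    
    # Download a sample image and force three channels.
    url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
    image = Image.open(requests.get(url, stream=True).raw).convert('RGB')
    # Evaluation transform for a 3x224x224 input, plus a batch dimension.
    transform = create_transform(input_size=(3, 224, 224), is_training=False)
    inputs = transform(image).unsqueeze(0).cuda()
    with torch.no_grad():  # inference only, no gradient tracking
        outputs = model(inputs)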
    
  4. Optional - Feature Extraction:

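    from transformers import AutoModel
    model = AutoModel.from_pretrained("nvidia/MambaVision-S-1K", trust_remote_code=True)
    model.cuda().eval()  # match the device of the CUDA inputs from step 3
    # Returns the average-pooled features and the intermediate feature maps.
    out_avg_pool, features = model(inputs)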
    
  5. Hardware Recommendation: The code above calls .cuda(), so it needs a CUDA-capable NVIDIA GPU. If none is available locally, consider a cloud GPU service such as AWS EC2 with NVIDIA GPUs, Google Cloud Platform, or Azure.
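
Step 3 stops at the raw model output. The sketch below shows one way to turn a vector of class logits into ranked predictions. It is written in plain Python so it can be read without a GPU. The helper `top_k_from_logits` is hypothetical, and the example logits and labels are made up for illustration.

```python
import math


def top_k_from_logits(logits, k=5, id2label=None):
    """Return the k most likely (label, probability) pairs from raw logits."""
    if not logits:
        raise ValueError("logits must not be empty")
    # Numerically stable softmax: subtract the max before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices by descending probability and keep the first k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Fall back to the bare class index when no label mapping is supplied.
    return [((id2label or {}).get(i, i), probs[i]) for i in order]


# Made-up logits for a 4-class problem, standing in for real model output.
example = top_k_from_logits([2.0, 0.5, 3.0, -1.0], k=2, id2label={0: "cat", 2: "dog"})
print(example)
```

With the classification model from step 2, the call would likely be `top_k_from_logits(outputs["logits"][0].tolist(), id2label=model.config.id2label)`. This assumes the output exposes a `logits` entry and the config carries an `id2label` mapping, as is conventional for Hugging Face image classifiers; neither is confirmed by the steps above.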

License

MambaVision is distributed under the NVIDIA Source Code License-NC. More details can be found in the license documentation.
