MambaVision-S-1K
nvidia
Introduction
MambaVision is a hybrid computer-vision backbone that combines Mamba and Transformer architectures. It adapts the Mamba formulation for efficient modeling of visual features and adds Vision Transformer (ViT) self-attention blocks to better capture long-range spatial dependencies. The authors report a state-of-the-art (SOTA) trade-off between Top-1 accuracy and throughput.
Architecture
MambaVision uses a hierarchical architecture that draws on the strengths of both Mamba and Transformer models: self-attention blocks are placed in the final layers to capture long-range spatial dependencies. The model can be loaded in two ways, for image classification or for feature extraction, depending on the task.
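As a rough illustration of the hybrid layout described above, the sketch below lists the block types of a single stage, with self-attention placed at the end. The helper name `stage_layout`, the block labels, and the default of giving the final half of a stage to attention are assumptions made for illustration; the text above only states that self-attention sits in the final layers.

```python
def stage_layout(depth, attention_fraction=0.5):
    """Return the ordered block types of one hypothetical hybrid stage.

    attention_fraction is an illustrative assumption, not a documented
    MambaVision hyperparameter.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if not 0.0 <= attention_fraction <= 1.0:
        raise ValueError("attention_fraction must lie in [0, 1]")
    # Attention blocks occupy the tail of the stage; mixer blocks come first.
    num_attention = int(depth * attention_fraction)
    num_mixer = depth - num_attention
    return ["mamba_mixer"] * num_mixer + ["self_attention"] * num_attention


if __name__ == "__main__":
    for index, block in enumerate(stage_layout(8)):
        print(index, block)
```

Keeping the attention blocks last mirrors the idea that the sequence mixer builds local and mid-range features first, and global attention then relates them across the whole feature map.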
Training
The MambaVision model is trained on the ILSVRC/imagenet-1k dataset, and its architecture accepts flexible input resolutions. The authors also report comprehensive ablation studies on how best to integrate Vision Transformer blocks with Mamba, with the design aimed at balancing accuracy and computational efficiency.
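To make the flexible-resolution point concrete, the following sketch computes the spatial size of each stage's feature map for a given input size. The per-stage downsampling factors (4, 8, 16, 32) are an assumption typical of four-stage hierarchical backbones and are not stated in this document; `feature_map_sizes` is a hypothetical helper, not part of the mambavision package, and it only handles sizes divisible by every stride.

```python
def feature_map_sizes(height, width, strides=(4, 8, 16, 32)):
    """Return (height, width) of the feature map after each stage.

    The default strides are an assumption about a generic four-stage
    hierarchical backbone, not a documented MambaVision property.
    """
    sizes = []
    for stride in strides:
        if height % stride or width % stride:
            raise ValueError(
                f"input {height}x{width} is not divisible by stride {stride}"
            )
        sizes.append((height // stride, width // stride))
    return sizes


if __name__ == "__main__":
    for resolution in (224, 256, 512):
        print(resolution, feature_map_sizes(resolution, resolution))
```

Under these assumptions, the number of spatial positions in every stage grows with the square of the input side length, which is why throughput drops at higher resolutions.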
Guide: Running Locally
To use MambaVision for image classification or feature extraction, follow these steps:
-
Install the required package:
pip install mambavision
-
Set up the model for image classification:
from transformers import AutoModelForImageClassification
# trust_remote_code=True lets transformers run the model code shipped in the checkpoint repository
model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-S-1K", trust_remote_code=True)
-
Prepare the image and perform inference:
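from PIL import Image
from timm.data.transforms_factory import create_transform
import requests
import torch
# Download a sample COCO image; convert to RGB so the tensor always has 3 channels
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert('RGB')
# Evaluation-time preprocessing for a 3 x 224 x 224 input
transform = create_transform(input_size=(3, 224, 224), is_training=False)
# The model must be on the same device as the inputs, and in eval mode for inference
model = model.cuda().eval()
inputs = transform(image).unsqueeze(0).cuda()
# Gradients are not needed for inference
with torch.no_grad():
    outputs = model(inputs)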
-
Optional - Feature Extraction:
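from transformers import AutoModel
model = AutoModel.from_pretrained("nvidia/MambaVision-S-1K", trust_remote_code=True)
# Move the model to the same device as the inputs prepared in the previous step
model.cuda().eval()
# The model returns the average-pooled features and the intermediate feature maps
out_avg_pool, features = model(inputs)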
-
Hardware Recommendation: The examples above call .cuda() and therefore need an NVIDIA GPU with CUDA support. If none is available locally, consider a cloud GPU service such as AWS EC2 with NVIDIA GPUs, Google Cloud Platform, or Azure.
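The classification step above stops at the raw model outputs. As a minimal sketch of the usual post-processing, the code below turns a vector of class scores into the top-k labels with softmax probabilities. It uses plain Python lists and toy values so that it runs without a GPU; `top_k_predictions` and the toy data are hypothetical and are not part of the mambavision package.

```python
import math


def top_k_predictions(logits, id2label, k=5):
    """Return the k most probable (label, probability) pairs.

    logits: list of raw class scores for one image.
    id2label: dict mapping class index to a human-readable label.
    """
    if not logits:
        raise ValueError("logits must not be empty")
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    # Rank class indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label.get(i, str(i)), probs[i]) for i in ranked[:k]]


if __name__ == "__main__":
    # Toy stand-ins for real model outputs and the real label map.
    toy_logits = [0.5, 2.0, -1.0, 1.0]
    toy_labels = {0: "cat", 1: "dog", 2: "car", 3: "bird"}
    for label, prob in top_k_predictions(toy_logits, toy_labels, k=2):
        print(f"{label}: {prob:.3f}")
```

With the classification model loaded as above, the inputs would presumably come from something like `outputs['logits'][0].tolist()` and `model.config.id2label`; both names are assumptions about the output format of the remote model code and should be checked against the actual return value.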
License
MambaVision is distributed under the NVIDIA Source Code License-NC. More details can be found in the license documentation.