uniformer_image

Sense-X

Introduction

UniFormer is a vision backbone for image classification that unifies convolution and self-attention in a concise transformer format. The same design also serves as a backbone for video classification, object detection, semantic segmentation, and pose estimation.

Architecture

The UniFormer architecture integrates convolution and self-attention through a single building block, the Multi-Head Relation Aggregator (MHRA). Shallow layers use local MHRA, which aggregates each token only with its neighbors and so reduces computational cost. Deeper layers use global MHRA, which relates every token to every other token to capture long-range dependencies. This combination lets the model perform well across tasks without additional training data.
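
The contrast between local and global aggregation can be illustrated with a toy sketch in plain Python. This is not the UniFormer implementation: it works on a 1-D token sequence, uses a single head, omits the learned projections, and the function names and per-offset weights are made up for illustration. The local variant weights neighbors by relative position only, while the global variant weights all tokens by content similarity.

```python
import math

def local_aggregate(tokens, weights):
    """Local aggregation: affinity depends only on relative position.

    tokens: list of equal-length float vectors.
    weights: odd-length list of per-offset weights (hypothetical values).
    """
    half = len(weights) // 2
    dim = len(tokens[0])
    out = []
    for i in range(len(tokens)):
        acc = [0.0] * dim
        for k, w in enumerate(weights):
            j = i + k - half
            if 0 <= j < len(tokens):  # neighbors outside the sequence count as zero
                for d in range(dim):
                    acc[d] += w * tokens[j][d]
        out.append(acc)
    return out

def global_aggregate(tokens):
    """Global aggregation: softmax over dot-product similarity to every token."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        acc = [0.0] * dim
        for e, value in zip(exps, tokens):
            for d in range(dim):
                acc[d] += (e / total) * value[d]
        out.append(acc)
    return out

# Four toy tokens with two features each (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
local_out = local_aggregate(tokens, weights=[0.25, 0.5, 0.25])
global_out = global_aggregate(tokens)
```

The local version touches a fixed number of neighbors per token, so its cost grows linearly with the number of tokens. The global version compares every pair of tokens, so its cost grows quadratically, which is why restricting it to deeper, lower-resolution layers saves computation.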

Training

UniFormer models are trained on ImageNet-1K at a resolution of 224x224. The family reaches 86.3% top-1 accuracy on ImageNet-1K classification. With only ImageNet-1K pre-training, it also performs well on downstream benchmarks such as Kinetics-400/600 and COCO.
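
For readers unfamiliar with the metric, top-1 accuracy is the fraction of samples whose highest-scoring class matches the true label. The following is a minimal sketch with made-up scores and labels; the function name is hypothetical and not part of any UniFormer code.

```python
def top1_accuracy(score_rows, labels):
    """Fraction of samples whose highest-scoring class equals the true label."""
    correct = 0
    for scores, label in zip(score_rows, labels):
        predicted = max(range(len(scores)), key=lambda i: scores[i])
        if predicted == label:
            correct += 1
    return correct / len(labels)

# Toy scores for three samples over three classes (illustrative values only).
toy_scores = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]]
toy_labels = [1, 0, 1]
accuracy = top1_accuracy(toy_scores, toy_labels)
```

Here the first two samples are classified correctly and the third is not, so two of three predictions match.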

Guide: Running Locally

To run the UniFormer model locally, follow these steps:

  1. Install Dependencies: Ensure you have Python and PyTorch installed. Install torchvision for image preprocessing and huggingface_hub, which provides hf_hub_download.

  2. Download Model: Use the hf_hub_download function to obtain the model weights from Hugging Face's model hub.

  3. Load and Prepare: Load the model and set it to evaluation mode. Prepare your input image by resizing, cropping, and normalizing it.

  4. Inference: Run the model on the input image to get predictions. The model outputs a score for each of the 1,000 ImageNet classes, and the highest-scoring class is the prediction.

  5. Hardware: Consider using cloud GPUs from providers like AWS, Google Cloud, or Azure for efficient computation.

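import torch
import torchvision.transforms as T
from huggingface_hub import hf_hub_download
from PIL import Image

# Local modules assumed to be on the import path (not pip packages).
from uniformer import uniformer_small
from imagenet_class_index import imagenet_classnames

# Use a GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Build the model and load the ImageNet-1K weights from the Hugging Face Hub.
model = uniformer_small()
model_path = hf_hub_download(repo_id="Sense-X/uniformer_image", filename="uniformer_small_in1k.pth")
state_dict = torch.load(model_path, map_location="cpu")
model.load_state_dict(state_dict)
model = model.to(device)
model = model.eval()

# "example.jpg" is a placeholder path; substitute your own image.
image = Image.open("example.jpg").convert("RGB")
image_transform = T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
image = image_transform(image)
image = image.unsqueeze(0).to(device)  # add a batch dimension

# Inference only, so gradients are not needed.
with torch.no_grad():
    prediction = model(image)
predicted_class_idx = prediction.flatten().argmax(-1).item()
print("Predicted class:", imagenet_classnames[str(predicted_class_idx)][1])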
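
The snippet above reports only the single best class. To see how confident the model is, the raw scores can be turned into probabilities with a softmax and ranked. The sketch below shows the arithmetic in plain Python on made-up logits; the function name is hypothetical, and with PyTorch tensors torch.softmax and torch.topk would do the same job.

```python
import math

def top_k_probabilities(logits, k=5):
    """Return the k most likely (class_index, probability) pairs."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Made-up logits for a 4-class toy problem, not real model output.
toy_logits = [2.0, 0.5, -1.0, 3.0]
top2 = top_k_probabilities(toy_logits, k=2)
for class_idx, prob in top2:
    print(class_idx, round(prob, 3))
```

With the real model, the input list could come from prediction.flatten().tolist(), and each returned index maps to a class name through imagenet_classnames in the same way as the final print statement above.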

License

The UniFormer model is released under the MIT license, allowing for flexible use and modification.
