AIMv2 Large Patch14 224

Developed by: Apple

Introduction

We introduce the AIMv2 family of vision models, pre-trained with a multimodal autoregressive objective. AIMv2 models are simple, effective, and scalable, outperforming OAI CLIP and SigLIP on most multimodal understanding benchmarks. They also surpass DINOv2 on tasks such as open-vocabulary object detection and referring expression comprehension. Notably, the AIMv2-3B model achieves 89.5% accuracy on ImageNet with a frozen trunk.

Architecture

AIMv2 uses a Vision Transformer encoder; this checkpoint is the Large variant with a patch size of 14 and a 224×224 input resolution. The architecture is kept deliberately simple and efficient, which makes pre-training straightforward to scale while remaining strong across downstream vision benchmarks.

Training

AIMv2 models are trained with a multimodal autoregressive objective: the model autoregressively predicts both image patches and text tokens, so a single pre-training task teaches it to represent and relate both modalities. This training approach underlies their strong performance across a range of vision and classification tasks.

Guide: Running Locally

PyTorch

  1. Install Dependencies: Ensure you have transformers, requests, and Pillow installed (pip install transformers requests pillow).
  2. Load Model and Processor:
    import requests
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel
    
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-224")
    model = AutoModel.from_pretrained("apple/aimv2-large-patch14-224", trust_remote_code=True)
    
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    

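After the forward pass above, the model's output should contain one embedding per image patch (assuming the custom model returns a standard last_hidden_state tensor). A common way to obtain a single image-level feature is to mean-pool over the patch dimension. The sketch below uses a random tensor in place of the real output; the shapes are assumptions derived from the checkpoint name, not from the model card: (224 / 14)² = 256 patches, and a hidden size of 1024 for the Large variant.

```python
import torch

# Dummy stand-in for outputs.last_hidden_state: (batch, num_patches, hidden).
# For a 224x224 input with patch size 14: (224 // 14) ** 2 = 256 patches.
# Hidden size 1024 is an assumption for the Large variant.
hidden = torch.randn(1, 256, 1024)

# Mean-pool over the patch dimension to get one feature vector per image.
image_feature = hidden.mean(dim=1)  # shape: (1, 1024)

# L2-normalize so dot products between images become cosine similarities.
image_feature = torch.nn.functional.normalize(image_feature, dim=-1)
print(image_feature.shape)
```

Normalizing the pooled feature is convenient for retrieval-style comparisons, since the similarity between two images reduces to a dot product of their feature vectors.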
JAX

  1. Install Dependencies: Ensure you have transformers, requests, and Pillow installed (pip install transformers requests pillow).
  2. Load Model and Processor:
    import requests
    from PIL import Image
    from transformers import AutoImageProcessor, FlaxAutoModel
    
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-224")
    model = FlaxAutoModel.from_pretrained("apple/aimv2-large-patch14-224", trust_remote_code=True)
    
    inputs = processor(images=image, return_tensors="jax")
    outputs = model(**inputs)
    
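The same mean-pooling can be done on the JAX side. As before, a random array stands in for the real model output, and the 256-patch / 1024-dimension shapes are assumptions for this checkpoint rather than details from the model card.

```python
import numpy as np
import jax.numpy as jnp

# Dummy stand-in for outputs.last_hidden_state: (batch, num_patches, hidden).
# (224 // 14) ** 2 = 256 patches; hidden size 1024 assumed for the Large variant.
hidden = jnp.asarray(np.random.randn(1, 256, 1024).astype(np.float32))

# Mean-pool over the patch axis for a single image-level feature vector.
image_feature = hidden.mean(axis=1)  # shape: (1, 1024)

# L2-normalize so dot products become cosine similarities.
image_feature = image_feature / jnp.linalg.norm(image_feature, axis=-1, keepdims=True)
print(image_feature.shape)
```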

Cloud GPUs

For better performance, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure, which provide scalable and powerful computing resources.

License

The AIMv2 model is licensed under the apple-ascl license.
