apple/aimv2-large-patch14-224
Introduction
We introduce the AIMv2 family of vision models, pre-trained with a multimodal autoregressive objective. AIMv2 models are simple, effective, and scalable, outperforming OpenAI CLIP and SigLIP on most multimodal understanding benchmarks. They also surpass DINOv2 on tasks such as open-vocabulary object detection and referring expression comprehension. Notably, the AIMv2-3B model achieves 89.5% accuracy on ImageNet with a frozen trunk.
Architecture
AIMv2 models use a Vision Transformer backbone; this checkpoint is the large variant with a patch size of 14, operating at a 224×224 input resolution. The design favors simplicity and efficiency, which lets the architecture scale effectively while delivering strong performance across benchmarks.
Training
AIMv2 models are trained with a multimodal autoregressive objective: image patches and text tokens are combined into a single sequence, and the model learns to predict each element from the ones before it. This single objective yields representations that transfer well across a range of vision and vision-language tasks.
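To make the objective concrete, here is a minimal, hypothetical sketch of such a combined loss in PyTorch. It is not Apple's training code; the tensor names and the choice of a regression loss for patch targets are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a multimodal autoregressive loss (not Apple's
# actual implementation). A causal decoder predicts every element of the
# combined [image patches, text tokens] sequence from its predecessors:
# continuous patch targets use a regression loss, text targets use
# next-token cross-entropy.
def multimodal_ar_loss(patch_preds, patch_targets, text_logits, text_targets):
    # patch_preds, patch_targets: (batch, num_patches, patch_dim)
    # text_logits: (batch, num_tokens, vocab_size); text_targets: (batch, num_tokens)
    image_loss = F.mse_loss(patch_preds, patch_targets)
    text_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )
    return image_loss + text_loss
```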
Guide: Running Locally
PyTorch
- Install Dependencies: Ensure you have `transformers`, `requests`, and `PIL` (the `Pillow` package) installed.
- Load Model and Processor:
```python
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Load a sample image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the AIMv2 model
processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-224")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-224", trust_remote_code=True)

# Preprocess the image and run a forward pass
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
```
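The forward pass returns patch-level features. A simple mean pool turns them into a single image embedding; this assumes the output exposes the standard `last_hidden_state` field:

```python
# Mean-pool patch features into one image embedding
# (assumes the output exposes the standard `last_hidden_state` field).
features = outputs.last_hidden_state    # shape: (1, num_patches, hidden_dim)
image_embedding = features.mean(dim=1)  # shape: (1, hidden_dim)
```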
JAX
- Install Dependencies: Ensure you have `transformers`, `requests`, and `PIL` (the `Pillow` package) installed.
- Load Model and Processor:
```python
import requests
from PIL import Image
from transformers import AutoImageProcessor, FlaxAutoModel

# Load a sample image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the Flax AIMv2 model
processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-224")
model = FlaxAutoModel.from_pretrained("apple/aimv2-large-patch14-224", trust_remote_code=True)

# Preprocess the image and run a forward pass
inputs = processor(images=image, return_tensors="jax")
outputs = model(**inputs)
```
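As in the PyTorch example, the patch features can be mean-pooled into a single embedding; again, this assumes the Flax output exposes `last_hidden_state`:

```python
import jax.numpy as jnp

# Mean-pool patch features into one image embedding
# (assumes the output exposes the standard `last_hidden_state` field).
features = outputs.last_hidden_state          # shape: (1, num_patches, hidden_dim)
image_embedding = jnp.mean(features, axis=1)  # shape: (1, hidden_dim)
```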
Cloud GPUs
For better performance, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure, which provide scalable and powerful computing resources.
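On a GPU instance, the PyTorch example above runs unchanged once the model and inputs are moved to the device; a minimal sketch:

```python
import torch

# Move the model and the preprocessed inputs to the GPU when available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
outputs = model(**inputs)
```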
License
The AIMv2 models are released under the Apple Sample Code License (apple-ascl).