Aria
rhymes-ai
Introduction
Aria is an advanced multimodal model by RHYMES.AI, designed for handling diverse tasks across video, document, language, and coding domains. It features a mixture-of-experts architecture with efficient visual input processing and supports a long multimodal context window.
Architecture
- SoTA Multimodal Performance: Aria excels in video and document understanding, offering superior performance on a variety of tasks.
- Lightweight and Fast: With 3.9 billion activated parameters per token, it processes visual inputs of varying sizes and aspect ratios efficiently.
- Long Context Window: Capable of processing up to 64K tokens, enabling rapid video captioning.
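To get a feel for what a 64K-token window means for video, the sketch below estimates how many frames fit alongside a text prompt. The per-frame token cost and the text reserve are illustrative assumptions, not figures stated here, and 64K is taken to mean 64 × 1024; substitute the values your processor actually produces.

```python
def max_frames(context_tokens: int, tokens_per_frame: int, reserved_text_tokens: int = 0) -> int:
    """Return how many video frames fit in the context after reserving text tokens."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    available = context_tokens - reserved_text_tokens
    return max(available, 0) // tokens_per_frame


CONTEXT_TOKENS = 64 * 1024   # 64K window from the feature list (assumed to mean 65,536)
TOKENS_PER_FRAME = 128       # assumption: visual tokens produced per frame
RESERVED_TEXT_TOKENS = 1024  # assumption: room kept for the prompt and the answer

print(max_frames(CONTEXT_TOKENS, TOKENS_PER_FRAME, RESERVED_TEXT_TOKENS))  # → 504
```

Halving the per-frame cost doubles the frame budget, so the resolution at which frames are encoded matters as much as the window size itself.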
Training
Aria is trained with native multimodal pre-training. The resulting base models, Aria-Base-8K and Aria-Base-64K, are available for research. It uses a mixture-of-experts model to process inputs across different modalities.
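The idea behind a mixture-of-experts layer can be sketched in a few lines: a router scores every expert for each token, only the top-scoring few are run, and their outputs are blended. The example below is a generic top-k routing toy in plain Python, not Aria's actual router; the expert functions, router scores, and the choice of k=2 are all made up for illustration.

```python
import math


def softmax(scores):
    """Convert raw router scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, router_scores, k=2):
    """Blend the outputs of the selected experts; unselected experts are never called."""
    weights = route_top_k(router_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy "experts": hypothetical scalar functions standing in for feed-forward blocks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
router_scores = [0.1, 2.0, 1.5, -1.0]  # made-up router scores for a single token

print(sorted(route_top_k(router_scores, k=2)))  # → [1, 2]
print(moe_forward(3.0, experts, router_scores, k=2))
```

Because only k experts run per token, the compute per token tracks the activated parameters rather than the full parameter count.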
Guide: Running Locally
Basic Steps
- Installation: Use the following commands to install the necessary libraries:

    pip install transformers==4.45.0 accelerate==0.34.1 sentencepiece==0.2.0 torchvision requests torch Pillow
    pip install flash-attn --no-build-isolation
    pip install grouped_gemm==0.1.6

- Inference: Loading the model in bfloat16 precision requires an A100 (80GB) GPU. Use the code snippet below for inference:

    import requests
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id_or_path = "rhymes-ai/Aria"

    # Load the model in bfloat16 and let accelerate place it on the available device(s).
    model = AutoModelForCausalLM.from_pretrained(
        model_id_or_path,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )
    processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)

    # Download the example image.
    image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
    image = Image.open(requests.get(image_path, stream=True).raw)

    # One user turn: an image placeholder followed by the text question.
    messages = [
        {
            "role": "user",
            "content": [
                {"text": None, "type": "image"},
                {"text": "what is the image?", "type": "text"},
            ],
        }
    ]

    # Build the prompt, then move the inputs to the model's dtype and device.
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=text, images=image, return_tensors="pt")
    inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
        output = model.generate(
            **inputs,
            max_new_tokens=500,
            stop_strings=["<|im_end|>"],
            tokenizer=processor.tokenizer,
            do_sample=True,
            temperature=0.9,
        )
        # Keep only the newly generated tokens, dropping the prompt.
        output_ids = output[0][inputs["input_ids"].shape[1]:]
        result = processor.decode(output_ids, skip_special_tokens=True)

    print(result)
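Before running the inference snippet, it can help to confirm that the pinned versions from the installation step are what is actually installed. The helper below is a small sketch using the standard library's importlib.metadata; the function name check_pins is hypothetical, and only three of the pinned packages are listed (others can be added the same way).

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the installation commands in this guide.
PINNED = {
    "transformers": "4.45.0",
    "accelerate": "0.34.1",
    "sentencepiece": "0.2.0",
}


def check_pins(pins):
    """Return {package: status}, where status is 'ok', 'missing', or a mismatch note."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found == wanted else f"found {found}, expected {wanted}"
    return report


for package, status in check_pins(PINNED).items():
    print(f"{package}: {status}")
```

Any line that does not read "ok" points at a package to reinstall with the pinned version.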
Cloud GPUs
Consider using cloud GPU services such as AWS, Google Cloud, or Azure for running the model efficiently.
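When picking an instance, the 80GB requirement can be sanity-checked with simple arithmetic: in a mixture-of-experts model every expert has to be resident in memory, even though only a fraction is activated per token, so the footprint follows the total parameter count. The sketch below estimates the weight footprint in bfloat16. The total parameter count used here is a placeholder assumption, since this guide only states the activated count; replace it with the figure from the model's documentation.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


TOTAL_PARAMS = 25e9  # assumption: placeholder total parameter count, not stated in this guide
BF16_BYTES = 2       # bfloat16 stores each value in 16 bits

needed = weight_memory_gib(TOTAL_PARAMS, BF16_BYTES)
print(f"weights alone: {needed:.1f} GiB")
```

Activations and the key-value cache for long multimodal inputs come on top of this figure, so the chosen GPU needs headroom beyond the weights.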
License
Aria is released under the Apache-2.0 license, allowing wide usage and modification with proper attribution.