Introduction

Aria is an advanced multimodal model by RHYMES.AI, designed for handling diverse tasks across video, document, language, and coding domains. It features a mixture-of-experts architecture with efficient visual input processing and supports a long multimodal context window.

Architecture

  • SoTA Multimodal Performance: Aria reports state-of-the-art results across a range of multimodal tasks, with particular strength in video and document understanding.
  • Lightweight and Fast: With 3.9 billion activated parameters per token, it processes visual inputs of varying sizes and aspect ratios efficiently.
  • Long Context Window: Accepts multimodal inputs of up to 64K tokens, which supports long-input tasks such as video captioning.
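
To make the "activated parameters per token" figure concrete, here is a minimal sketch of top-k expert routing, the general mechanism behind mixture-of-experts layers. It is a toy illustration in plain Python, not Aria's actual implementation: the expert functions, router logits, and the choice of k=2 are all assumptions made up for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so that they sum to 1 over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of only the k selected experts."""
    out = 0.0
    for i, w in route_top_k(router_logits, k):
        out += w * experts[i](x)  # experts that were not selected never run
    return out

# Hypothetical toy setup: four scalar "experts" and one token's router logits.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]

selected = route_top_k(router_logits, k=2)
y = moe_forward(1.0, experts, router_logits, k=2)
print(selected, y)
```

Only two of the four experts run for this token, which is why the per-token compute of a mixture-of-experts model depends on its activated parameters and not on its total parameter count.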

Training

Aria is trained with native multimodal pre-training. Its base models, Aria-Base-8K and Aria-Base-64K, are available for research. The model uses a mixture-of-experts architecture to process inputs across different modalities.

Guide: Running Locally

Basic Steps

  1. Installation: Install the required libraries with the following commands:

    pip install transformers==4.45.0 accelerate==0.34.1 sentencepiece==0.2.0 torchvision requests torch Pillow
    pip install flash-attn --no-build-isolation
    pip install grouped_gemm==0.1.6
    
  2. Inference: Loading the model in bfloat16 precision requires an A100 (80GB) GPU. Use the code snippet below for inference:

    import requests
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor
    
    model_id_or_path = "rhymes-ai/Aria"
    
    # Load the weights in bfloat16 and let accelerate place them on the available GPU(s).
    model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)
    
    # Download the example image.
    image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
    image = Image.open(requests.get(image_path, stream=True).raw)
    
    # One user turn: an image placeholder followed by the text prompt.
    messages = [{"role": "user", "content": [{"text": None, "type": "image"}, {"text": "what is the image?", "type": "text"}]}]
    
    # Render the chat template, then tokenize the text and preprocess the image.
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=text, images=image, return_tensors="pt")
    # Match the pixel values to the model dtype and move all tensors to the model device.
    inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    # torch.autocast replaces the deprecated torch.cuda.amp.autocast.
    with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model.generate(**inputs, max_new_tokens=500, stop_strings=["<|im_end|>"], tokenizer=processor.tokenizer, do_sample=True, temperature=0.9)
        # Drop the prompt tokens so that only the newly generated text is decoded.
        output_ids = output[0][inputs["input_ids"].shape[1]:]
        result = processor.decode(output_ids, skip_special_tokens=True)
    
    print(result)
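
The messages list in the inference snippet is a plain Python structure: one user turn whose content holds an image placeholder followed by a text entry. A small helper can build it for different prompts. This is a sketch only: build_messages and its num_images parameter are hypothetical names, and whether the processor accepts several image placeholders in one turn is an assumption that should be checked against the model's own documentation.

```python
def build_messages(prompt, num_images=1):
    """Build a single-turn chat message list in the shape used by the inference snippet.

    Each image is represented by a placeholder entry; the images themselves are
    passed to the processor separately.
    """
    content = [{"text": None, "type": "image"} for _ in range(num_images)]
    content.append({"text": prompt, "type": "text"})
    return [{"role": "user", "content": content}]

# Reproduces the message structure from the inference snippet.
messages = build_messages("what is the image?")
print(messages)
```

The result can be passed to processor.apply_chat_template in place of the hand-written list.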
    

Cloud GPUs

If an A100 (80GB) GPU is not available locally, consider a cloud GPU service such as AWS, Google Cloud, or Azure to run the model.
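
When choosing a GPU, a rough lower bound on memory is the size of the weights: parameter count times bytes per parameter (2 bytes for bfloat16). The sketch below shows that arithmetic. The total parameter count used here is a placeholder assumption, not a figure from this document, so substitute the real value from the model card. Note that a mixture-of-experts model must keep all experts in memory, so the total parameter count is what matters here, not the 3.9 billion activated parameters.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate memory needed for the weights alone, in GiB.

    Excludes activations, the KV cache, and framework overhead, so the real
    requirement is higher.
    """
    return num_params * bytes_per_param / 2**30

# Placeholder assumption: replace with the model's actual total parameter count.
HYPOTHETICAL_TOTAL_PARAMS = 25_000_000_000

gib = weight_memory_gib(HYPOTHETICAL_TOTAL_PARAMS)  # bfloat16 = 2 bytes per parameter
print(f"weights alone: about {gib:.1f} GiB")
```

Leave headroom above this figure for activations and the KV cache, especially with long multimodal inputs.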

License

Aria is released under the Apache-2.0 license, allowing wide usage and modification with proper attribution.
