Introduction

Aria-Chat is a multimodal model optimized for open-ended, multi-round dialog, aiming to provide a seamless open-source chat experience. It offers improved reliability when generating long outputs and stronger multilingual capabilities.

Architecture

Aria-Chat is a mixture-of-experts model with 25.3 billion parameters in total (roughly 3.9 billion activated per token). It handles both text and images within a conversation, making it suitable for a wide range of multimodal applications.
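
As an illustration of what a multi-round, multimodal exchange looks like at the API level, the sketch below builds a message list with an image turn, a reply, and a text-only follow-up. The schema matches the Inference example later in this guide; the assistant turn and all wording here are illustrative, not actual model output.

# Illustrative multi-round conversation: an image turn, the model's reply,
# and a text-only follow-up. Roles alternate between "user" and "assistant".
messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},  # placeholder for the attached image
            {"text": "What breed is this cat?", "type": "text"},
        ],
    },
    {
        "role": "assistant",  # illustrative reply; in practice this is generated by the model
        "content": [{"text": "It looks like a tabby.", "type": "text"}],
    },
    {
        "role": "user",
        "content": [{"text": "Suggest a name for it.", "type": "text"}],
    },
]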

Evaluation

Aria-Chat was evaluated on WildVision-Bench, where it shows significant improvements on real-world conversational tasks. Development focused on optimizing for actual use cases rather than solely on benchmark scores.

Guide: Running Locally

Installation

To run Aria-Chat locally, you need to install the following Python packages:

pip install transformers==4.45.0 accelerate==0.34.1 sentencepiece==0.2.0 torchvision requests torch Pillow
pip install flash-attn --no-build-isolation
pip install grouped_gemm==0.1.6
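
Once the packages are installed, a quick sanity check (a minimal sketch; it assumes only the packages above) confirms that the key dependencies import cleanly and a CUDA device is visible:

# Sanity-check the environment: core imports and GPU visibility.
import torch
import transformers
import flash_attn

print("transformers:", transformers.__version__)  # expect 4.45.0
print("flash-attn:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())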

Inference

You can load the model on a single A100 (80 GB) GPU with bfloat16 precision. Here is a basic usage example:

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id_or_path = "rhymes-ai/Aria-Chat"

# Load the model in bfloat16 and let accelerate place it on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)

# Fetch a demo image over HTTP.
image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
image = Image.open(requests.get(image_path, stream=True).raw)

# A single user turn that pairs an image with a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": "what is the image?", "type": "text"},
        ],
    }
]

# Render the chat template, preprocess text and image, then move everything
# to the model's dtype and device.
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=500,
        stop_strings=["<|im_end|>"],  # stop at the end-of-turn token
        tokenizer=processor.tokenizer,
        do_sample=True,
        temperature=0.9,
    )
    # Strip the prompt tokens and decode only the newly generated reply.
    output_ids = output[0][inputs["input_ids"].shape[1]:]
    result = processor.decode(output_ids, skip_special_tokens=True)

print(result)
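
Since Aria-Chat is tuned for multi-round dialog, you can continue the conversation by appending the reply in result as an assistant turn plus a new user turn, then repeating the same preprocessing and generation. A minimal sketch reusing the objects from the example above (the follow-up question is illustrative):

# Append the assistant's reply and a text-only follow-up question, then
# rebuild the prompt with the same chat template and generate again.
messages.append({"role": "assistant", "content": [{"text": result, "type": "text"}]})
messages.append({"role": "user", "content": [{"text": "Can you describe it in one sentence?", "type": "text"}]})

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")  # the image is still referenced by the first turn
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=500,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        do_sample=True,
        temperature=0.9,
    )
follow_up = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(follow_up)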

Cloud GPUs

For optimal performance, consider a cloud provider that offers A100 (80 GB) GPUs, which comfortably meet the model's memory requirements.
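
As a rough back-of-the-envelope check on memory: 25.3 billion parameters at 2 bytes each in bfloat16 come to about 50.6 GB for the weights alone, which is why a single 80 GB A100 fits the model with headroom for activations and the KV cache.

# Weights-only memory estimate for bfloat16 inference.
params = 25.3e9        # total parameter count
bytes_per_param = 2    # bfloat16
print(f"~{params * bytes_per_param / 1e9:.1f} GB of weights")  # ~50.6 GB, fits an 80 GB A100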

License

The Aria-Chat model is licensed under the Apache 2.0 License, allowing for wide usage and adaptation with proper attribution.
