Megrez-3B-Omni

Infinigence

Introduction

Megrez-3B-Omni is a multi-modal understanding model developed by Infinigence AI, extending the Megrez-3B-Instruct language model. It is capable of understanding and analyzing images, text, and audio with high precision.

Architecture

Megrez-3B-Omni integrates various modules for different modalities:

  • Language Module: Llama-2 architecture with Grouped Query Attention (GQA)
  • Vision Module: SigLip-SO400M
  • Audio Module: Whisper-large-v3 (encoder-only)
  • Total Parameters: 4 billion
  • Supported Languages: Chinese and English
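
Since these modules are assembled by the model's custom (trust_remote_code) implementation, their exact submodule names are not documented here. The short sketch below is an illustrative way to confirm the parameter budget and see how it is split across the language, vision, and audio parts; it assumes only standard PyTorch/transformers APIs and the same placeholder model path used later in this card.

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "{{PATH_TO_PRETRAINED_MODEL}}",  # same placeholder as the inference example below
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )

    # Print each top-level submodule (names are defined by the remote code) and
    # its parameter count, then the ~4B total.
    for name, module in model.named_children():
        print(f"{name}: {sum(p.numel() for p in module.parameters()) / 1e9:.2f}B parameters")
    print(f"total: {sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")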

Training

The model has been trained across three modalities:

  • Image Understanding: Uses the SigLip-SO400M encoder to construct image tokens, achieving high scores on benchmarks such as OpenCompass (a rough sketch of this path follows the list).
  • Language Understanding: Retains the text capabilities of Megrez-3B-Instruct, with accuracy varying by less than 2% compared to the single-modal version.
  • Audio Understanding: Uses the Qwen2-Audio/whisper-large-v3 encoder for audio input, supporting speech in both Chinese and English.
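
As a rough illustration of the image path described above, the sketch below encodes an image with the publicly released SigLIP-SO400M vision tower and prints the resulting patch-embedding sequence. The checkpoint name google/siglip-so400m-patch14-384 is an assumption standing in for the model's built-in vision module, and the projector that maps these embeddings into the language model's space is omitted.

    from PIL import Image
    from transformers import AutoImageProcessor, SiglipVisionModel

    # Stand-in vision tower (assumption): the public SigLIP-SO400M checkpoint.
    vision = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
    processor = AutoImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")

    image = Image.open("./data/sample_image.jpg")
    inputs = processor(images=image, return_tensors="pt")

    # One embedding per image patch; these serve as the "image tokens" handed to
    # the language model (after a projection layer inside Megrez-3B-Omni that is
    # not shown here).
    patch_embeddings = vision(**inputs).last_hidden_state
    print(patch_embeddings.shape)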

Guide: Running Locally

To run Megrez-3B-Omni locally, follow these steps:

  1. Installation:
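
    • The required packages are not listed in this excerpt; at a minimum you will need recent torch and transformers releases (the model is loaded with trust_remote_code=True), plus flash-attn if you keep attn_implementation="flash_attention_2" in the example below. The short check that follows is an illustrative sanity test, not part of the upstream instructions.

    import torch          # needed for bfloat16 weights and .cuda()
    import transformers   # provides AutoModelForCausalLM

    print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    print("transformers:", transformers.__version__)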

  2. Inference Example:

    • Use the transformers library for model inference. Load the model with AutoModelForCausalLM and chat with it using text, image, and audio inputs (an audio-plus-image variant is sketched after this list).
    import torch
    from transformers import AutoModelForCausalLM
    
    path = "{{PATH_TO_PRETRAINED_MODEL}}"  # Specify the model path.
    
    model = (
        AutoModelForCausalLM.from_pretrained(
            path,
            trust_remote_code=True,
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
        )
        .eval()
        .cuda()
    )
    
    # Example of using text and image
    messages = [
        {
            "role": "user",
            "content": {
                "text": "Describe the image.",
                "image": "./data/sample_image.jpg",
            },
        }
    ]
    
    response = model.chat(messages, sampling=False, max_new_tokens=100, temperature=0)
    print(response)
    
  3. Hardware Recommendations:

    • Consider using cloud GPUs such as NVIDIA A100 or H100 for efficient model inference.
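
As referenced in step 2, the same chat interface also accepts audio. The snippet below is a hedged variant of the inference example, reusing the model object loaded there; the audio file path is a placeholder, and combining an image with a spoken question follows the same content-dictionary pattern shown above.

    # Audio + image input (paths are placeholders; reuses `model` from step 2)
    messages = [
        {
            "role": "user",
            "content": {
                "image": "./data/sample_image.jpg",   # what to look at
                "audio": "./data/sample_audio.m4a",   # spoken question about the image
            },
        }
    ]

    response = model.chat(messages, sampling=False, max_new_tokens=100, temperature=0)
    print(response)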

License

Megrez-3B-Omni is open-sourced under the Apache-2.0 License. Users should be aware that the model can hallucinate and should ensure their use complies with data and safety requirements. The developers disclaim responsibility for any issues arising from use of the model.
