Megrez-3B-Omni
Introduction
Megrez-3B-Omni is a multi-modal understanding model developed by Infinigence AI, extending the Megrez-3B-Instruct language model. It is capable of understanding and analyzing images, text, and audio with high precision.
Architecture
Megrez-3B-Omni integrates a dedicated module for each modality (a conceptual sketch of how they fit together follows the list):
- Language Module: Llama-2 architecture with Grouped-Query Attention (GQA)
- Vision Module: SigLip-SO400M
- Audio Module: Whisper-large-v3 (encoder-only)
- Total Parameters: 4 billion
- Supported Languages: Chinese and English
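A minimal, hypothetical sketch of how such an omni-modal layout can be wired together: each encoder's features are projected into the language model's embedding space and concatenated with the text embeddings before the Llama-style decoder attends over them. The class name, projector design, and hidden sizes below are illustrative assumptions, not the actual Megrez-3B-Omni implementation.

```python
import torch
import torch.nn as nn


class OmniInputAssembler(nn.Module):
    """Hypothetical fusion step: project vision/audio features into the
    LLM embedding space and join them with the text embeddings."""

    def __init__(self, text_dim=2560, vision_dim=1152, audio_dim=1280):
        super().__init__()
        # Linear projectors map each encoder's output to the LLM hidden size
        # (dimensions here are illustrative, not the real configuration).
        self.vision_proj = nn.Linear(vision_dim, text_dim)
        self.audio_proj = nn.Linear(audio_dim, text_dim)

    def forward(self, text_embeds, vision_feats=None, audio_feats=None):
        parts = []
        if vision_feats is not None:
            parts.append(self.vision_proj(vision_feats))
        if audio_feats is not None:
            parts.append(self.audio_proj(audio_feats))
        parts.append(text_embeds)
        # The decoder then attends over this fused token sequence.
        return torch.cat(parts, dim=1)


assembler = OmniInputAssembler()
fused = assembler(
    text_embeds=torch.randn(1, 16, 2560),    # text token embeddings
    vision_feats=torch.randn(1, 64, 1152),   # SigLip-style patch features
    audio_feats=torch.randn(1, 32, 1280),    # Whisper-style frame features
)
print(fused.shape)  # torch.Size([1, 112, 2560])
```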
Training
The model has been trained across three modalities:
- Image Understanding: Uses SigLip-SO400M to construct image tokens, achieving strong scores on benchmarks such as OpenCompass.
- Language Understanding: Retains the base model's text capabilities, with less than a 2% drop in accuracy compared to the single-modal Megrez-3B-Instruct.
- Audio Understanding: Uses the Qwen2-Audio/whisper-large-v3 encoder for audio input, supporting both Chinese and English speech; the sketch below shows how the public upstream encoders can be loaded.
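To make the encoder names above concrete, here is a small, hedged sketch that loads the public upstream checkpoints with the Hugging Face transformers library and runs dummy inputs through them. The checkpoint names are the open SigLIP and Whisper releases, not Megrez's internal fine-tuned weights, and the feature shapes are only indicative.

```python
import torch
from transformers import SiglipVisionModel, WhisperModel

# Public upstream encoders of the kind named above (not Megrez's own weights).
vision = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
audio = WhisperModel.from_pretrained("openai/whisper-large-v3").encoder

with torch.no_grad():
    # Dummy image (batch, channels, height, width) and log-mel audio features
    # (batch, mel_bins, frames); real inputs come from the matching processors.
    img_feats = vision(pixel_values=torch.randn(1, 3, 384, 384)).last_hidden_state
    aud_feats = audio(input_features=torch.randn(1, 128, 3000)).last_hidden_state

print(img_feats.shape)  # roughly (1, num_patches, 1152)
print(aud_feats.shape)  # roughly (1, 1500, 1280)
```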
Guide: Running Locally
To run Megrez-3B-Omni locally, follow these steps:
- Installation:
  - Set up the environment by following the instructions in the Infini-Megrez-Omni GitHub repository.
- Inference Example:
  - Use the transformers library for model inference. Load the model using AutoModelForCausalLM and interact with it using text, image, and audio inputs, as in the example below.
```python
import torch
from transformers import AutoModelForCausalLM

path = "{{PATH_TO_PRETRAINED_MODEL}}"  # Specify the model path.
model = (
    AutoModelForCausalLM.from_pretrained(
        path,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    .eval()
    .cuda()
)

# Example of using text and image
messages = [
    {
        "role": "user",
        "content": {
            "text": "Describe the image.",
            "image": "./data/sample_image.jpg",
        },
    }
]
response = model.chat(messages, sampling=False, max_new_tokens=100, temperature=0)
print(response)
```
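The same chat interface can also take audio. The snippet below continues from the block above (it reuses the loaded model) and is a hedged variation: the file path is a placeholder, and the "audio" content key is assumed to follow the pattern shown in the repository's examples.

```python
# Hypothetical follow-up turn using audio input with the same model.chat API.
messages = [
    {
        "role": "user",
        "content": {
            "text": "Transcribe and summarize this recording.",
            "audio": "./data/sample_audio.wav",  # placeholder path
        },
    }
]
response = model.chat(messages, sampling=False, max_new_tokens=128, temperature=0)
print(response)
```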
- Hardware Recommendations:
  - Consider using a cloud GPU such as an NVIDIA A100 or H100 for efficient model inference (a rough memory estimate is sketched below).
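As a quick sanity check on hardware sizing, the sketch below estimates the weight memory for a ~4-billion-parameter model held in bfloat16 and compares it against the free memory CUDA reports. It ignores activation and KV-cache overhead, so treat the result as a lower bound.

```python
import torch

# ~4B parameters at 2 bytes each (bfloat16); activations and KV cache are extra.
num_params = 4e9
weight_gib = num_params * 2 / 1024**3
print(f"Approximate weight memory: {weight_gib:.1f} GiB")

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory free/total: {free / 1024**3:.1f} / {total / 1024**3:.1f} GiB")
```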
License
Megrez-3B-Omni is open-sourced under the Apache-2.0 License. Users are advised to be cautious of the model's potential for generating hallucinations and to ensure compliance with data and safety standards. The developers disclaim responsibility for any issues arising from the use of the model.