Long-VITA-1M Model Documentation
Introduction
Long-VITA-1M is a long-context vision-language model developed by VITA-MLLM that can process over 1 million tokens and offers strong image and video understanding capabilities.
Architecture
Long-VITA is initially trained on Ascend NPUs using MindSpeed. For inference and evaluation on NVIDIA GPUs, the model is implemented with Megatron and Transformer Engine, and the converted weights are available on Hugging Face.
Training
The model is trained on the Long-VITA-Training-Data dataset. Training runs on Ascend NPUs, whose hardware acceleration helps handle the model's extended context length.
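For reference, here is a minimal sketch of fetching the training data with the huggingface_hub client; the exact repository ID (VITA-MLLM/Long-VITA-Training-Data) and local directory are assumptions, so check the Long-VITA repository for the authoritative download instructions.

```python
# Minimal sketch: download the Long-VITA training data from the Hugging Face Hub.
# The repo_id and local_dir below are assumptions; verify the actual dataset path before use.
from huggingface_hub import snapshot_download

dataset_dir = snapshot_download(
    repo_id="VITA-MLLM/Long-VITA-Training-Data",  # assumed dataset repository ID
    repo_type="dataset",
    local_dir="./Long-VITA-Training-Data",        # assumed local target directory
)
print(f"Training data downloaded to: {dataset_dir}")
```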
Guide: Running Locally
- Clone the Repository: Access the GitHub repository at Long-VITA GitHub.
- Install Dependencies: Ensure all required libraries and frameworks are installed.
- Download Model Weights: Obtain the converted model weights from Hugging Face (see the download sketch after this list).
- Run Inference: Use NVIDIA GPUs with Megatron and Transformer Engine for optimal performance.
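As a starting point, the following sketch downloads the converted Hugging Face weights. The repository ID and local directory are assumptions; consult the Long-VITA GitHub repository for the exact weight locations and the Megatron / Transformer Engine launch scripts.

```python
# Minimal sketch: fetch the converted Long-VITA-1M weights from the Hugging Face Hub.
# The repo_id and local_dir below are assumptions; replace them with the actual
# repository listed on the VITA-MLLM Hugging Face page.
from huggingface_hub import snapshot_download

weights_dir = snapshot_download(
    repo_id="VITA-MLLM/Long-VITA-1M",    # assumed model repository ID
    local_dir="./Long-VITA-1M-weights",  # assumed local target directory
)
print(f"Model weights downloaded to: {weights_dir}")

# Inference itself is launched through the Megatron / Transformer Engine scripts
# provided in the Long-VITA GitHub repository; see that repository for the exact
# launch commands and GPU requirements.
```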
Suggested Cloud GPUs: Consider using cloud services like AWS EC2 with GPU instances or Google Cloud Platform for efficient processing.
License
The Long-VITA-1M model is licensed under the Apache 2.0 License. Users must comply with the Acceptable Use Policy, which prohibits harmful activities, violations of law, and unethical use cases.