Long-VITA-128K
VITA-MLLM
Introduction
Long-VITA-128K belongs to Long-VITA, a family of long-context visual language models that supports over 1 million tokens. It is trained on Ascend NPUs with MindSpeed and can also run on Nvidia GPUs using Megatron with Transformer Engine.
Architecture
The model uses a Transformer-based architecture optimized for very long token sequences. It processes visual and textual inputs together to support image and video understanding.
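As a rough illustration of what a long context window means for video input, the sketch below estimates how many frames fit into a token budget. The 131,072-token window (reading "128K" as 128 x 1024), the 256 tokens per frame, the 2,048 tokens reserved for text, and the function name max_frames are all assumptions made for illustration; none of these figures come from the Long-VITA documentation.

```python
def max_frames(context_tokens, tokens_per_frame, reserved_text_tokens=0):
    """Estimate how many frames fit in a context window.

    All three arguments are illustrative inputs; check the model's own
    documentation for the real window size and per-frame token cost.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    # Tokens left for visual input once the prompt and answer are set aside.
    available = max(context_tokens - reserved_text_tokens, 0)
    return available // tokens_per_frame


# Assumed values: 128 * 1024 token window, 256 tokens per frame,
# 2048 tokens kept back for the text prompt and the generated answer.
print(max_frames(128 * 1024, 256, reserved_text_tokens=2048))  # 504
```

Under these assumed numbers, a few hundred frames fit in a single pass; a different per-frame token cost changes the result proportionally.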
Training
Long-VITA-128K is trained on the VITA-MLLM/Long-VITA-Training-Data dataset. Training runs on Ascend NPUs with MindSpeed; for inference on Nvidia GPUs, the model is implemented on Megatron with Transformer Engine.
Guide: Running Locally
To run Long-VITA-128K locally, follow these steps:
- Clone the Repository: Clone the Long-VITA repository from GitHub.
- Set Up Environment: Install the required dependencies and libraries, such as PyTorch and Megatron.
- Download Model Weights: Download the converted weights from the Hugging Face Model Hub.
- Run Inference: Run inference on GPUs. Cloud GPU instances from providers such as AWS EC2, Google Cloud, or Azure are recommended for optimal performance.
- Evaluate Results: Test the model with visual and textual data to evaluate its image and video understanding capabilities.
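The environment and download steps above can be sketched in Python as follows. This is a minimal sketch, not the project's official setup script: the import names (torch, megatron, transformer_engine, huggingface_hub) are assumed from the libraries named in this guide, the helper names missing_dependencies and download_weights are hypothetical, and the repository ID in the usage comment is a guess built from the organization and model names, so confirm it on the Hugging Face Model Hub before use.

```python
import importlib.util

# Assumed import names for the libraries mentioned in this guide; the exact
# packages and versions are not listed in the source.
REQUIRED_MODULES = {
    "PyTorch": "torch",
    "Megatron": "megatron",
    "Transformer Engine": "transformer_engine",
    "Hugging Face Hub client": "huggingface_hub",
}


def missing_dependencies(modules=REQUIRED_MODULES):
    """Return the display names of modules that cannot be found."""
    missing = []
    for display_name, module_name in modules.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module_name) is None:
            missing.append(display_name)
    return missing


def download_weights(repo_id, local_dir):
    """Download a model repository from the Hugging Face Hub."""
    # Imported lazily so the dependency check works without this package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Example usage (needs network access; the repo ID is an assumption):
#   missing = missing_dependencies()
#   if missing:
#       print("Install first:", ", ".join(missing))
#   else:
#       download_weights("VITA-MLLM/Long-VITA-128K", "./Long-VITA-128K")
```

Running the dependency check first avoids starting a large download on a machine that cannot run inference anyway.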
License
Long-VITA-128K is licensed under the Apache-2.0 license. Usage of the model is subject to compliance with the Acceptable Use Policy, which prohibits activities such as generating harmful content, violating laws, or exploiting safety vulnerabilities. The policy aims to ensure ethical and safe application of the model's capabilities.