Long-VITA-1M Model Documentation
Introduction
Long-VITA-1M is a long-context vision-language model developed by VITA-MLLM that can process over 1 million tokens and offers strong image and video understanding capabilities.
Architecture
Long-VITA is initially trained on Ascend NPUs using MindSpeed. For inference and evaluation on NVIDIA GPUs, the model is implemented with Megatron and Transformer Engine, and the converted weights are available on Hugging Face.
Training
The model is trained on the Long-VITA-Training-Data dataset. Training runs on Ascend NPUs, whose hardware acceleration helps handle the model's extended context length.
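For reference, here is a minimal sketch of fetching the training data with the huggingface_hub client; the exact repository ID (VITA-MLLM/Long-VITA-Training-Data) and local directory are assumptions, so check the Long-VITA repository for the authoritative download instructions.

```python
# Minimal sketch: download the Long-VITA training data from the Hugging Face Hub.
# The repo_id and local_dir below are assumptions; verify the actual dataset path before use.
from huggingface_hub import snapshot_download

dataset_dir = snapshot_download(
    repo_id="VITA-MLLM/Long-VITA-Training-Data",  # assumed dataset repository ID
    repo_type="dataset",
    local_dir="./Long-VITA-Training-Data",        # assumed local target directory
)
print(f"Training data downloaded to: {dataset_dir}")
```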
Guide: Running Locally
- Clone the Repository: Access the GitHub repository at Long-VITA GitHub.
- Install Dependencies: Ensure all required libraries and frameworks are installed.
- Download Model Weights: Obtain the converted model weights from Hugging Face (see the download sketch after this list).
- Run Inference: Use NVIDIA GPUs with Megatron and Transformer Engine for optimal performance.
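As a starting point, the following sketch downloads the converted Hugging Face weights. The repository ID and local directory are assumptions; consult the Long-VITA GitHub repository for the exact weight locations and the Megatron / Transformer Engine launch scripts.

```python
# Minimal sketch: fetch the converted Long-VITA-1M weights from the Hugging Face Hub.
# The repo_id and local_dir below are assumptions; replace them with the actual
# repository listed on the VITA-MLLM Hugging Face page.
from huggingface_hub import snapshot_download

weights_dir = snapshot_download(
    repo_id="VITA-MLLM/Long-VITA-1M",    # assumed model repository ID
    local_dir="./Long-VITA-1M-weights",  # assumed local target directory
)
print(f"Model weights downloaded to: {weights_dir}")

# Inference itself is launched through the Megatron / Transformer Engine scripts
# provided in the Long-VITA GitHub repository; see that repository for the exact
# launch commands and GPU requirements.
```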
Suggested Cloud GPUs: Consider using cloud services like AWS EC2 with GPU instances or Google Cloud Platform for efficient processing.
License
The Long-VITA-1M model is licensed under the Apache 2.0 License. Users must comply with the Acceptable Use Policy, which prohibits harmful activities, violations of law, and unethical use cases.