Introduction

Valley is a multimodal large model developed by ByteDance that processes text, image, and video data. It performs strongly on e-commerce and short-video benchmarks and scores well on OpenCompass relative to models of similar scale.

Architecture

The foundational Valley model is aligned with Siglip and Qwen2.5 and uses LargeMLP and ConvAdapter to build the projector. The final version adds a second VisionEncoder, the Qwen2vl VisionEncoder, which runs in parallel with the original visual tokens to improve performance in extreme scenarios.
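
The two-branch layout described above can be illustrated with a small PyTorch sketch. This is not Valley's actual code: the class ToyConvAdapter, the function combine_visual_tokens, all dimensions, and the choice to join the two branches by concatenating along the token axis are assumptions made only for illustration.

```python
import torch
from torch import nn


class ToyConvAdapter(nn.Module):
    """Hypothetical projector stand-in: an MLP, then a strided conv that shrinks the token grid."""

    def __init__(self, dim_in: int, dim_out: int, hidden: int) -> None:
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim_in, hidden), nn.GELU(), nn.Linear(hidden, dim_out)
        )
        self.conv = nn.Conv2d(dim_out, dim_out, kernel_size=3, stride=2, padding=1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, grid * grid, dim_in); a square token grid is assumed.
        batch, n, _ = tokens.shape
        grid = int(n ** 0.5)
        x = self.mlp(tokens)                                  # (batch, n, dim_out)
        x = x.transpose(1, 2).reshape(batch, -1, grid, grid)  # (batch, dim_out, grid, grid)
        x = self.conv(x)                                      # halves each side of the grid
        return x.flatten(2).transpose(1, 2)                   # (batch, n / 4, dim_out)


def combine_visual_tokens(primary: torch.Tensor, extra: torch.Tensor) -> torch.Tensor:
    """Assumed merge rule: place the second encoder's tokens after the original ones."""
    return torch.cat([primary, extra], dim=1)


if __name__ == "__main__":
    adapter = ToyConvAdapter(dim_in=32, dim_out=48, hidden=64)
    primary = adapter(torch.randn(2, 64, 32))           # 8x8 grid -> 16 tokens
    extra = nn.Linear(24, 48)(torch.randn(2, 10, 24))   # second branch, 10 tokens
    print(tuple(combine_visual_tokens(primary, extra).shape))  # (2, 26, 48)
```

In this sketch both branches project into the same embedding width, so the language model would see a single, longer sequence of visual tokens. The real model may merge the branches differently.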

Training

Training integrates the LargeMLP and ConvAdapter projector components to improve the model's multimodal processing. The additional VisionEncoder allows the number of visual tokens to be adjusted flexibly, which improves efficiency and accuracy across diverse applications.

Guide: Running Locally

  1. Environment Setup

    • Install the necessary packages using the following commands:
      pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
      pip install -r requirements.txt
      
  2. Hardware Suggestions

    • For best performance, use a cloud GPU service such as AWS, Google Cloud, or Azure that offers GPUs powerful enough for the model's computational demands.
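
After the environment setup step, it can help to confirm that the installed packages match the pinned versions from the pip command. The following is a minimal sketch using only the Python standard library. The helper names installed_version and find_mismatches are hypothetical, and the sketch assumes that CUDA wheels report a local suffix such as "+cu121", which it strips before comparing.

```python
from importlib import metadata

# Pins copied from the pip command in the environment setup step.
PINNED = {"torch": "2.4.0", "torchvision": "0.19.0", "torchaudio": "2.4.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # Compare only the base version, ignoring a local suffix such as "+cu121".
        base = found.split("+")[0] if found else None
        if base != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(PINNED)
    for package, (expected, found) in problems.items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
    if not problems:
        print("All pinned packages match.")
```

The lookup parameter exists so the comparison logic can be exercised without the packages installed, for example by passing a dictionary's get method.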

License

All open-source models are released under the Apache-2.0 license, which permits broad use and modification under its terms.
