Ross-QWEN2-7B Model Documentation

Introduction

Ross is an open-source multimodal chatbot built by fine-tuning Qwen2/Vicuna language models to follow multimodal instructions. It is an auto-regressive model based on the transformer architecture, and it adds an image reconstruction objective to improve its multimodal comprehension.

Architecture

Ross builds on two base models: Qwen2-7B-Instruct as the language model and google/siglip-so400m-patch14-384 as the vision encoder. Both are transformer-based; the vision encoder processes image input, and the language model handles text understanding and generation.
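
As a rough illustration of what the vision encoder contributes, the checkpoint name google/siglip-so400m-patch14-384 encodes a 384-pixel input resolution and a 14-pixel patch size. The sketch below derives the resulting patch grid from those two numbers. The function name is hypothetical, and whether Ross forwards every patch token to the language model unchanged is an assumption not confirmed here.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square image.

    Assumes non-overlapping patches and no padding, so border pixels that
    do not fill a whole patch are dropped (floor division).
    """
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# Values read off the checkpoint name "siglip-so400m-patch14-384".
per_side, total = patch_grid(image_size=384, patch_size=14)
print(per_side, total)  # 27 729
```

Under these assumptions, each image is represented by a 27 x 27 grid of patch embeddings.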

Training

Training data for Ross includes datasets from lmms-lab/LLaVA-OneVision-Data and nyu-visionx/Cambrian-Alignment. Training requires additional packages, which can be installed via pip (see step 3 of the guide below).

Guide: Running Locally

  1. Clone the Repository
    Clone the Ross repository from GitHub and navigate to the ross folder:

    git clone https://github.com/Haochen-Wang409/ross.git
    cd ross
    
  2. Set Up Environment
    Create a new Conda environment and install required packages:

    conda create -n ross python=3.10 -y
    conda activate ross
    pip install --upgrade pip  # Enable PEP 660 support
    pip install -e .
    
  3. Install Training Packages
    For training purposes, install additional packages:

    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation
    
  4. Usage
    Import necessary modules and load the pretrained model:

    import torch
    from PIL import Image
    from ross.model.builder import load_pretrained_model
    from ross.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
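    from ross.eval.run_llava import eval_model
    # Assumption: ross mirrors LLaVA's package layout, where IMAGE_TOKEN_INDEX lives in constants.
    from ross.constants import IMAGE_TOKEN_INDEX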
    
    model_path = "HaochenWang/ross-qwen2-7b"
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path=model_path,
        model_base=None,
        model_name=get_model_name_from_path(model_path)
    )
    
    model.cuda()
    model.eval()
    
    image = Image.open("...")
    prompt = "..."
    
    # Wrap the single image in a list: process_images takes a batch of PIL images.
    images_tensor = process_images(
        [image],
        image_processor,
        model.config,
    ).cuda()
    
    input_ids = tokenizer_image_token(
        prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt",
    ).unsqueeze(0).cuda()
    
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=images_tensor,
            do_sample=True,
            temperature=0.8,
            top_p=0.7,
            top_k=20,
            num_beams=5,
            max_new_tokens=512,
            use_cache=True,
        )
    
    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
    print(outputs)
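
Before running the usage snippet, it can help to confirm that the packages installed in steps 2 and 3 are visible to the active environment. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical, and the package lists are assumptions based on the imports and pip commands shown above.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "torch", "PIL", and "ross" are imported by the usage snippet above;
# "flash_attn" is the import name of the flash-attn package from step 3.
required = ["torch", "PIL", "ross"]
optional_for_training = ["flash_attn"]

print("missing (inference):", missing_packages(required))
print("missing (training):", missing_packages(optional_for_training))
```

An empty list means every package in that group was found. This only checks that the packages are discoverable, not that they import without errors or that a GPU is available.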
    

Cloud GPUs

For compute-intensive tasks such as training or large-scale inference, consider cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.
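
When choosing a GPU instance, a back-of-the-envelope estimate of the memory needed for the model weights is a useful starting point. The sketch below assumes a round 7 billion parameters, which is a hypothetical figure: the exact parameter count of Ross (language model plus vision encoder) is not stated in this document. It counts weights only and ignores activations, the KV cache, and optimizer state.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


# Hypothetical round figure for a "7B" model; not the measured size of Ross.
NUM_PARAMS = 7e9

for label, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, nbytes):.1f} GiB")
```

Actual usage during inference will be higher than the weights-only figure, and training requires substantially more memory.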

License

Ross is distributed under the Apache-2.0 license, which allows for open-source usage and modification.
