Apollo L M Ms Apollo 1_5 B t32

GoodiesHere

Apollo-LMMs-Apollo-1_5B-t32

Introduction

Apollo is a family of Large Multimodal Models (LMMs) designed for advanced video understanding tasks. These models excel in long-form video comprehension, temporal reasoning, complex video question-answering, and conducting multi-turn conversations based on video content. Notably, Apollo models achieve high performance with fewer parameters, outperforming many larger models.

Architecture

Apollo's architecture includes:

  • Scaling Consistency: Effective transfer of design decisions from smaller to larger models reduces computational costs.
  • Efficient Video Sampling: Utilizes frames-per-second (fps) sampling and advanced token resampling strategies like Perceiver for enhanced temporal perception.
  • Encoder Synergies: Combines SigLIP-SO400M (image) with InternVideo2 (video) for robust representation, surpassing single encoder performance on temporal tasks.
  • ApolloBench: A benchmark for evaluating true video understanding capabilities, offering a 41x faster evaluation process.

Training

Apollo models are trained to handle extensive video content, strategically balancing speed and accuracy. They excel in tasks requiring comprehensive video understanding, benefiting from advanced temporal perception techniques and encoder synergies.

Guide: Running Locally

Basic Steps

  1. Installation:

    pip install -e .
    pip install flash-attn --no-build-isolation
    
  2. Inference Example:

    import torch
    from transformers import AutoModelForCausalLM
    from apollo.mm_utils import (
        KeywordsStoppingCriteria,
        tokenizer_mm_token,
        ApolloMMLoader
    )
    from huggingface_hub import snapshot_download
    
    model_url = "Apollo-LMMs/Apollo-3B-t32"
    model_path = snapshot_download(model_url, repo_type="model")
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        trust_remote_code=True,
        low_cpu_mem_usage=True
    ).to(device=device, dtype=torch.bfloat16)
    
    tokenizer = model.tokenizer
    vision_processors = model.vision_tower.vision_processor
    config = model.config
    num_repeat_token = config.mm_connector_cfg['num_output_tokens']
    mm_processor = ApolloMMLoader(
        vision_processors,
        config.clip_duration,
        frames_per_clip=4,
        clip_sampling_ratio=0.65,
        model_max_length=config.model_max_length,
        device=device,
        num_repeat_token=num_repeat_token
    )
    
    video_path = "path/to/video.mp4"
    question = "Describe this video in detail"
    mm_data, replace_string = mm_processor.load_video(video_path)
    
    conv = conv_templates["qwen_2"].copy()
    conv.append_message(conv.roles[0], replace_string + "\n\n" + question)
    conv.append_message(conv.roles[1], None)
    
    prompt = conv.get_prompt()
    input_ids = tokenizer_mm_token(prompt, tokenizer, return_tensors="pt").unsqueeze(0).to(device)
    
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)
    
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            vision_input=[mm_data],
            data_types=['video'],
            do_sample=True,
            temperature=0.4,
            max_new_tokens=256,
            top_p=0.7,
            use_cache=True,
            num_beams=1,
            stopping_criteria=[stopping_criteria]
        )
    
    pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
    print(pred)
    

Suggest Cloud GPUs

Consider using cloud services like AWS, GCP, or Azure for access to high-performance GPUs if local resources are insufficient.

License

Apollo is licensed under the Apache License 2.0. For more information, refer to the license document.

More Related APIs in Video Text To Text