Apollo-LMMs/Apollo-7B-t32


Introduction

Apollo is a family of Large Multimodal Models (LMMs) designed for video understanding. The models target long-form video comprehension, temporal reasoning, complex video question-answering, and multi-turn video-grounded conversations, and they are reported to match or exceed larger models while using fewer parameters, striking an effective balance between speed and accuracy.

Architecture

Apollo-7B-t32 is the 7B-parameter member of the family, encoding each sampled frame into 32 visual tokens (the "t32" suffix). This compact frame representation is what lets the model process hour-long videos, and the authors report it stays competitive with models of up to 30B parameters.
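
To see why the per-frame token count matters, here is a back-of-envelope sketch of the context budget. The video_token_budget helper is hypothetical; tokens_per_frame=32 comes from the "t32" suffix, frames_per_clip=4 and clip_sampling_ratio=0.65 mirror the inference example below, and clip_duration=2.0 is an illustrative assumption (the real value is read from the model config at runtime).

    # Rough estimate of how many LLM-context tokens a video occupies.
    # All defaults except tokens_per_frame are illustrative assumptions.
    def video_token_budget(video_seconds, clip_duration=2.0,
                           frames_per_clip=4, clip_sampling_ratio=0.65,
                           tokens_per_frame=32):
        sampled_clips = int(video_seconds / clip_duration * clip_sampling_ratio)
        return sampled_clips * frames_per_clip * tokens_per_frame

    # One hour of video under these assumptions:
    print(video_token_budget(3600))  # 149760 tokens

Budgets like this quickly exceed any practical context window, which is presumably why the video loader in the example below is handed model_max_length and a clip sampling ratio to keep the token count in bounds.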

Training

Apollo models are trained to understand and interpret video content efficiently, combining multimodal pretraining with video instruction tuning. Their training recipe incorporates deliberate design decisions aimed at improving performance on video-related tasks.

Guide: Running Locally

Basic Steps

  1. Installation (run from a local clone of the Apollo code repository, since pip install -e . installs the local package)

    pip install -e .
    pip install flash-attn --no-build-isolation
    
  2. Inference Example

    import torch
    from transformers import AutoModelForCausalLM
    from apollo.mm_utils import (
        KeywordsStoppingCriteria,
        tokenizer_mm_token,
        ApolloMMLoader
    )
    from apollo.conversations import conv_templates, SeparatorStyle
    from huggingface_hub import snapshot_download
    
    # Download the weights from the Hugging Face Hub.
    model_url = "Apollo-LMMs/Apollo-7B-t32"
    model_path = snapshot_download(model_url, repo_type="model")
    
    # Load the model with its bundled custom code and run it in bfloat16.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        trust_remote_code=True,
        low_cpu_mem_usage=True
    ).to(device=device, dtype=torch.bfloat16)
    
    # The tokenizer and vision processors ship inside the checkpoint.
    tokenizer = model.tokenizer
    vision_processors = model.vision_tower.vision_processor
    config = model.config
    num_repeat_token = config.mm_connector_cfg['num_output_tokens']
    # ApolloMMLoader samples clips from the video and converts their
    # frames into the visual tokens the language model consumes.
    mm_processor = ApolloMMLoader(
        vision_processors,
        config.clip_duration,
        frames_per_clip=4,
        clip_sampling_ratio=0.65,
        model_max_length=config.model_max_length,
        device=device,
        num_repeat_token=num_repeat_token
    )
    
    # Preprocess the video; replace_string marks where the visual
    # tokens are spliced into the prompt.
    video_path = "path/to/video.mp4"
    question = "Describe this video in detail"
    mm_data, replace_string = mm_processor.load_video(video_path)
    
    # Build the prompt with the Qwen2-style chat template Apollo uses.
    conv = conv_templates["qwen_2"].copy()
    conv.append_message(conv.roles[0], replace_string + "\n\n" + question)
    conv.append_message(conv.roles[1], None)
    
    prompt = conv.get_prompt()
    input_ids = tokenizer_mm_token(prompt, tokenizer, return_tensors="pt").unsqueeze(0).to(device)
    
    # Stop generation at the template's separator string.
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)
    
    # Generate with nucleus sampling; vision_input carries the video tokens.
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            vision_input=[mm_data],
            data_types=['video'],
            do_sample=True,
            temperature=0.4,
            max_new_tokens=256,
            top_p=0.7,
            use_cache=True,
            num_beams=1,
            stopping_criteria=[stopping_criteria]
        )
    
    # Decode and print the model's answer.
    pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
    print(pred)
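
The conversation object keeps the chat history, so a follow-up turn can reuse the already-processed video. Below is a minimal multi-turn sketch assuming the conv, mm_data, stop_str, and model objects from the example above; the [role, message] layout of conv.messages is an assumption based on LLaVA-style conversation templates.

    # Record the first answer, then ask a follow-up (sketch; assumes
    # LLaVA-style [role, message] pairs in conv.messages).
    conv.messages[-1][-1] = pred
    conv.append_message(conv.roles[0], "How does the video end?")
    conv.append_message(conv.roles[1], None)
    
    input_ids = tokenizer_mm_token(conv.get_prompt(), tokenizer,
                                   return_tensors="pt").unsqueeze(0).to(device)
    stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)
    
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            vision_input=[mm_data],   # reuse the preprocessed video tokens
            data_types=['video'],
            do_sample=True,
            temperature=0.4,
            max_new_tokens=256,
            use_cache=True,
            stopping_criteria=[stopping_criteria]
        )
    print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())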
    

Cloud GPUs

For optimal performance, consider running inference and training on GPU instances from cloud providers such as AWS, Google Cloud, or Azure.

License

Apollo is licensed under the Apache 2.0 License, granting users broad rights to use, modify, and distribute the software.
