Apollo-LMMs/Apollo-7B-t32


Introduction

Apollo is a family of Large Multimodal Models (LMMs) designed for video understanding. The models target long-form video comprehension, temporal reasoning, complex video question-answering, and multi-turn video-grounded conversations, and they are reported to match or exceed larger models while using fewer parameters, striking an effective balance between speed and accuracy.

Architecture

Apollo-7B-t32 is the 7B-parameter member of the family, encoding each sampled frame into 32 visual tokens (the "t32" suffix). This compact frame representation is what lets the model process hour-long videos, and the authors report it stays competitive with models of up to 30B parameters.
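
To see why the per-frame token count matters, here is a back-of-envelope sketch of the context budget. The video_token_budget helper is hypothetical; tokens_per_frame=32 comes from the "t32" suffix, frames_per_clip=4 and clip_sampling_ratio=0.65 mirror the inference example below, and clip_duration=2.0 is an illustrative assumption (the real value is read from the model config at runtime).

    # Rough estimate of how many LLM-context tokens a video occupies.
    # All defaults except tokens_per_frame are illustrative assumptions.
    def video_token_budget(video_seconds, clip_duration=2.0,
                           frames_per_clip=4, clip_sampling_ratio=0.65,
                           tokens_per_frame=32):
        sampled_clips = int(video_seconds / clip_duration * clip_sampling_ratio)
        return sampled_clips * frames_per_clip * tokens_per_frame

    # One hour of video under these assumptions:
    print(video_token_budget(3600))  # 149760 tokens

Budgets like this quickly exceed any practical context window, which is presumably why the video loader in the example below is handed model_max_length and a clip sampling ratio to keep the token count in bounds.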

Training

Apollo models are trained to understand and interpret video content efficiently, combining multimodal pretraining with video instruction tuning. Their training recipe incorporates deliberate design decisions aimed at improving performance on video-related tasks.

Guide: Running Locally

Basic Steps

  1. Installation (run from a local clone of the Apollo code repository, since pip install -e . installs the local package)

    pip install -e .
    pip install flash-attn --no-build-isolation
    
  2. Inference Example

    import torch
    from transformers import AutoModelForCausalLM
    from apollo.mm_utils import (
        KeywordsStoppingCriteria,
        tokenizer_mm_token,
        ApolloMMLoader
    )
    from apollo.conversations import conv_templates, SeparatorStyle
    from huggingface_hub import snapshot_download
    
    # Download the weights from the Hugging Face Hub.
    model_url = "Apollo-LMMs/Apollo-7B-t32"
    model_path = snapshot_download(model_url, repo_type="model")
    
    # Load the model with its bundled custom code and run it in bfloat16.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        trust_remote_code=True,
        low_cpu_mem_usage=True
    ).to(device=device, dtype=torch.bfloat16)
    
    # The tokenizer and vision processors ship inside the checkpoint.
    tokenizer = model.tokenizer
    vision_processors = model.vision_tower.vision_processor
    config = model.config
    num_repeat_token = config.mm_connector_cfg['num_output_tokens']
    # ApolloMMLoader samples clips from the video and converts their
    # frames into the visual tokens the language model consumes.
    mm_processor = ApolloMMLoader(
        vision_processors,
        config.clip_duration,
        frames_per_clip=4,
        clip_sampling_ratio=0.65,
        model_max_length=config.model_max_length,
        device=device,
        num_repeat_token=num_repeat_token
    )
    
    # Preprocess the video; replace_string marks where the visual
    # tokens are spliced into the prompt.
    video_path = "path/to/video.mp4"
    question = "Describe this video in detail"
    mm_data, replace_string = mm_processor.load_video(video_path)
    
    # Build the prompt with the Qwen2-style chat template Apollo uses.
    conv = conv_templates["qwen_2"].copy()
    conv.append_message(conv.roles[0], replace_string + "\n\n" + question)
    conv.append_message(conv.roles[1], None)
    
    prompt = conv.get_prompt()
    input_ids = tokenizer_mm_token(prompt, tokenizer, return_tensors="pt").unsqueeze(0).to(device)
    
    # Stop generation at the template's separator string.
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)
    
    # Generate with nucleus sampling; vision_input carries the video tokens.
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            vision_input=[mm_data],
            data_types=['video'],
            do_sample=True,
            temperature=0.4,
            max_new_tokens=256,
            top_p=0.7,
            use_cache=True,
            num_beams=1,
            stopping_criteria=[stopping_criteria]
        )
    
    # Decode and print the model's answer.
    pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
    print(pred)
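
The conversation object keeps the chat history, so a follow-up turn can reuse the already-processed video. Below is a minimal multi-turn sketch assuming the conv, mm_data, stop_str, and model objects from the example above; the [role, message] layout of conv.messages is an assumption based on LLaVA-style conversation templates.

    # Record the first answer, then ask a follow-up (sketch; assumes
    # LLaVA-style [role, message] pairs in conv.messages).
    conv.messages[-1][-1] = pred
    conv.append_message(conv.roles[0], "How does the video end?")
    conv.append_message(conv.roles[1], None)
    
    input_ids = tokenizer_mm_token(conv.get_prompt(), tokenizer,
                                   return_tensors="pt").unsqueeze(0).to(device)
    stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)
    
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            vision_input=[mm_data],   # reuse the preprocessed video tokens
            data_types=['video'],
            do_sample=True,
            temperature=0.4,
            max_new_tokens=256,
            use_cache=True,
            stopping_criteria=[stopping_criteria]
        )
    print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())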
    

Cloud GPUs

For optimal performance, consider running inference and training on GPU instances from cloud providers such as AWS, Google Cloud, or Azure.

License

Apollo is licensed under the Apache 2.0 License, granting users broad rights to use, modify, and distribute the software.
