Apollo-LMMs/Apollo-7B-t32
Introduction
Apollo is a series of Large Multimodal Models (LMMs) designed for advanced video understanding. The models excel at long-form video comprehension, temporal reasoning, complex video question answering, and multi-turn video-grounded conversation, and they reach this performance with fewer parameters than comparable models, balancing speed and accuracy.
Architecture
Apollo-7B-t32 pairs a 7B-parameter language model with 32 visual tokens per frame (the "t32" suffix), optimized for processing extensive video content. The model is designed to handle hour-long videos and competes with models of up to 30B parameters.
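To make the 32-tokens-per-frame budget concrete, here is a back-of-the-envelope estimate of the visual-token cost of an hour of video. The clip duration and sampling ratio are illustrative assumptions (borrowed from the inference example below), not published Apollo settings:

# Rough visual-token budget for a long video (illustrative numbers only).
TOKENS_PER_FRAME = 32       # the "t32" in Apollo-7B-t32
FRAMES_PER_CLIP = 4         # matches the inference example below
CLIP_DURATION_S = 2         # assumed clip length in seconds
CLIP_SAMPLING_RATIO = 0.65  # assumed fraction of clips actually sampled

video_seconds = 60 * 60  # one hour of video
num_clips = int(video_seconds / CLIP_DURATION_S * CLIP_SAMPLING_RATIO)
visual_tokens = num_clips * FRAMES_PER_CLIP * TOKENS_PER_FRAME
print(f"{num_clips} clips -> {visual_tokens:,} visual tokens")
# ~1,170 clips -> ~150k visual tokens at these settings, which is why the
# video loader also has to respect the model's maximum context length.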
Training
Apollo models are trained for efficient video understanding using multimodal pretraining and instruction tuning, and they incorporate deliberate design decisions, such as the per-frame token budget and clip-based sampling, to improve performance on video tasks.
Guide: Running Locally
Basic Steps
Installation

From the root of the Apollo repository:

pip install -e .
pip install flash-attn --no-build-isolation
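After installing, a quick sanity check that the key dependencies import cleanly; a minimal sketch, assuming only that torch and flash_attn expose version strings:

# Verify the environment before downloading the model.
import torch
import flash_attn

print("torch:", torch.__version__)
print("flash-attn:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())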
Inference Example

import torch
from transformers import AutoModelForCausalLM
from apollo.mm_utils import (
    KeywordsStoppingCriteria,
    tokenizer_mm_token,
    ApolloMMLoader,
)
from apollo.conversations import conv_templates, SeparatorStyle
from huggingface_hub import snapshot_download

# Download the checkpoint and load the model with its custom remote code.
model_url = "Apollo-LMMs/Apollo-7B-t32"
model_path = snapshot_download(model_url, repo_type="model")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
).to(device=device, dtype=torch.bfloat16)

tokenizer = model.tokenizer
vision_processors = model.vision_tower.vision_processor
config = model.config
num_repeat_token = config.mm_connector_cfg['num_output_tokens']

# The loader samples clips from the video and converts them into visual tokens.
mm_processor = ApolloMMLoader(
    vision_processors,
    config.clip_duration,
    frames_per_clip=4,
    clip_sampling_ratio=0.65,
    model_max_length=config.model_max_length,
    device=device,
    num_repeat_token=num_repeat_token,
)

video_path = "path/to/video.mp4"
question = "Describe this video in detail"
mm_data, replace_string = mm_processor.load_video(video_path)

# Build the chat prompt; replace_string marks where the video tokens are inserted.
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], replace_string + "\n\n" + question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_mm_token(prompt, tokenizer, return_tensors="pt").unsqueeze(0).to(device)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        vision_input=[mm_data],
        data_types=['video'],
        do_sample=True,
        temperature=0.4,
        max_new_tokens=256,
        top_p=0.7,
        use_cache=True,
        num_beams=1,
        stopping_criteria=[stopping_criteria],
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
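Because Apollo supports multi-turn video-grounded conversation, the script above extends naturally to follow-up questions. The sketch below reuses the conv, mm_data, and generation settings already defined; it assumes the llava-style convention that conv.messages stores [role, text] pairs, and the follow-up question is hypothetical.

# Record the first answer, then ask a follow-up in the same conversation.
conv.messages[-1][-1] = pred  # assumes [role, text] message pairs (llava-style)

def ask(follow_up):
    conv.append_message(conv.roles[0], follow_up)
    conv.append_message(conv.roles[1], None)
    ids = tokenizer_mm_token(
        conv.get_prompt(), tokenizer, return_tensors="pt"
    ).unsqueeze(0).to(device)
    stopping = KeywordsStoppingCriteria([stop_str], tokenizer, ids)
    with torch.inference_mode():
        out = model.generate(
            ids,
            vision_input=[mm_data],
            data_types=['video'],
            do_sample=True,
            temperature=0.4,
            max_new_tokens=256,
            use_cache=True,
            stopping_criteria=[stopping],
        )
    answer = tokenizer.batch_decode(out, skip_special_tokens=True)[0].strip()
    conv.messages[-1][-1] = answer  # keep the history consistent for the next turn
    return answer

print(ask("What happens at the end of the video?"))  # hypothetical follow-up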
Cloud GPUs
For best performance, consider running inference and training on cloud GPU instances from providers such as AWS, Google Cloud, or Azure.
License
Apollo is licensed under the Apache 2.0 License, granting users broad rights to use, modify, and distribute the software.