Apollo-LMMs/Apollo-1_5B-t32
Introduction
Apollo is a family of Large Multimodal Models (LMMs) designed for advanced video understanding tasks. These models excel in long-form video comprehension, temporal reasoning, complex video question-answering, and conducting multi-turn conversations based on video content. Notably, Apollo models achieve high performance with fewer parameters, outperforming many larger models.
Architecture
Apollo's architecture includes:
- Scaling Consistency: Design decisions validated on smaller models transfer reliably to larger ones, reducing the computational cost of architecture exploration.
- Efficient Video Sampling: Uses frames-per-second (fps) sampling rather than uniform frame-count sampling, combined with token resampling strategies such as a Perceiver resampler, for stronger temporal perception (see the sketch after this list).
- Encoder Synergies: Combines SigLIP-SO400M (image) with InternVideo2 (video) for robust representations, surpassing any single encoder on temporal tasks.
- ApolloBench: A curated benchmark for evaluating true video understanding, offering a 41x faster evaluation process than existing suites.
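The idea behind fps sampling can be illustrated with a minimal sketch. This is not Apollo's implementation; the function name and parameters below are hypothetical, showing only the core idea of sampling frames at a fixed temporal rate rather than a fixed count:

```python
import numpy as np

def sample_frame_indices(video_fps: float, total_frames: int,
                         target_fps: float = 2.0, max_frames: int = 128):
    """Pick frame indices at a fixed temporal rate (hypothetical helper).

    Uniform frame-*count* sampling stretches or compresses time depending
    on video length; fps sampling keeps the interval between sampled
    frames constant, which preserves the video's temporal structure.
    """
    step = video_fps / target_fps              # source frames per sampled frame
    indices = np.arange(0, total_frames, step).astype(int)
    if len(indices) > max_frames:              # cap the token budget for long videos
        indices = indices[:max_frames]
    return indices

# A 30-second clip at 30 fps, sampled at 2 fps -> 60 frames, 0.5s apart
print(sample_frame_indices(video_fps=30.0, total_frames=900))
```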
Training
Apollo models are trained on long video content with sampling strategies that balance speed against accuracy. They excel in tasks requiring comprehensive video understanding, benefiting from fps-based temporal sampling and from combining the image and video encoders, as sketched below.
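A rough sketch of how features from the two encoders might be fused; the module, dimensions, and the projection-then-concatenation scheme here are assumptions for illustration, not Apollo's released code:

```python
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    """Hypothetical sketch: fuse per-frame image features (SigLIP-style)
    with clip-level video features (InternVideo2-style) by projecting both
    into a shared dimension and concatenating along the token axis."""

    def __init__(self, img_dim=1152, vid_dim=768, hidden_dim=1024):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.vid_proj = nn.Linear(vid_dim, hidden_dim)

    def forward(self, img_tokens, vid_tokens):
        # img_tokens: (B, N_img, img_dim); vid_tokens: (B, N_vid, vid_dim)
        fused = torch.cat(
            [self.img_proj(img_tokens), self.vid_proj(vid_tokens)], dim=1
        )
        return fused  # (B, N_img + N_vid, hidden_dim)

fusion = DualEncoderFusion()
out = fusion(torch.randn(1, 64, 1152), torch.randn(1, 32, 768))
print(out.shape)  # torch.Size([1, 96, 1024])
```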
Guide: Running Locally
Basic Steps
- Installation:

```sh
# run from the root of the Apollo source checkout
pip install -e .
pip install flash-attn --no-build-isolation
```
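To confirm that flash-attn compiled and imports correctly, a quick optional sanity check (not part of the official instructions):

```python
# Optional: verify the flash-attn build before loading the model
import flash_attn
print(flash_attn.__version__)
```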
- Inference Example:
```python
import torch
from transformers import AutoModelForCausalLM
from apollo.mm_utils import (
    KeywordsStoppingCriteria,
    tokenizer_mm_token,
    ApolloMMLoader
)
# conv_templates and SeparatorStyle ship with the Apollo repo code;
# the module path below is assumed
from apollo.conversation import conv_templates, SeparatorStyle
from huggingface_hub import snapshot_download

# Download this card's checkpoint from the Hugging Face Hub
model_url = "Apollo-LMMs/Apollo-1_5B-t32"
model_path = snapshot_download(model_url, repo_type="model")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    low_cpu_mem_usage=True
).to(device=device, dtype=torch.bfloat16)

tokenizer = model.tokenizer
vision_processors = model.vision_tower.vision_processor
config = model.config
num_repeat_token = config.mm_connector_cfg['num_output_tokens']

# Loader that samples clips from the video and turns them into vision tokens
mm_processor = ApolloMMLoader(
    vision_processors,
    config.clip_duration,
    frames_per_clip=4,
    clip_sampling_ratio=0.65,
    model_max_length=config.model_max_length,
    device=device,
    num_repeat_token=num_repeat_token
)

video_path = "path/to/video.mp4"
question = "Describe this video in detail"
mm_data, replace_string = mm_processor.load_video(video_path)

# Build the chat prompt; replace_string marks where the video tokens go
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], replace_string + "\n\n" + question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_mm_token(prompt, tokenizer, return_tensors="pt").unsqueeze(0).to(device)

# Stop generation when the template's separator token appears
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        vision_input=[mm_data],
        data_types=['video'],
        do_sample=True,
        temperature=0.4,
        max_new_tokens=256,
        top_p=0.7,
        use_cache=True,
        num_beams=1,
        stopping_criteria=[stopping_criteria]
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
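Since Apollo supports multi-turn conversation over a video, a follow-up turn can reuse the already-loaded video tokens. A minimal sketch, assuming the conversation template accumulates turns as the code above suggests; the follow-up question is illustrative:

```python
# Multi-turn follow-up: rebuild the conversation with the previous answer
# and a new question, reusing mm_data loaded above.
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], replace_string + "\n\n" + question)
conv.append_message(conv.roles[1], pred)  # the model's first answer
conv.append_message(conv.roles[0], "How does the video end?")
conv.append_message(conv.roles[1], None)

input_ids = tokenizer_mm_token(conv.get_prompt(), tokenizer,
                               return_tensors="pt").unsqueeze(0).to(device)
# Rebuild the stopping criteria for the new, longer prompt
stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        vision_input=[mm_data],      # reuse the video tokens loaded earlier
        data_types=['video'],
        do_sample=True,
        temperature=0.4,
        max_new_tokens=256,
        top_p=0.7,
        use_cache=True,
        stopping_criteria=[stopping_criteria]
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```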
Cloud GPUs
Consider using cloud services like AWS, GCP, or Azure for access to high-performance GPUs if local resources are insufficient.
License
Apollo is licensed under the Apache License 2.0. For more information, refer to the license document.