CogVideoX-2B
Introduction
CogVideoX-2B is an open-source text-to-video model developed by the Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University (THUDM). It generates videos from textual prompts using a diffusion-based pipeline. The model supports English input and is compatible with a range of precisions and GPU configurations to balance performance and resource usage.
Architecture
The model consists of several key components, including a text encoder, a Transformer for video generation, and a Variational Autoencoder (VAE). It supports multiple precision modes like FP16, BF16, and INT8 to balance performance and memory usage. The architecture allows for both single and multi-GPU configurations, optimizing VRAM consumption and inference speed.
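As a rough illustration of how these components surface in the diffusers API, the sketch below loads the pipeline and reports each major module with its parameter count and dtype. It is a minimal sketch that assumes the standard CogVideoXPipeline attribute names (text_encoder, transformer, vae); exact sizes and dtypes depend on the checkpoint and the precision chosen at load time.

import torch
from diffusers import CogVideoXPipeline

# Load the pipeline in FP16; BF16 or FP32 trade memory for numerical range.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)

# Inspect the main components: text encoder, video transformer, and VAE.
for name in ("text_encoder", "transformer", "vae"):
    module = getattr(pipe, name)
    n_params = sum(p.numel() for p in module.parameters()) / 1e6
    dtype = next(module.parameters()).dtype
    print(f"{name}: {type(module).__name__}, {n_params:.0f}M parameters, {dtype}")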
Training
CogVideoX-2B is trained primarily in FP16 precision, while the more advanced CogVideoX-5B uses BF16. The model's training process is optimized for compatibility and low-cost usage, making it suitable for both development and production environments. The training setup encourages fine-tuning to adapt to specific needs, with various precision settings available to balance resource consumption and performance.
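In practice this mainly affects the dtype passed at load time. The lines below are a minimal sketch of matching the load dtype to each variant's training precision; the 5B line is commented out and its repository id is assumed to follow the same naming scheme as the 2B model.

import torch
from diffusers import CogVideoXPipeline

# Match the load dtype to the precision each variant was trained in.
pipe_2b = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
# 5B variant (assumed repo id), trained in BF16:
# pipe_5b = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)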
Guide: Running Locally
- Install Dependencies: Ensure you have the necessary Python packages installed.
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
- Run the Model: Use the following Python code to generate a video from a text prompt.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. ..."

# Load the pipeline in FP16 and enable memory optimizations to reduce VRAM usage.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

# Generate 49 frames with a fixed seed, then write them out at 8 fps.
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]
export_to_video(video, "output.mp4", fps=8)
- Cloud GPU Recommendation: Use a cloud service such as Google Colab with a T4 GPU or higher for optimal performance, and ensure the VRAM optimization settings above are applied to manage resource consumption efficiently; a quick way to confirm the available GPU and memory is sketched below.
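Before launching the pipeline, it can be useful to confirm which GPU the runtime exposes and how much VRAM it offers. The following minimal sketch uses standard torch.cuda calls for that check.

import torch

# Report the detected GPU and its total memory before running the pipeline.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA GPU detected; CPU-only inference will be very slow.")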
License
The CogVideoX-2B model is released under the Apache 2.0 License, permitting use, modification, and distribution under its terms. The CogVideoX-5B model is governed by the CogVideoX LICENSE.