CogVideoX-2b

THUDM

Introduction

CogVideoX-2B is an open-source text-to-video model developed by the Knowledge Engineering Group (KEG) & Data Mining group at Tsinghua University. It generates videos from text prompts using a diffusion-based pipeline. The model accepts English prompts and can run at several precisions and on several GPU configurations, so users can trade memory use against speed.

Architecture

The model consists of several key components, including a text encoder, a Transformer for video generation, and a Variational Autoencoder (VAE). It supports multiple precision modes like FP16, BF16, and INT8 to balance performance and memory usage. The architecture allows for both single and multi-GPU configurations, optimizing VRAM consumption and inference speed.
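To illustrate why the choice of precision matters for memory, the sketch below estimates the storage needed for model weights alone at each precision. The parameter count of two billion is an assumption read off the "2B" in the model name, and the function name `weight_memory_gb` is hypothetical. The estimate ignores activations and any intermediate buffers, so real VRAM use will differ.

```python
# Bytes needed to store one parameter at each precision mentioned above.
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Assumption: "2B" is read as roughly two billion parameters.
ASSUMED_PARAMS = 2e9

for precision in ("fp16", "bf16", "int8"):
    gb = weight_memory_gb(ASSUMED_PARAMS, precision)
    print(f"{precision}: ~{gb:.1f} GB of weights")
```

Under these assumptions, FP16 and BF16 need the same amount of weight storage, and INT8 halves it.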

Training

CogVideoX-2B is trained primarily in FP16 precision, while the larger CogVideoX-5B uses BF16. The 2B model is aimed at compatibility and low-cost use, which makes it suitable for both development and production environments. It can be fine-tuned for specific needs, and the available precision settings let users balance resource consumption against performance.
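One common approach is to load each checkpoint in the precision it was trained in. The sketch below records that mapping in a small lookup. The helper name `dtype_name_for` is hypothetical, and the repository id `THUDM/CogVideoX-5b` is an assumption based on the naming of the 2B checkpoint.

```python
# Training precision per checkpoint, as described in the paragraph above.
# The 5B repository id is an assumption modelled on the 2B id.
TRAINING_PRECISION = {
    "THUDM/CogVideoX-2b": "float16",
    "THUDM/CogVideoX-5b": "bfloat16",
}


def dtype_name_for(model_id: str) -> str:
    """Return the name of the torch dtype matching the training precision."""
    try:
        return TRAINING_PRECISION[model_id]
    except KeyError:
        raise ValueError(f"No known training precision for {model_id!r}") from None


print(dtype_name_for("THUDM/CogVideoX-2b"))
```

With PyTorch installed, `getattr(torch, dtype_name_for(model_id))` yields the dtype object that can be passed as `torch_dtype` when loading the pipeline.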

Guide: Running Locally

  1. Install Dependencies: Ensure you have the necessary Python packages installed.

    pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
    
  2. Run the Model: Use the following Python code to generate a video from a text prompt.

    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video
    
    prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. ..."
    pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
    
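    # Reduce VRAM use. Sequential offload moves submodules to the GPU one at a
    # time; pipe.enable_model_cpu_offload() is a faster alternative that needs
    # more VRAM. Enable one offload mode, not both.
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()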
    
    video = pipe(
        prompt=prompt,
        num_videos_per_prompt=1,
        num_inference_steps=50,
        num_frames=49,
        guidance_scale=6,
        generator=torch.Generator(device="cuda").manual_seed(42),
    ).frames[0]
    
    export_to_video(video, "output.mp4", fps=8)
    
  3. Cloud GPU Recommendation: Use a cloud service such as Google Colab with a T4 GPU or better. Apply the VRAM optimizations from step 2 (CPU offload, VAE slicing, and VAE tiling) to keep memory use within the GPU's limits.
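The `num_frames` and `fps` values in step 2 together determine how long the exported clip plays. The sketch below works that out; the function name `clip_duration_seconds` is hypothetical and is not part of diffusers.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Length of a clip that plays num_frames frames at fps frames per second."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# Values taken from the example call in step 2.
print(clip_duration_seconds(num_frames=49, fps=8))  # 6.125
```

With the settings shown, the output video runs for about six seconds.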

License

The CogVideoX-2B model is released under the Apache 2.0 License, permitting use, modification, and distribution under its terms. The CogVideoX-5B model is governed by the CogVideoX LICENSE.
