CogVideoX-5B
THUDM/CogVideoX-5B
Introduction
CogVideoX-5B is a text-to-video generation model developed by the Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University (THUDM). Its weights are openly released, and it provides high-quality video generation and visual effects. The model supports English prompts and produces videos at a resolution of 720x480 pixels.
Architecture
CogVideoX-5B uses a diffusion architecture comprising a text encoder, a Transformer, and a Variational Autoencoder (VAE). Inference is supported in BF16, FP16, and FP32 precision, and VRAM usage can be reduced through diffusers optimizations and quantization techniques.
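All of these components ship in a single diffusers pipeline and can be inspected directly. The following is a minimal sketch (using the same CogVideoXPipeline API as the guide below) that loads the model in a chosen precision and lists the three modules:
import torch
from diffusers import CogVideoXPipeline

# Load the pipeline in the desired inference precision; BF16 is recommended,
# FP16 and FP32 are also supported.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,  # or torch.float16 / torch.float32
)

# The three components described above are exposed as pipeline attributes.
print(type(pipe.text_encoder).__name__)  # text encoder
print(type(pipe.transformer).__name__)   # diffusion Transformer
print(type(pipe.vae).__name__)           # Variational Autoencoder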
Training
The model is trained in BF16 precision, which is also the recommended precision for inference. Fine-tuning is supported via LoRA and SFT, both of which require substantial VRAM. Prompts are limited to 226 tokens, and generated videos are 6 seconds long at 8 frames per second.
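To check a prompt against the 226-token limit before generation, the pipeline's tokenizer can be loaded on its own. This is a sketch that assumes the standard diffusers repository layout, where the tokenizer lives in a "tokenizer" subfolder:
from transformers import AutoTokenizer

# Load only the T5 tokenizer from the model repository; the "tokenizer"
# subfolder name is assumed from the usual diffusers layout.
tokenizer = AutoTokenizer.from_pretrained("THUDM/CogVideoX-5b", subfolder="tokenizer")

prompt = "A panda, dressed in a small, red jacket, sits in a bamboo forest"
num_tokens = len(tokenizer(prompt).input_ids)
print(f"{num_tokens} tokens (limit: 226)")

# 49 frames at 8 fps is roughly 6 seconds of video (49 / 8 ~ 6.1 s),
# which is where the num_frames=49 in the guide below comes from.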
Guide: Running Locally
- Install Dependencies:
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
- Run the Code:
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest..."

# Load the pipeline in BF16, the recommended inference precision.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
)

# Reduce peak VRAM usage.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

# Generate 49 frames (~6 seconds at 8 fps) with a fixed seed for reproducibility.
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
- Use Cloud GPUs:
Consider cloud GPUs such as the NVIDIA A100 or H100 for large-scale or time-sensitive workloads. Smaller cards can still work with the additional memory-saving options sketched below.
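These options are exposed on the diffusers pipeline itself. The following is a minimal sketch of how they combine, as a starting point rather than a definitive low-VRAM recipe:
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)

# Sequential offload moves one submodule at a time to the GPU: slower than
# enable_model_cpu_offload(), but with a much lower VRAM peak. Use one
# offload mode or the other, not both.
pipe.enable_sequential_cpu_offload()

# Slice and tile VAE decoding to cap decoder memory at the cost of speed.
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()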
License
The model is released under the CogVideoX LICENSE. For more details, refer to the license document.