CogVideoX-5B-I2V

THUDM

Introduction

CogVideoX-5B-I2V is an open-source image-to-video generation model developed by the Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University (THUDM). It uses a diffusion model with an expert transformer to turn a still image and an English text prompt into a video. The model can run in several precisions, including FP16, BF16, and INT8, and is optimized for NVIDIA GPU architectures.

Architecture

CogVideoX models come in several sizes, trading off video generation quality against resource requirements. CogVideoX-5B-I2V is the image-to-video variant and supports BF16 and FP16 precision. The pipeline is built from a text encoder, a diffusion transformer, and a VAE (Variational Autoencoder); individual components can be quantized to fit lower-memory GPUs. The architecture also supports memory optimizations such as sequential CPU offloading and VAE slicing and tiling to reduce peak memory during inference.
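
For example, the transformer, the largest component, can be loaded in INT8 to fit lower-memory GPUs. The following is a minimal sketch, assuming a recent diffusers release with bitsandbytes quantization support (`pip install bitsandbytes`); other quantization backends work similarly.

    import torch
    from diffusers import (
        BitsAndBytesConfig,
        CogVideoXImageToVideoPipeline,
        CogVideoXTransformer3DModel,
    )
    
    # Load only the transformer in INT8; the text encoder and VAE stay in BF16.
    transformer = CogVideoXTransformer3DModel.from_pretrained(
        "THUDM/CogVideoX-5b-I2V",
        subfolder="transformer",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        torch_dtype=torch.bfloat16,
    )
    
    # Assemble the pipeline around the quantized transformer.
    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        "THUDM/CogVideoX-5b-I2V",
        transformer=transformer,
        torch_dtype=torch.bfloat16,
    )

The VAE slicing and tiling calls shown in the guide below still apply on top of this.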

Training

The CogVideoX-2B model was trained in FP16 precision, while the CogVideoX-5B models use BF16. Training is optimized for NVIDIA's Ampere architecture and above, with multi-GPU fine-tuning supported via DeepSpeed ZeRO-2. The model can be fine-tuned with LoRA (Low-Rank Adaptation) or SFT (Supervised Fine-Tuning), with GPU memory requirements depending on batch size and precision.
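
A LoRA produced by such a fine-tuning run can be attached at inference time. The sketch below is illustrative, assuming a diffusers release with LoRA-loading support for CogVideoX pipelines; the adapter repository id is hypothetical.

    import torch
    from diffusers import CogVideoXImageToVideoPipeline
    
    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
    )
    # Hypothetical adapter id -- substitute the output of your own LoRA run.
    pipe.load_lora_weights("your-username/cogvideox-5b-i2v-lora", adapter_name="custom")
    pipe.set_adapters(["custom"], adapter_weights=[0.9])  # optionally scale the adapter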

Guide: Running Locally

  1. Install Dependencies:

    pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
    
  2. Run the Code:

    import torch
    from diffusers import CogVideoXImageToVideoPipeline
    from diffusers.utils import export_to_video, load_image
    
    prompt = "A little girl is riding a bicycle at high speed. Focused, detailed, realistic."
    image = load_image("input.jpg")  # conditioning image used as the first frame
    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        "THUDM/CogVideoX-5b-I2V",
        torch_dtype=torch.bfloat16
    )
    
    # Memory optimizations: stream weights between CPU and GPU and decode the
    # VAE output in slices/tiles. Do not call pipe.to("cuda") when offloading.
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_tiling()
    pipe.vae.enable_slicing()
    
    video = pipe(
        prompt=prompt,
        image=image,
        num_videos_per_prompt=1,   # videos generated per prompt
        num_inference_steps=50,    # denoising steps; more steps, slower but finer
        num_frames=49,             # 49 frames at 8 fps is roughly a 6-second clip
        guidance_scale=6,          # classifier-free guidance strength
        generator=torch.Generator(device="cuda").manual_seed(42),  # reproducible seed
    ).frames[0]
    
    export_to_video(video, "output.mp4", fps=8)
    
  3. Suggested Cloud GPUs:

    • NVIDIA A100 or H100 for optimal performance.
    • A free Colab T4 or other lower-memory GPU, using the quantization and memory optimizations described above (see the sketch below).
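
T4-class (Turing) GPUs have no native BF16 support, so FP16 is typically used there instead of the recommended BF16. A minimal sketch for such hardware:

    import torch
    from diffusers import CogVideoXImageToVideoPipeline
    
    # FP16 variant for pre-Ampere GPUs such as the T4 (no native BF16);
    # output quality may differ slightly from the recommended BF16.
    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.float16
    )
    pipe.enable_sequential_cpu_offload()  # helps fit within ~16 GB of VRAM
    pipe.vae.enable_tiling()
    pipe.vae.enable_slicing()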

License

The CogVideoX-5B-I2V model is distributed under the CogVideoX License. The full license text is available in the model repository.
