CogVideoX1.5-5B-I2V Model

Introduction

CogVideoX1.5-5B-I2V is an open-source image-to-video generation model developed by the Knowledge Engineering Group at Tsinghua University. It belongs to the CogVideoX series and generates a video from an input image guided by a text prompt. The model accepts English prompts, is optimized for generating videos at a resolution of 1360x768, and offers several precision and memory configurations for efficient inference.

Architecture

The model consists of a text encoder, a transformer, and a VAE, and is run through the Hugging Face diffusers library. Inference is supported in BF16, FP16, and FP32, and each precision mode has its own GPU memory requirement and inference speed. Tools such as PyTorchAO (torchao) and Optimum-quanto can quantize the model to reduce VRAM usage and, in some setups, shorten processing time.
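
As a rough illustration of why the precision mode matters for GPU memory, the sketch below estimates the size of the weights alone at different precisions. The parameter count of about 5 billion is an assumption inferred from the "5B" in the model name, and the helper `weight_memory_gib` is hypothetical. The estimate ignores activations and any runtime overhead, so actual VRAM usage will differ.

```python
# Approximate storage cost per parameter. "int8" stands in for a quantized
# variant; it is not one of the native precision modes listed above.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Return the approximate size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


if __name__ == "__main__":
    # Assumption: roughly 5e9 parameters, inferred from "5B" in the model name.
    for precision in ("fp32", "bf16", "int8"):
        print(f"{precision}: {weight_memory_gib(5e9, precision):.1f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is the main reason BF16 and quantized variants fit on smaller GPUs.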

Training

CogVideoX accommodates multiple precision levels and VRAM optimizations at inference time. It can run on a single GPU or across multiple GPUs, with reference configurations for NVIDIA A100 and H100. Sequential CPU offloading, VAE slicing, and VAE tiling lower peak memory use, generally at the cost of slower inference. Quantization reduces memory requirements further, which lets the model run on GPUs with less VRAM.
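
The trade-off between memory and speed can be sketched as a small configuration helper. `apply_memory_optimizations` and its `low_vram` flag are hypothetical names introduced for illustration. The helper only calls methods that appear in the usage example in this document, plus the standard `pipe.to(device)` of diffusers pipelines.

```python
def apply_memory_optimizations(pipe, low_vram: bool = True, device: str = "cuda"):
    """Configure a diffusers-style video pipeline for constrained or ample VRAM.

    Hypothetical helper, not part of diffusers.
    """
    if low_vram:
        # Move submodules to the GPU only while they are needed.
        # This lowers peak VRAM but slows inference.
        pipe.enable_sequential_cpu_offload()
    else:
        # Keep the whole pipeline resident on the GPU for faster inference.
        pipe.to(device)
    # Decode the latent video in tiles and slices to cap peak VAE memory.
    pipe.vae.enable_tiling()
    pipe.vae.enable_slicing()
    return pipe
```

On a GPU with ample memory, passing `low_vram=False` skips offloading; the VAE tiling and slicing calls are kept in both cases because video decoding is memory-intensive.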

Guide: Running Locally

Install Dependencies:

pip install git+https://github.com/huggingface/diffusers
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
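
After installing, it can help to confirm that the packages are actually visible to the Python environment you will run the model from. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is an illustration, and the package list simply mirrors the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the pip commands above.
REQUIRED = ["diffusers", "transformers", "accelerate", "imageio-ffmpeg"]


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, found in installed_versions().items():
        print(f"{name}: {found or 'NOT INSTALLED'}")
```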

Run the Code:

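import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Text prompt describing the motion, and the image used as the first frame.
prompt = "A little girl is riding a bicycle at high speed. Focused, detailed, realistic."
image = load_image(image="input.jpg")

# Load the pipeline weights in BF16.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V",
    torch_dtype=torch.bfloat16
)

# Memory optimizations: offload submodules to the CPU when idle,
# and decode the video with the VAE in tiles and slices.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

# Generate one video; the fixed seed makes the result reproducible.
video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

# Write the generated frames to an MP4 file.
export_to_video(video, "output.mp4", fps=8)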
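
The length of the exported clip follows from `num_frames` and the `fps` passed to `export_to_video`. The helper below is a hypothetical illustration of that arithmetic and is not part of diffusers.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Return the playback duration of a clip with the given frame count and rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


if __name__ == "__main__":
    # Values from the example above: 81 frames exported at 8 frames per second.
    print(clip_duration_seconds(81, 8))
```

With the values used above (81 frames at 8 fps), the clip plays for a little over ten seconds. Exporting the same frames at a higher `fps` shortens the clip and speeds up the motion rather than adding detail.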

Suggested Cloud GPUs: Consider NVIDIA A100 or H100 GPUs for best performance. Both support BF16; hardware FP8 support is available on the H100 but not on the A100.

License

The CogVideoX1.5-5B-I2V model is released under the CogVideoX LICENSE. For the full terms, refer to the license published with the model.