stabilityai/stable-video-diffusion-img2vid-xt

Stable Video Diffusion Image-to-Video Model

Introduction

Stable Video Diffusion (SVD) Image-to-Video is a diffusion model by Stability AI that generates short video clips from a single conditioning image. This XT variant produces 25 frames at a resolution of 576x1024 pixels and is finetuned from the earlier 14-frame version, improving temporal consistency.

Architecture

The model is a latent video diffusion model. Its f8-decoder is finetuned for temporal consistency, so decoded frames remain coherent across time; a standard frame-wise decoder is also provided for convenience.

Training

The model was trained using approximately 200,000 A100 80GB GPU-hours, with the majority of the training conducted on clusters of 48 x 8 A100 GPUs. Training focused on optimizing video generation quality while maintaining efficiency, and generative quality was assessed through extensive human evaluation on third-party platforms such as Amazon Mechanical Turk.

Guide: Running Locally

  1. Setup Environment: Ensure you have a Python environment ready and install necessary dependencies from the Stability AI GitHub repository.
  2. Download Model: Clone the repository and download the model weights from Hugging Face.
  3. Run Inference: Use the provided scripts to input an image and generate a video.
  4. GPU Recommendations: For optimal performance, consider using cloud GPUs such as NVIDIA A100.

License

The model is provided under the "stable-video-diffusion-community" license, which permits non-commercial and research use subject to the restrictions in Stability AI's Acceptable Use Policy. For commercial applications, licensing details are available on Stability AI's license page.
