stabilityai/stable-video-diffusion-img2vid
Introduction
Stable Video Diffusion (SVD) Image-to-Video is a diffusion model developed by Stability AI that generates short videos from a single input image. The model is intended for research purposes, such as studying generative models and their biases, and for artistic applications.
Architecture
The SVD model is a latent diffusion model that generates 14 video frames from a single conditioning image at a resolution of 576x1024 pixels. It employs an f8-decoder fine-tuned for temporal consistency; a standard frame-wise decoder is also provided as an alternative.
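As an illustration of this image-conditioned generation flow, here is a minimal sketch using Hugging Face's diffusers library, which ships a StableVideoDiffusionPipeline for this checkpoint. The input image path and output filename are placeholders.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the img2vid checkpoint in half precision.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Condition on a single image, resized to the model's native 1024x576 resolution.
image = load_image("input.png")  # placeholder path
image = image.resize((1024, 576))

# Generate 14 frames. decode_chunk_size controls how many frames the decoder
# processes at once; lower values reduce peak VRAM.
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```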
Training
The SVD model was trained on a diverse dataset, filtered for safety and quality using in-house methods. Training consumed approximately 200,000 A100 GPU hours. Evaluation included human quality assessments run through platforms such as Amazon Mechanical Turk, along with extensive safety and trustworthiness validation.
Guide: Running Locally
- Set Up Environment: Clone the repository from Stability AI's GitHub and install the necessary dependencies.
- Download Model: Obtain the model files from the Hugging Face model page.
- Run Inference: Use the provided scripts to input an image and generate video frames (the diffusers sketch under Architecture shows an equivalent flow).
- Optimization: Adjust settings for memory and speed optimizations if needed; see the sketch after this list.
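If VRAM is constrained, diffusers exposes a few switches for trading speed against memory. The following continues the sketch above and assumes `pipe` and `image` are defined as there.

```python
# Offload sub-models to CPU, moving each to the GPU only when it is needed
# (requires the accelerate package).
pipe.enable_model_cpu_offload()

# Chunk the UNet's feed-forward layers to lower peak memory;
# slower, but helps avoid out-of-memory errors on smaller GPUs.
pipe.unet.enable_forward_chunking()

# Decode fewer frames at a time; decode_chunk_size=2 is a conservative choice.
frames = pipe(image, decode_chunk_size=2).frames[0]
```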
For optimal performance, high-memory GPUs such as the A100 are recommended; these are available through cloud platforms like AWS and Google Cloud.
License
The model is licensed under the Stable Video Diffusion Community License. For commercial use, refer to Stability AI's license page. Usage must adhere to Stability AI's Acceptable Use Policy.