Stable Video Diffusion img2vid-xt
Introduction
The Stable Video Diffusion (SVD) Image-to-Video model is a latent diffusion model that generates short video clips from a single still image used as a conditioning frame. It produces 25 frames at a resolution of 576x1024, and its f8-decoder is fine-tuned for temporal consistency.
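For illustration only (not part of the model card), here is a minimal inference sketch. It assumes the checkpoint is published on the Hugging Face Hub as stabilityai/stable-video-diffusion-img2vid-xt and that the diffusers library's StableVideoDiffusionPipeline supports it; the official instructions instead point to Stability AI's generative-models repository.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Assumed Hub ID for the 25-frame XT checkpoint.
MODEL_ID = "stabilityai/stable-video-diffusion-img2vid-xt"

# Load the pipeline in half precision to reduce GPU memory usage.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Conditioning frame: a still image resized to the model's native 1024x576 (width x height).
image = load_image("input.png").resize((1024, 576))

# Generate 25 frames; decode_chunk_size trades peak VRAM for decoding speed.
frames = pipe(image, decode_chunk_size=8, generator=torch.manual_seed(42)).frames[0]

# Write the generated frames out as a short video clip.
export_to_video(frames, "generated.mp4", fps=7)
```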
Architecture
- Model Type: Generative image-to-video model
- Base Model: Finetuned from SVD Image-to-Video (14 frames version)
- Developed and Funded by: Stability AI
- Decoders Used: f8-decoder and a standard frame-wise decoder
Training
The model is trained to generate video clips by conditioning on a still image. It is fine-tuned from the earlier 14-frame SVD Image-to-Video model, with additional improvements for temporal consistency. The training framework and inference methods are implemented in Stability AI's generative-models GitHub repository.
Guide: Running Locally
- Clone Repository: Clone the generative-models repository from Stability AI's GitHub.
- Setup Environment: Install the necessary dependencies as listed in the repository's documentation.
- Run Model: Use the provided scripts to run the model on your local machine.
- Hardware Recommendations:
  - Cloud GPUs: Consider using cloud services such as AWS, Google Cloud, or Azure to access GPUs for efficient processing (see the memory-saving sketch after this list).
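If local GPU memory is limited, a few optional optimizations can help. This is a hedged sketch that again assumes the diffusers pipeline and Hub ID from the earlier example; the exact savings depend on your hardware.

```python
import torch
from diffusers import StableVideoDiffusionPipeline

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # assumed Hub ID
    torch_dtype=torch.float16,
    variant="fp16",
)

# Move submodules to the GPU only when they are needed instead of
# keeping the whole pipeline resident in VRAM.
pipe.enable_model_cpu_offload()

# Chunk the UNet's feed-forward layers to lower peak memory at some cost in speed.
pipe.unet.enable_forward_chunking()

# A smaller decode_chunk_size further reduces peak memory during VAE decoding, e.g.:
# frames = pipe(image, decode_chunk_size=2).frames[0]
```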
License
The model is released under the stable-video-diffusion-nc-community license. Usage is intended for research purposes only, and all users must adhere to Stability AI's Acceptable Use Policy. The model should not be used to generate factual representations of people or events.