Stable Video Diffusion img2vid-xt
Introduction
The Stable Video Diffusion (SVD) Image-to-Video model is a latent diffusion model that generates short video clips from a single still image used as a conditioning frame. It produces 25 frames at a resolution of 576x1024, and its f8-decoder is fine-tuned for temporal consistency.
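For illustration only (not part of the model card), here is a minimal inference sketch. It assumes the checkpoint is published on the Hugging Face Hub as stabilityai/stable-video-diffusion-img2vid-xt and that the diffusers library's StableVideoDiffusionPipeline supports it; the official instructions instead point to Stability AI's generative-models repository.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Assumed Hub ID for the 25-frame XT checkpoint.
MODEL_ID = "stabilityai/stable-video-diffusion-img2vid-xt"

# Load the pipeline in half precision to reduce GPU memory usage.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Conditioning frame: a still image resized to the model's native 1024x576 (width x height).
image = load_image("input.png").resize((1024, 576))

# Generate 25 frames; decode_chunk_size trades peak VRAM for decoding speed.
frames = pipe(image, decode_chunk_size=8, generator=torch.manual_seed(42)).frames[0]

# Write the generated frames out as a short video clip.
export_to_video(frames, "generated.mp4", fps=7)
```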
Architecture
- Model Type: Generative image-to-video model
- Base Model: Finetuned from SVD Image-to-Video (14 frames version)
- Developed and Funded by: Stability AI
- Decoders Used: f8-decoder and a standard frame-wise decoder
Training
The model is trained to generate video clips by conditioning on a still image. It is fine-tuned from the earlier 14-frame SVD Image-to-Video model, with additional improvements for temporal consistency. The training framework and inference methods are implemented in Stability AI's generative-models GitHub repository.
Guide: Running Locally
- Clone Repository: Clone the generative-models repository from Stability AI's GitHub.
- Setup Environment: Install the necessary dependencies as listed in the repository's documentation.
- Run Model: Use the provided scripts to run the model on your local machine.
- Hardware Recommendations:
  - Cloud GPUs: Consider using cloud services such as AWS, Google Cloud, or Azure to access GPUs for efficient processing (see the memory-saving sketch after this list).
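If local GPU memory is limited, a few optional optimizations can help. This is a hedged sketch that again assumes the diffusers pipeline and Hub ID from the earlier example; the exact savings depend on your hardware.

```python
import torch
from diffusers import StableVideoDiffusionPipeline

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # assumed Hub ID
    torch_dtype=torch.float16,
    variant="fp16",
)

# Move submodules to the GPU only when they are needed instead of
# keeping the whole pipeline resident in VRAM.
pipe.enable_model_cpu_offload()

# Chunk the UNet's feed-forward layers to lower peak memory at some cost in speed.
pipe.unet.enable_forward_chunking()

# A smaller decode_chunk_size further reduces peak memory during VAE decoding, e.g.:
# frames = pipe(image, decode_chunk_size=2).frames[0]
```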
License
The model is released under the stable-video-diffusion-nc-community license. Usage is intended for research purposes only, and all users must adhere to Stability AI's Acceptable Use Policy. The model should not be used to generate factual representations of people or events.