guoyww/animatediff-sparsectrl-rgb

Introduction
AnimateDiff is a method for generating videos from pre-existing Stable Diffusion Text-to-Image models. It achieves coherent motion across frames by inserting motion module layers into a frozen text-to-image model and training those modules on video clips to extract motion priors. The motion modules are integrated after the ResNet and attention blocks within the Stable Diffusion UNet.
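In diffusers, this insertion can be made concrete by combining a standard Stable Diffusion UNet with a motion adapter. The snippet below is only an illustrative sketch, not part of the original card: it reuses the base model and motion adapter referenced later in this guide and assumes a recent diffusers release that provides UNetMotionModel.from_unet2d.

```python
from diffusers import UNet2DConditionModel, UNetMotionModel
from diffusers.models import MotionAdapter

# A pre-trained Stable Diffusion text-to-image UNet (the base checkpoint used later in this card).
unet2d = UNet2DConditionModel.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE", subfolder="unet"
)

# The motion adapter holds the temporal (motion module) layers learned from video clips.
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-3")

# Combine them: the motion modules are inserted after the spatial ResNet/attention blocks.
unet_motion = UNetMotionModel.from_unet2d(unet2d, motion_adapter=adapter)

# The inserted temporal layers appear as "motion_modules" submodules of the UNet blocks.
print([name for name, _ in unet_motion.named_modules() if name.endswith("motion_modules")])
```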
Architecture
The implementation introduces a MotionAdapter and a UNetMotionModel, which make it possible to use motion modules with existing Stable Diffusion models. The SparseControlNetModel, a variant of ControlNet, is implemented for AnimateDiff: ControlNet adds conditional control to Text-to-Image diffusion models, and SparseCtrl extends this idea to Text-to-Video diffusion models by conditioning on a sparse set of frames. This checkpoint is the RGB variant of SparseCtrl, which takes RGB images as its conditioning frames.
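As a quick way to see what this repository contains, the SparseCtrl ControlNet can be loaded on its own and its configuration inspected. This is a small illustrative snippet rather than part of the original card:

```python
from diffusers.models import SparseControlNetModel

# Load the SparseCtrl (RGB) ControlNet shipped in this repository.
controlnet = SparseControlNetModel.from_pretrained("guoyww/animatediff-sparsectrl-rgb")

# Inspect the conditioning configuration and the model size.
print(controlnet.config)
print(f"{sum(p.numel() for p in controlnet.parameters()) / 1e6:.1f}M parameters")
```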
Training
The motion modules are trained on video clips while the weights of the pre-trained text-to-image model remain frozen, so they learn motion priors without altering the base model. This lets the model produce coherent motion across video frames while preserving the image fidelity of the underlying text-to-image generator.
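The training objective itself is the standard latent-diffusion denoising loss, with gradients flowing only through the newly inserted motion modules. The sketch below is a simplified illustration under assumed components (unet_motion, vae, scheduler, and a hypothetical video_dataloader yielding video clips with prompt embeddings); it is not the authors' training code.

```python
import torch
import torch.nn.functional as F

# Assumed to exist already: `unet_motion` (UNetMotionModel), `vae` (AutoencoderKL),
# `scheduler` (e.g. a DDPMScheduler), and a hypothetical `video_dataloader` that yields
# (frames, prompt_embeds) with frames shaped (batch, num_frames, 3, H, W).

# Freeze the pre-trained image UNet; only the motion modules receive gradients.
for name, param in unet_motion.named_parameters():
    param.requires_grad = "motion_modules" in name

optimizer = torch.optim.AdamW(
    [p for p in unet_motion.parameters() if p.requires_grad], lr=1e-4
)

for frames, prompt_embeds in video_dataloader:
    b, f, _, h, w = frames.shape

    # Encode every frame with the frozen VAE, then arrange the latents as
    # (batch, channels, frames, height, width), the layout used by the motion UNet.
    latents = vae.encode(frames.flatten(0, 1)).latent_dist.sample() * 0.18215
    latents = latents.reshape(b, f, 4, h // 8, w // 8).permute(0, 2, 1, 3, 4)

    # Standard noise-prediction (epsilon) objective.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (b,), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)

    noise_pred = unet_motion(noisy_latents, timesteps, encoder_hidden_states=prompt_embeds).sample
    loss = F.mse_loss(noise_pred.float(), noise.float())

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```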
Guide: Running Locally
To run AnimateDiff locally, follow these steps:
- Set up the environment:
  - Ensure you have a compatible GPU. A cloud GPU service such as AWS, Google Cloud, or Azure is recommended for optimal performance.
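A quick check (not part of the original guide) that PyTorch can actually see a CUDA device before moving on:

```python
import torch

# The float16 pipeline in the steps below assumes a CUDA-capable GPU is available.
if torch.cuda.is_available():
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device found; the float16 pipeline below expects a GPU.")
```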
- Install the necessary libraries (transformers, accelerate, and peft cover the text encoder, efficient model loading, and LoRA support used in the steps below):

```bash
pip install torch diffusers transformers accelerate peft
```
- Load pre-trained models:

```python
import torch

from diffusers import AnimateDiffSparseControlNetPipeline
from diffusers.models import AutoencoderKL, MotionAdapter, SparseControlNetModel
from diffusers.schedulers import DPMSolverMultistepScheduler
from diffusers.utils import export_to_gif, load_image

model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-3"
controlnet_id = "guoyww/animatediff-sparsectrl-rgb"
lora_adapter_id = "guoyww/animatediff-motion-lora-v1-5-3"
vae_id = "stabilityai/sd-vae-ft-mse"
device = "cuda"

# Motion modules, the SparseCtrl (RGB) ControlNet, and a fine-tuned VAE.
motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id, torch_dtype=torch.float16).to(device)
controlnet = SparseControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16).to(device)
vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16).to(device)

# DPM-Solver++ scheduler built from the base model's scheduler config.
scheduler = DPMSolverMultistepScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    beta_schedule="linear",
    algorithm_type="dpmsolver++",
    use_karras_sigmas=True,
)

pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
    model_id,
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    vae=vae,
    scheduler=scheduler,
    torch_dtype=torch.float16,
).to(device)

# Optional motion LoRA that refines the learned motion.
pipe.load_lora_weights(lora_adapter_id, adapter_name="motion_lora")
```
- Generate video:

```python
# A single RGB conditioning frame, applied to frame index 0 of the video.
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-firework.png"
)

video = pipe(
    prompt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background",
    negative_prompt="low quality, worst quality",
    num_inference_steps=25,
    conditioning_frames=image,
    controlnet_frame_indices=[0],  # condition only the first frame (sparse control)
    controlnet_conditioning_scale=1.0,
    generator=torch.Generator().manual_seed(42),
).frames[0]

export_to_gif(video, "output.gif")
```
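If GPU memory is tight, recent diffusers pipelines expose optional memory optimizations that can be enabled before generation. This is an optional extra, not part of the original guide; if you use CPU offloading, skip the `.to(device)` call when building the pipeline.

```python
# Optional memory savings: offload submodules to CPU between forward passes
# and decode the latent video in slices instead of all frames at once.
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()
```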
License
The AnimateDiff model and associated code are released under an open-source license, allowing free use and modification. Please refer to the repository for specific licensing terms and conditions.