AnimateDiff SparseCtrl RGB

guoyww

Introduction

AnimateDiff is a method for generating videos with existing Stable Diffusion text-to-image models. It achieves coherent motion across frames by inserting motion module layers into a frozen text-to-image model and training those layers on video clips so that they learn a motion prior. The motion modules are placed after the ResNet and Attention blocks in the Stable Diffusion UNet.
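As a rough illustration of what a motion module does, the sketch below applies self-attention along the frame axis for a single spatial location and adds the result back to the input. It is a simplified, pure-Python stand-in, not the actual implementation: the function names are hypothetical, and the learned projections, positional encodings, and multi-head structure of real temporal layers are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(frames):
    """Attend across frames for one spatial location.

    frames: list of per-frame feature vectors (lists of floats, equal length).
    Queries, keys, and values are the raw features (assumption: no learned
    projections, unlike a real motion module).
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in frames:
        # Similarity of this frame to every frame, including itself.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in frames]
        weights = softmax(scores)
        # Weighted average of all frames' features, channel by channel.
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, frames)) for d in range(dim)]
        )
    return attended


def motion_module(frames):
    """Residual temporal mixing: each frame's features plus the attended mix."""
    attended = temporal_self_attention(frames)
    return [[x + a for x, a in zip(frame, mix)] for frame, mix in zip(frames, attended)]


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 frames, 2 channels
mixed = motion_module(features)
print(len(mixed), len(mixed[0]))
```

The spatial layers of the base model treat every frame independently; a layer of this kind is the only place where information flows between frames, which is why it can be trained separately from the frozen image model.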

Architecture

In Diffusers, the motion modules are packaged as a MotionAdapter, and a UNetMotionModel combines them with an existing Stable Diffusion UNet. SparseControlNetModel is the AnimateDiff implementation of SparseCtrl, which extends ControlNet, a method for adding conditional control to text-to-image diffusion models, to text-to-video diffusion models.
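The "sparse" part of SparseCtrl means that only a few frames of the video carry a conditioning image. The sketch below shows one way to picture that pairing: a list with one slot per frame, filled only at the conditioned indices, plus a binary mask marking which slots are filled. The helper name and the use of None as a placeholder are assumptions for illustration; this is a conceptual sketch, not the Diffusers implementation.

```python
def build_sparse_condition(num_frames, conditioning_frames, frame_indices):
    """Place each conditioning frame at its index and mark it in a mask.

    Hypothetical helper: unconditioned frames stay as None with mask value 0.
    """
    if len(conditioning_frames) != len(frame_indices):
        raise ValueError("need exactly one frame index per conditioning frame")
    conditions = [None] * num_frames
    mask = [0] * num_frames
    for frame, index in zip(conditioning_frames, frame_indices):
        if not 0 <= index < num_frames:
            raise IndexError(f"frame index {index} is outside 0..{num_frames - 1}")
        conditions[index] = frame
        mask[index] = 1
    return conditions, mask


# One RGB image conditioning the first frame of a 4-frame clip.
conditions, mask = build_sparse_condition(4, ["first_frame.png"], [0])
print(mask)  # [1, 0, 0, 0]
```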

Training

Training inserts motion modules into a pre-trained text-to-image model, keeps the base model frozen, and fits the motion modules on video clips so that they learn motion priors. Because the base weights are not updated, the model gains coherent motion across video frames while retaining the image quality of the underlying text-to-image model.
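The split between frozen and trained weights can be pictured as a filter over parameter names. The sketch below is illustrative only: the parameter names and the "motion_modules" marker are hypothetical examples, and a real training script would apply the same idea by disabling gradients on the frozen group.

```python
def split_trainable(param_names, marker="motion_modules"):
    """Return (trainable, frozen) parameter names based on a name marker."""
    trainable = [name for name in param_names if marker in name]
    frozen = [name for name in param_names if marker not in name]
    return trainable, frozen


# Hypothetical parameter names for one UNet block.
names = [
    "down_blocks.0.resnets.0.conv1.weight",
    "down_blocks.0.attentions.0.proj_in.weight",
    "down_blocks.0.motion_modules.0.proj_in.weight",
]
trainable, frozen = split_trainable(names)
print(trainable)
print(len(frozen), "parameters stay frozen")
```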

Guide: Running Locally

To run AnimateDiff locally, follow these steps:

  1. Set up the environment:

    • Ensure you have a CUDA-capable GPU; the example below loads the models in float16 on device "cuda". If you do not have one locally, a cloud GPU instance from a provider such as AWS, Google Cloud, or Azure also works.
  2. Install necessary libraries:

    pip install torch diffusers transformers accelerate peft
    
  3. Load pre-trained models:

    import torch
    from diffusers import AnimateDiffSparseControlNetPipeline
    from diffusers.models import AutoencoderKL, MotionAdapter, SparseControlNetModel
    from diffusers.schedulers import DPMSolverMultistepScheduler
    from diffusers.utils import export_to_gif, load_image
    
    model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
    motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-3"
    controlnet_id = "guoyww/animatediff-sparsectrl-rgb"
    lora_adapter_id = "guoyww/animatediff-motion-lora-v1-5-3"
    vae_id = "stabilityai/sd-vae-ft-mse"
    device = "cuda"
    
    motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id, torch_dtype=torch.float16).to(device)
    controlnet = SparseControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16).to(device)
    vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16).to(device)
    scheduler = DPMSolverMultistepScheduler.from_pretrained(
        model_id,
        subfolder="scheduler",
        beta_schedule="linear",
        algorithm_type="dpmsolver++",
        use_karras_sigmas=True,
    )
    pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
        model_id,
        motion_adapter=motion_adapter,
        controlnet=controlnet,
        vae=vae,
        scheduler=scheduler,
        torch_dtype=torch.float16,
    ).to(device)
    pipe.load_lora_weights(lora_adapter_id, adapter_name="motion_lora")
    
  4. Generate video:

    image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-firework.png")
    
    video = pipe(
        prompt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background",
        negative_prompt="low quality, worst quality",
        num_inference_steps=25,
        conditioning_frames=image,
        controlnet_frame_indices=[0],
        controlnet_conditioning_scale=1.0,
        generator=torch.Generator().manual_seed(42),
    ).frames[0]
    export_to_gif(video, "output.gif")
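Before running steps 3 and 4, it can help to confirm that the required packages are importable. The sketch below uses only the standard library. The package list is an assumption: torch and diffusers come from the install step, while transformers and peft are included because Stable Diffusion pipelines load their text encoder through transformers and load_lora_weights relies on peft.

```python
import importlib.util

# Assumed requirements for the steps above; adjust to your environment.
REQUIRED = ["torch", "diffusers", "transformers", "peft"]


def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

find_spec only locates a top-level package without importing it, so the check is fast and does not initialize CUDA.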
    

License

The AnimateDiff model and associated code are released under an open-source license, allowing free use and modification. Please refer to the repository for specific licensing terms and conditions.
