Text-to-Video-MS-1.7B

ali-vilab

Introduction

The Text-to-Video-MS-1.7B model is a multi-stage text-to-video diffusion model that generates videos matching English text descriptions. It is intended for research purposes.

Architecture

The model consists of three main components:

  • A text feature extraction model.
  • A text feature-to-video latent space diffusion model.
  • A video latent space to video visual space model.

The overall model has approximately 1.7 billion parameters and uses a UNet3D structure to generate videos by iteratively denoising a sample that starts as pure Gaussian noise.
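The iterative-denoising idea can be illustrated with a toy one-dimensional example in pure Python (this is only a sketch of the principle; the real model predicts noise with a UNet3D and a learned scheduler):

```python
import random

def toy_denoise(target, steps=25, seed=0):
    """Toy illustration of iterative denoising: start from Gaussian
    noise and move a fixed fraction toward the target each step.
    The real pipeline replaces this fixed update with a learned
    noise-prediction network."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    for _ in range(steps):
        x = [xi + 0.3 * (ti - xi) for xi, ti in zip(x, target)]
    return x

target = [1.0, -1.0, 0.5]
result = toy_denoise(target)
# after 25 steps each value lies very close to the target
```

Each step removes a fraction of the remaining "noise", which is why more inference steps (e.g. `num_inference_steps=25` below) generally yield cleaner output.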

Training

The model was trained on public datasets including LAION-5B, ImageNet, and WebVid. The training data was filtered for aesthetic quality and watermark presence, and deduplicated. The model supports only English input and has limitations in complex compositional generation and with non-English prompts.
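The kind of filtering described above can be sketched roughly as follows (the field names and threshold here are hypothetical, chosen only to illustrate an aesthetic cutoff, a watermark flag, and caption-level deduplication):

```python
def filter_samples(samples, min_aesthetic=5.0):
    """Keep samples that pass an aesthetic threshold, carry no
    watermark flag, and have a caption not seen before.
    Hypothetical schema; only illustrates the filtering steps."""
    seen = set()
    kept = []
    for s in samples:
        if s["aesthetic"] < min_aesthetic or s["watermark"]:
            continue               # drop low-quality or watermarked samples
        if s["caption"] in seen:
            continue               # deduplicate by caption
        seen.add(s["caption"])
        kept.append(s)
    return kept

data = [
    {"caption": "a cat", "aesthetic": 6.1, "watermark": False},
    {"caption": "a cat", "aesthetic": 6.5, "watermark": False},  # duplicate
    {"caption": "logo",  "aesthetic": 7.0, "watermark": True},   # watermarked
    {"caption": "a dog", "aesthetic": 4.0, "watermark": False},  # low score
]
kept = filter_samples(data)
# only the first "a cat" sample survives
```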

Guide: Running Locally

  1. Install Required Libraries:

    pip install diffusers transformers accelerate torch
    
  2. Generate a Video:

    import torch
    from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
    from diffusers.utils import export_to_video
    
    # Load the fp16 weights to halve memory use.
    pipe = DiffusionPipeline.from_pretrained(
        "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
    )
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    pipe.enable_model_cpu_offload()  # offload idle submodules to CPU to save VRAM
    
    prompt = "Spiderman is surfing"
    video_frames = pipe(prompt, num_inference_steps=25).frames
    video_path = export_to_video(video_frames)  # writes an .mp4 and returns its path
    
  3. Optimize for Longer Videos:

    # Decode the VAE in slices so long videos don't exhaust GPU memory.
    pipe.enable_vae_slicing()
    prompt = "Spiderman is surfing. Darth Vader is also surfing and following Spiderman"
    video_frames = pipe(prompt, num_inference_steps=25, num_frames=200).frames
    video_path = export_to_video(video_frames)
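`enable_vae_slicing()` makes the VAE decode the latent video in slices rather than all frames at once, bounding peak memory at the cost of a little speed. The effect can be sketched abstractly in pure Python (`decode` here is a stand-in for the real VAE decoder, not the diffusers implementation):

```python
def decode_in_slices(latents, slice_size, decode):
    """Decode latent frames in fixed-size slices so that peak memory
    is bounded by slice_size rather than by the total frame count.
    Conceptual sketch of what VAE slicing does."""
    frames = []
    for start in range(0, len(latents), slice_size):
        frames.extend(decode(latents[start:start + slice_size]))
    return frames

# Toy "decoder": pretend decoding doubles each latent value.
latents = list(range(10))
frames = decode_in_slices(latents, slice_size=4, decode=lambda xs: [2 * x for x in xs])
# identical result to decoding everything in one pass, with smaller peak memory
```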
    

Suggested Cloud GPUs: If local resources are insufficient, use cloud services such as AWS, Google Cloud, or Azure for access to more powerful GPUs.

License

The model is released under a CC-BY-NC-ND license, which permits sharing with attribution but prohibits commercial use and derivative works.
