Text-to-Video-MS-1.7B
ali-vilab
Introduction
Text-to-Video-MS-1.7B is a multi-stage text-to-video diffusion model that generates videos from English text descriptions. It is intended for research purposes.
Architecture
The model includes three main components:
- A text feature extraction model.
- A text feature-to-video latent space diffusion model.
- A video latent space to video visual space model.
The model adopts a UNet3D structure and generates videos through iterative denoising of Gaussian noise; in total it has approximately 1.7 billion parameters.
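As a rough illustration of how these stages appear in practice, the sketch below loads the checkpoint with diffusers and inspects its sub-modules. The attribute names (text_encoder, unet, vae) follow the usual diffusers pipeline layout and are assumed here, not quoted from the model card:

  import torch
  from diffusers import DiffusionPipeline

  # Assumed sub-module names, following the standard diffusers pipeline convention
  pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")

  print(type(pipe.text_encoder))  # stage 1: text feature extraction
  print(type(pipe.unet))          # stage 2: text feature-to-video latent space diffusion (UNet3D)
  print(type(pipe.vae))           # stage 3: video latent space to video visual space decoding

  # Rough total parameter count across the three stages (around 1.7 billion)
  total = sum(p.numel() for m in (pipe.text_encoder, pipe.unet, pipe.vae) for p in m.parameters())
  print(f"{total / 1e9:.2f}B parameters")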
Training
The model was trained on public datasets including LAION5B, ImageNet, and Webvid, with the data filtered for aesthetic quality and watermark presence and then deduplicated. The model only supports English input and has limitations in complex compositional generation.
Guide: Running Locally
- Install Required Libraries:

  pip install diffusers transformers accelerate torch
- Generate a Video:

  import torch
  from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
  from diffusers.utils import export_to_video

  # Load the pipeline in half precision, switch to the multistep DPM-Solver scheduler,
  # and offload model components to the CPU when they are not in use
  pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
  pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
  pipe.enable_model_cpu_offload()

  # Run 25 denoising steps and export the generated frames to an MP4 file
  prompt = "Spiderman is surfing"
  video_frames = pipe(prompt, num_inference_steps=25).frames
  video_path = export_to_video(video_frames)
- Optimize for Longer Videos:

  # Decode the video latents in slices to keep memory usage manageable at higher frame counts
  pipe.enable_vae_slicing()

  prompt = "Spiderman is surfing. Darth Vader is also surfing and following Spiderman"
  video_frames = pipe(prompt, num_inference_steps=25, num_frames=200).frames
  video_path = export_to_video(video_frames)
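Building on the longer-video example, the sketch below is an optional, assumption-based variation rather than part of the original guide: it enables attention slicing alongside VAE slicing (both generic diffusers memory options), fixes the random seed with a torch.Generator for reproducible output, and passes an explicit output path to export_to_video (the file name is just an example):

  # Optional variation (generic diffusers options, not prescribed by the model card)
  pipe.enable_attention_slicing()  # lower peak memory in the attention layers
  pipe.enable_vae_slicing()        # decode the latent video in slices

  generator = torch.Generator(device="cuda").manual_seed(42)  # fixed seed for reproducibility
  video_frames = pipe(prompt, num_inference_steps=25, num_frames=200, generator=generator).frames
  video_path = export_to_video(video_frames, output_video_path="surfing.mp4")  # example file name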
Suggested Cloud GPUs: Utilize cloud services like AWS, Google Cloud, or Azure for access to powerful GPUs if local resources are insufficient.
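One quick way to judge whether local resources are sufficient is to inspect the available GPU before loading the pipeline. The sketch below uses standard torch.cuda calls; the 16 GB threshold is an assumed rule of thumb, not a figure taken from this summary:

  import torch

  # Rough local-resource check before loading the pipeline
  if not torch.cuda.is_available():
      print("No CUDA GPU detected; consider a cloud GPU instance.")
  else:
      props = torch.cuda.get_device_properties(0)
      vram_gb = props.total_memory / 1024**3
      print(f"{props.name}: {vram_gb:.1f} GB VRAM")
      if vram_gb < 16:  # assumed rule of thumb, not a stated requirement
          print("Limited VRAM; rely on enable_model_cpu_offload() or a cloud GPU.")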
License
The model is released under the CC-BY-NC-ND license, which permits sharing with attribution, restricts use to non-commercial purposes, and prohibits derivative works.