Text-to-Video-MS-1.7B

ali-vilab

Introduction

The Text-to-Video-MS-1.7B model is a multi-stage text-to-video diffusion model that generates videos matching English text descriptions. It is intended for research purposes.

Architecture

The model consists of three main components:

  • A text feature extraction model.
  • A text feature-to-video latent space diffusion model.
  • A video latent space to video visual space model.

The overall model has approximately 1.7 billion parameters and uses a UNet3D structure to generate videos by iteratively denoising a sample that starts as pure Gaussian noise.
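The iterative-denoising idea can be illustrated with a toy one-dimensional example in pure Python (this is only a sketch of the principle; the real model predicts noise with a UNet3D and a learned scheduler):

```python
import random

def toy_denoise(target, steps=25, seed=0):
    """Toy illustration of iterative denoising: start from Gaussian
    noise and move a fixed fraction toward the target each step.
    The real pipeline replaces this fixed update with a learned
    noise-prediction network."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    for _ in range(steps):
        x = [xi + 0.3 * (ti - xi) for xi, ti in zip(x, target)]
    return x

target = [1.0, -1.0, 0.5]
result = toy_denoise(target)
# after 25 steps each value lies very close to the target
```

Each step removes a fraction of the remaining "noise", which is why more inference steps (e.g. `num_inference_steps=25` below) generally yield cleaner output.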

Training

The model was trained on public datasets including LAION-5B, ImageNet, and WebVid. The training data was filtered for aesthetic quality and watermark presence, and deduplicated. The model supports only English input and has limitations in complex compositional generation and with non-English prompts.
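The kind of filtering described above can be sketched roughly as follows (the field names and threshold here are hypothetical, chosen only to illustrate an aesthetic cutoff, a watermark flag, and caption-level deduplication):

```python
def filter_samples(samples, min_aesthetic=5.0):
    """Keep samples that pass an aesthetic threshold, carry no
    watermark flag, and have a caption not seen before.
    Hypothetical schema; only illustrates the filtering steps."""
    seen = set()
    kept = []
    for s in samples:
        if s["aesthetic"] < min_aesthetic or s["watermark"]:
            continue               # drop low-quality or watermarked samples
        if s["caption"] in seen:
            continue               # deduplicate by caption
        seen.add(s["caption"])
        kept.append(s)
    return kept

data = [
    {"caption": "a cat", "aesthetic": 6.1, "watermark": False},
    {"caption": "a cat", "aesthetic": 6.5, "watermark": False},  # duplicate
    {"caption": "logo",  "aesthetic": 7.0, "watermark": True},   # watermarked
    {"caption": "a dog", "aesthetic": 4.0, "watermark": False},  # low score
]
kept = filter_samples(data)
# only the first "a cat" sample survives
```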

Guide: Running Locally

  1. Install Required Libraries:

    pip install diffusers transformers accelerate torch
    
  2. Generate a Video:

    import torch
    from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
    from diffusers.utils import export_to_video
    
    # Load the fp16 weights to halve memory use.
    pipe = DiffusionPipeline.from_pretrained(
        "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
    )
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    pipe.enable_model_cpu_offload()  # offload idle submodules to CPU to save VRAM
    
    prompt = "Spiderman is surfing"
    video_frames = pipe(prompt, num_inference_steps=25).frames
    video_path = export_to_video(video_frames)  # writes an .mp4 and returns its path
    
  3. Optimize for Longer Videos:

    # Decode the VAE in slices so long videos don't exhaust GPU memory.
    pipe.enable_vae_slicing()
    prompt = "Spiderman is surfing. Darth Vader is also surfing and following Spiderman"
    video_frames = pipe(prompt, num_inference_steps=25, num_frames=200).frames
    video_path = export_to_video(video_frames)
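`enable_vae_slicing()` makes the VAE decode the latent video in slices rather than all frames at once, bounding peak memory at the cost of a little speed. The effect can be sketched abstractly in pure Python (`decode` here is a stand-in for the real VAE decoder, not the diffusers implementation):

```python
def decode_in_slices(latents, slice_size, decode):
    """Decode latent frames in fixed-size slices so that peak memory
    is bounded by slice_size rather than by the total frame count.
    Conceptual sketch of what VAE slicing does."""
    frames = []
    for start in range(0, len(latents), slice_size):
        frames.extend(decode(latents[start:start + slice_size]))
    return frames

# Toy "decoder": pretend decoding doubles each latent value.
latents = list(range(10))
frames = decode_in_slices(latents, slice_size=4, decode=lambda xs: [2 * x for x in xs])
# identical result to decoding everything in one pass, with smaller peak memory
```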
    

Suggested Cloud GPUs: If local resources are insufficient, use cloud services such as AWS, Google Cloud, or Azure for access to more powerful GPUs.

License

The model is released under a CC-BY-NC-ND license, which permits sharing with attribution but prohibits commercial use and derivative works.
