Pyramid Flow MiniFLUX

rain1011

Introduction

Pyramid Flow MiniFLUX is an autoregressive video generation model trained with flow matching, which keeps training efficient. It generates high-quality 10-second videos at 768p resolution and 24 FPS, and it supports both text-to-video and image-to-video generation.
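
To make the term "flow matching" concrete, here is a minimal sketch of the generic linear-path formulation in plain Python. It is an illustration under stated assumptions, not the model's training code: the function names are hypothetical, the "data" is a short list of floats rather than video latents, and the objective Pyramid Flow actually uses may differ in its details.

```python
import random

def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]

def target_velocity(noise, data):
    """Velocity of the straight path, constant in t: data - noise."""
    return [d - n for n, d in zip(noise, data)]

def flow_matching_loss(predict_velocity, data, rng):
    """Squared error between predicted and target velocity for one sample."""
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    t = rng.random()
    x_t = interpolate(noise, data, t)
    v_target = target_velocity(noise, data)
    v_pred = predict_velocity(x_t, t)  # hypothetical model callable
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(data)

def euler_sample(velocity_fn, noise, steps):
    """Generate a sample by integrating the velocity field from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In this formulation the network only has to regress a velocity, and sampling is a fixed number of integration steps, which is the role a setting such as `num_inference_steps` plays at inference time.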

Architecture

The model uses the MiniFLUX architecture, which improves human structure and motion stability over the earlier SD3-based version. It supports video generation at 384p and 768p and image generation at 1024p. For inference on GPUs with limited VRAM, it provides sequential CPU offloading and other memory-saving options.
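
Because the variant name and the frame size have to agree, it can help to keep them together in one place. The sketch below is a hypothetical helper, not part of the Pyramid Flow code. The 768p values match the inference call in the guide later in this document; the 384p entry (variant name and 384x640 frame size) is an assumption made by analogy and should be checked against the repository.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GenerationPreset:
    model_variant: str  # value passed as model_variant when loading the model
    height: int
    width: int

PRESETS = {
    # Taken from the 768p inference example in this document.
    "768p": GenerationPreset("diffusion_transformer_768p", 768, 1280),
    # Assumption: named and sized by analogy with the 768p entry.
    "384p": GenerationPreset("diffusion_transformer_384p", 384, 640),
}

def preset_for(resolution: str) -> GenerationPreset:
    """Return the preset for a resolution label, or raise ValueError."""
    try:
        return PRESETS[resolution]
    except KeyError:
        choices = ", ".join(sorted(PRESETS))
        raise ValueError(
            f"unsupported resolution {resolution!r}; choose from: {choices}"
        ) from None
```

A caller would then pass `preset.model_variant` when constructing the model and `preset.height` and `preset.width` when generating, so the two cannot drift apart.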

Training

The model is trained from scratch on open-source datasets using a FLUX structure. The training code and model checkpoints are publicly available for use and further experimentation. The move to this structure is what improved human structure accuracy and motion stability.

Guide: Running Locally

  1. Installation:

    • Set up the environment using Conda:
      git clone https://github.com/jy0205/Pyramid-Flow
      cd Pyramid-Flow
      conda create -n pyramid python==3.8.10
      conda activate pyramid
      pip install -r requirements.txt
      
    • Download the model from Hugging Face:
      from huggingface_hub import snapshot_download
      
      model_path = 'PATH'  # Local directory to save the checkpoint
      snapshot_download("rain1011/pyramid-flow-miniflux", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')
      
  2. Usage:

    • For inference, load the model and run text-to-video generation:

      import torch
      from diffusers.utils import export_to_video
      from pyramid_dit import PyramidDiTForVideoGeneration
      
      torch.cuda.set_device(0)
      model_dtype, torch_dtype = 'bf16', torch.bfloat16
      # 'PATH' is the checkpoint directory; positional arguments must precede keyword arguments.
      model = PyramidDiTForVideoGeneration('PATH', model_dtype, model_name="pyramid_flux", model_variant='diffusion_transformer_768p')
      model.enable_sequential_cpu_offload()  # Trades speed for lower VRAM use
      
      prompt = "A movie trailer featuring the adventures of the 30 year old space man..."
      with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
          frames = model.generate(
              prompt=prompt, num_inference_steps=[20, 20, 20], height=768, width=1280, temp=16,
              guidance_scale=7.0, video_guidance_scale=5.0, output_type="pil", save_memory=True,
          )
      # Write the generated frames to an MP4 file at 24 FPS.
      export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
      
    • Cloud GPUs: For efficient video generation, a cloud GPU instance (for example on AWS, Google Cloud, or Azure) is recommended.
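
Since the checkpoint download is large, a quick check before loading the model can catch an interrupted or misplaced download early. The helper below is a hypothetical sketch using only the standard library. It assumes that the weights for a variant live in a subfolder of the checkpoint directory named after the `model_variant` value; that layout is an assumption and should be verified against the downloaded files.

```python
from pathlib import Path

def missing_checkpoint_parts(model_path, model_variant):
    """Return the paths that are absent or empty; an empty list means ready."""
    root = Path(model_path)
    if not root.is_dir():
        return [str(root)]
    # Assumption: one subfolder per variant, named after model_variant.
    expected = [root / model_variant]
    missing = []
    for part in expected:
        if not part.is_dir() or not any(part.iterdir()):
            missing.append(str(part))
    return missing
```

For example, calling `missing_checkpoint_parts('PATH', 'diffusion_transformer_768p')` before constructing the model gives a short, readable list of what to re-download instead of a failure partway through loading.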

License

The Pyramid Flow MiniFLUX model is licensed under the Apache 2.0 License, which permits use, modification, and redistribution under the terms of the license.
