Mochi Model Overview

Introduction

The Mochi model is a text-to-video generative model published by calcuis on the Hugging Face platform. It pairs a GGUF-quantized version of the T5XXL encoder with Mochi for efficient video generation from textual descriptions. The model is designed to be used with the ComfyUI interface and includes setup options for straightforward deployment.

Architecture

The Mochi model utilizes a GGUF-quantized version of the T5XXL encoder, optimized to work with Mochi. It comprises the following components:

  • mochi_fp8.safetensors: the fp8 diffusion model (10 GB).
  • t5xxl_fp16-q4_0.gguf: the quantized text encoder (2.9 GB).
  • mochi_vae_scaled.safetensors: the scaled VAE (725 MB).
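
To check what the quantized encoder actually contains before wiring it into ComfyUI, the gguf Python package can read a GGUF file's header. A minimal sketch, assuming `pip install gguf` and that the file sits in the current working directory:

```python
# Inspect the GGUF file's metadata and tensor table (illustrative paths).
from gguf import GGUFReader

reader = GGUFReader("t5xxl_fp16-q4_0.gguf")

# Header metadata: architecture, quantization version, tokenizer info, etc.
for field in reader.fields.values():
    print(field.name)

# Tensor table: each entry reports its name, quantization type, and shape.
for tensor in reader.tensors:
    print(tensor.name, tensor.tensor_type.name, tensor.shape)
```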

Training

The base model for Mochi is Genmo's mochi-1-preview. This release facilitates efficient text-to-video generation by compressing and optimizing the text encoder with GGUF quantization.
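
For intuition about what q4_0 quantization does, the sketch below round-trips a weight array through the format: weights are grouped into blocks of 32, and each block stores a single fp16 scale d plus 32 4-bit codes q, reconstructing each weight as w ≈ d · (q − 8). This is an illustration of the GGUF q4_0 scheme in NumPy, not the code used to produce this release:

```python
# Toy round-trip through GGUF's q4_0 block quantization scheme.
import numpy as np

BLOCK = 32  # q4_0 groups weights into blocks of 32

def q4_0_roundtrip(w: np.ndarray) -> np.ndarray:
    """Quantize a 1-D float32 array to 4-bit codes and dequantize it."""
    out = np.empty_like(w, dtype=np.float32)
    for i in range(0, len(w), BLOCK):
        block = w[i:i + BLOCK]
        maxv = block[np.argmax(np.abs(block))]  # signed value of largest magnitude
        d = np.float16(maxv / -8.0)             # per-block scale (fp16 in the real format)
        if d == 0:
            out[i:i + BLOCK] = 0.0
            continue
        # 4-bit codes in [0, 15]; the largest-magnitude weight maps to code 0
        q = np.clip(np.round(block / np.float32(d)) + 8, 0, 15)
        out[i:i + BLOCK] = np.float32(d) * (q - 8)  # dequantized weights
    return out

w = np.random.randn(256).astype(np.float32)
print("max round-trip error:", np.abs(w - q4_0_roundtrip(w)).max())
```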

Guide: Running Locally

Setup

  1. File Placement (a helper sketch follows this list):

    • Place mochi_fp8.safetensors in ./ComfyUI/models/diffusion_models.
    • Place t5xxl_fp16-q4_0.gguf in ./ComfyUI/models/text_encoders.
    • Place mochi_vae_scaled.safetensors in ./ComfyUI/models/vae.
  2. Execution:

    • Run the .bat file in the main directory (this assumes the GGUF-Comfy pack is in use).
    • Drag the corresponding workflow JSON file into the ComfyUI browser tab to load it.
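
The file placement in step 1 can be scripted. Below is a hypothetical helper, assuming the three files have been downloaded next to the script and that ComfyUI lives in the current directory; adjust COMFY_ROOT and the source paths to your setup:

```python
# Hypothetical helper: copy the downloaded model files into the ComfyUI
# folders listed in step 1. Paths are assumptions; adjust for your setup.
import shutil
from pathlib import Path

COMFY_ROOT = Path("./ComfyUI")

PLACEMENTS = {
    "mochi_fp8.safetensors": COMFY_ROOT / "models" / "diffusion_models",
    "t5xxl_fp16-q4_0.gguf": COMFY_ROOT / "models" / "text_encoders",
    "mochi_vae_scaled.safetensors": COMFY_ROOT / "models" / "vae",
}

for filename, target_dir in PLACEMENTS.items():
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    src = Path(filename)  # assumes the file was downloaded next to this script
    if src.exists():
        shutil.copy2(src, target_dir / filename)
        print(f"placed {filename} -> {target_dir}")
    else:
        print(f"missing: {filename} (download it first)")
```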

Workflows

Workflow JSON files accompany the release; load one by dragging it into the ComfyUI browser tab, as described in step 2 above.

Cloud GPUs

For optimal performance, consider using cloud GPU services such as AWS EC2 GPU instances, Google Cloud GPUs, or Azure N-series instances.

License

The Mochi model is released under the Apache 2.0 License, which allows for wide usage and modification with appropriate credit.
