sd-vae-ft-mse-original

stabilityai

Introduction

The SD-VAE-FT-MSE-ORIGINAL model is a fine-tuned autoencoder designed for use with the Stable Diffusion text-to-image generation model. Developed by Stability AI, this model improves upon the original kl-f8 autoencoder by focusing on image reconstruction quality, particularly for human faces.

Architecture

Two fine-tuned versions of the kl-f8 autoencoder are provided: ft-EMA and ft-MSE. The ft-EMA version was trained for 313,198 steps with Exponential Moving Average (EMA) weights, using a combination of L1 and LPIPS loss functions. The ft-MSE version continued from ft-EMA for an additional 280,000 steps with a loss re-weighted toward Mean Squared Error (MSE + 0.1 * LPIPS), which yields somewhat smoother outputs.
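The two objectives can be sketched as follows. This is a minimal illustration of the loss weighting only: the real LPIPS metric compares deep-network features, so a mean absolute pixel difference stands in for it here purely as a placeholder.

```python
def _mean(xs):
    return sum(xs) / len(xs)

def lpips_stub(x, y):
    # Placeholder for LPIPS, which in practice compares deep-network
    # features; a mean absolute difference stands in for illustration.
    return _mean([abs(a - b) for a, b in zip(x, y)])

def ft_ema_loss(recon, target):
    # ft-EMA objective: L1 reconstruction loss + LPIPS.
    l1 = _mean([abs(a - b) for a, b in zip(recon, target)])
    return l1 + lpips_stub(recon, target)

def ft_mse_loss(recon, target):
    # ft-MSE objective: MSE + 0.1 * LPIPS, trading some sharpness
    # for smoother reconstructions.
    mse = _mean([(a - b) ** 2 for a, b in zip(recon, target)])
    return mse + 0.1 * lpips_stub(recon, target)
```

Because the ft-MSE loss down-weights the perceptual term to 0.1, reconstruction error dominates, which is what drives the smoother outputs described above.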

Training

The training utilized a dataset comprising a 1:1 ratio of LAION-Aesthetics and LAION-Humans images. The batch size was 192, distributed over 16 NVIDIA A100 GPUs. Evaluation metrics, such as rFID, PSNR, SSIM, and PSIM, were used to assess model performance, showing improvements in image quality and reconstruction fidelity over the original autoencoder.
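Of the metrics listed, PSNR is the simplest to reproduce: it is a log-scaled ratio of the peak pixel value to the mean squared reconstruction error. A small sketch over flat pixel lists:

```python
import math

def psnr(a, b, max_val=255.0):
    # Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE).
    # Higher is better; identical images give infinite PSNR.
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)
```

rFID, SSIM, and PSIM are more involved (distributional and perceptual comparisons) and are typically computed with dedicated libraries.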

Guide: Running Locally

  1. Prerequisites: Ensure you have access to a suitable GPU, preferably a cloud GPU like NVIDIA A100, for efficient processing.
  2. Setup: Clone the original CompVis Stable Diffusion codebase from GitHub.
  3. Model Weights: Download the checkpoint for the version you want to use (ft-EMA or ft-MSE) from the model repository.
  4. Integration: Use the downloaded weights as a drop-in replacement for the existing autoencoder in the Stable Diffusion setup.
  5. Execution: Follow the Stable Diffusion codebase instructions to generate images using these fine-tuned models.
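If you work with the Hugging Face diffusers library rather than the original CompVis codebase, the swap in step 4 can be sketched as below. The hub ids (`stabilityai/sd-vae-ft-mse` for the diffusers-format VAE and `runwayml/stable-diffusion-v1-5` for the base pipeline) are assumptions for illustration; substitute the checkpoint you actually use.

```python
def load_pipeline_with_ft_mse_vae():
    """Sketch: build a Stable Diffusion pipeline with the ft-MSE VAE.

    Requires the `diffusers` package; imports are deferred so the
    function can be defined without it installed.
    """
    from diffusers import AutoencoderKL, StableDiffusionPipeline

    # Load the fine-tuned VAE, then pass it to the pipeline so it
    # replaces the stock kl-f8 autoencoder (hub ids are assumptions).
    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", vae=vae
    )
    return pipe

# Example usage (needs a GPU and the model downloads):
# pipe = load_pipeline_with_ft_mse_vae().to("cuda")
# image = pipe("a portrait photo of an astronaut").images[0]
```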

License

This project is licensed under the MIT License, allowing for open collaboration and modification.
