stabilityai/sd-vae-ft-ema-original
Introduction
This documentation provides an overview of the SD-VAE-FT-EMA-ORIGINAL model by Stability AI, focusing on improved autoencoders for text-to-image applications using the Stable Diffusion framework. The model is licensed under the MIT License and is intended for use with the original CompVis Stable Diffusion codebase.
Architecture
This release provides two kl-f8 autoencoder versions fine-tuned from the original kl-f8 autoencoder to improve image reconstruction, particularly of human faces. Fine-tuning used a 1:1 mixture of the LAION-Aesthetics and LAION-Humans datasets. Only the decoder was fine-tuned, so the checkpoints remain compatible with existing Stable Diffusion models and can serve as drop-in replacements for the original autoencoder.
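To illustrate the drop-in idea, the minimal sketch below swaps the fine-tuned VAE into a diffusers pipeline. Note that this repository ships checkpoints for the original CompVis codebase; the sketch assumes the diffusers-packaged sibling repository stabilityai/sd-vae-ft-ema and the CompVis/stable-diffusion-v1-4 base model.

```python
# Sketch: using the fine-tuned VAE as a drop-in replacement in a diffusers pipeline.
# Assumes the diffusers-packaged sibling repo "stabilityai/sd-vae-ft-ema";
# this repository itself ships .ckpt files for the original CompVis codebase.
from diffusers import StableDiffusionPipeline
from diffusers.models import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema")
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    vae=vae,  # only the autoencoder is swapped; UNet and text encoder stay unchanged
)

image = pipe("a portrait photo of an astronaut").images[0]
image.save("astronaut.png")
```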
Training
Training was performed in two phases:
- ft-EMA: Resumed from the original checkpoint and trained for 313,198 steps using EMA weights, with the same loss configuration as the original (L1 + LPIPS).
- ft-MSE: Further trained for an additional 280,000 steps from ft-EMA, emphasizing MSE reconstruction (MSE + 0.1 * LPIPS) to produce smoother outputs.
Both versions were trained with a batch size of 192 across 16 A100 GPUs. Reconstruction quality was evaluated with rFID, PSNR, SSIM, and PSIM, with both fine-tuned decoders improving on the original kl-f8 autoencoder.
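To make the two loss configurations concrete, here is a minimal sketch of the reconstruction objectives, assuming the lpips package as the perceptual metric; any KL regularization or adversarial terms used in the full training setup are omitted.

```python
# Minimal sketch of the two decoder fine-tuning objectives described above.
# Assumes the "lpips" package (pip install lpips) as the perceptual metric;
# regularization and adversarial terms from the full setup are omitted.
import torch
import torch.nn.functional as F
import lpips

lpips_fn = lpips.LPIPS(net="vgg")  # expects images scaled to [-1, 1]

def ft_ema_loss(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # ft-EMA: same configuration as the original checkpoint, L1 + LPIPS
    return F.l1_loss(recon, target) + lpips_fn(recon, target).mean()

def ft_mse_loss(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # ft-MSE: emphasis on MSE reconstruction, MSE + 0.1 * LPIPS
    return F.mse_loss(recon, target) + 0.1 * lpips_fn(recon, target).mean()
```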
Guide: Running Locally
- Prerequisites: Ensure you have Python and the required libraries installed.
- Download the Model: Obtain the model checkpoint from the Hugging Face repository stabilityai/sd-vae-ft-ema-original.
- Setup Environment: Clone the original CompVis Stable Diffusion repository and set up the environment as per instructions.
- Run the Model: Replace the existing autoencoder weights with the downloaded checkpoint (see the sketch after this list) and execute the inference script.
- Cloud GPUs: For optimal performance, consider using cloud-based GPUs like AWS EC2 with NVIDIA A100 instances.
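The sketch below illustrates the download and replacement steps. The checkpoint filename, the local Stable Diffusion checkpoint path, and the first_stage_model key prefix are assumptions based on common Stable Diffusion checkpoint layouts; verify them against the files actually published in the repository.

```python
# Sketch: download the fine-tuned VAE and splice it into a CompVis-style
# Stable Diffusion checkpoint. The filename below, the local checkpoint path,
# and the "first_stage_model." key prefix are assumptions; check the
# repository's file listing before use.
import torch
from huggingface_hub import hf_hub_download

vae_path = hf_hub_download(
    repo_id="stabilityai/sd-vae-ft-ema-original",
    filename="vae-ft-ema-560000-ema-pruned.ckpt",  # assumed filename
)

sd_ckpt = torch.load("models/ldm/stable-diffusion-v1/model.ckpt", map_location="cpu")
vae_ckpt = torch.load(vae_path, map_location="cpu")

sd_state = sd_ckpt["state_dict"]
vae_state = vae_ckpt.get("state_dict", vae_ckpt)

# Overwrite the autoencoder weights (encoder, decoder, quant convs) in the
# full checkpoint; loss/discriminator weights, if present, are skipped.
for key, value in vae_state.items():
    if not key.startswith("loss."):
        sd_state["first_stage_model." + key] = value

torch.save(sd_ckpt, "models/ldm/stable-diffusion-v1/model-ft-ema-vae.ckpt")
```

The merged checkpoint can then be passed to the CompVis inference script in place of the original one.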
License
The SD-VAE-FT-EMA-ORIGINAL model is released under the MIT License, allowing for wide usage and modification in both personal and commercial projects.