Introduction

The SD-VAE-FT-EMA project by Stability AI provides fine-tuned autoencoder (VAE) weights for use with the Diffusers library. Dropping these weights into a Stable Diffusion pipeline improves image reconstruction quality, particularly for human faces.

Architecture

The project provides two fine-tuned versions of the kl-f8 autoencoder, both intended as drop-in replacements for the original autoencoder in Stable Diffusion workflows. The first, ft-EMA, keeps the original training configuration and simply trains for additional steps, while the second, ft-MSE, uses a modified loss function to produce smoother outputs. Both checkpoints use Exponential Moving Average (EMA) weights.
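
As a standalone illustration of what the kl-f8 autoencoder does, the sketch below encodes an RGB image into a 4-channel latent at 1/8 spatial resolution and decodes it back. It assumes torch, torchvision, Pillow, and diffusers are installed; the input path "input.png" is a placeholder, and "stabilityai/sd-vae-ft-mse" can be substituted to try the ft-MSE checkpoint.

    import torch
    from PIL import Image
    from torchvision import transforms
    from diffusers.models import AutoencoderKL

    # Load the fine-tuned kl-f8 autoencoder ("stabilityai/sd-vae-ft-mse" is the other option).
    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").eval()

    # Preprocess to a (1, 3, H, W) tensor in [-1, 1]; H and W should be multiples of 8.
    image = Image.open("input.png").convert("RGB")
    x = transforms.ToTensor()(image).unsqueeze(0) * 2.0 - 1.0

    with torch.no_grad():
        latents = vae.encode(x).latent_dist.sample()  # (1, 4, H/8, W/8) latent
        recon = vae.decode(latents).sample            # reconstruction in [-1, 1]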

Training

The models were fine-tuned on a combination of the LAION-Aesthetics and LAION-Humans datasets. The ft-EMA model resumed training from the original checkpoint for 313,198 steps, using a loss configuration of L1 + LPIPS. The ft-MSE model continued from ft-EMA for an additional 280,000 steps, emphasizing Mean Squared Error (MSE) reconstruction with a loss of MSE + 0.1 * LPIPS. The training employed a batch size of 192, distributed over 16 A100 GPUs.
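
The two loss configurations can be written out as a short PyTorch sketch. The lpips package is used here for the perceptual term; the model card does not state which LPIPS backbone was used, so the "vgg" choice below is an assumption for illustration only.

    import torch.nn.functional as F
    import lpips  # pip install lpips

    # Perceptual term; the "vgg" backbone is an assumption, not taken from the model card.
    perceptual = lpips.LPIPS(net="vgg")

    def ft_ema_loss(recon, target):
        # ft-EMA objective: L1 reconstruction + LPIPS (inputs expected in [-1, 1]).
        return F.l1_loss(recon, target) + perceptual(recon, target).mean()

    def ft_mse_loss(recon, target):
        # ft-MSE objective: MSE reconstruction + 0.1 * LPIPS, yielding smoother outputs.
        return F.mse_loss(recon, target) + 0.1 * perceptual(recon, target).mean()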

Guide: Running Locally

To use the fine-tuned VAE with Diffusers:

  1. Install Diffusers and the libraries the pipeline depends on:

    pip install diffusers transformers torch
    
  2. Load the VAE model and integrate it into your Stable Diffusion pipeline:

    from diffusers.models import AutoencoderKL
    from diffusers import StableDiffusionPipeline
    
    model = "CompVis/stable-diffusion-v1-4"
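    # Swap in the fine-tuned VAE; it replaces the pipeline's default kl-f8 autoencoder.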
    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema")
    pipe = StableDiffusionPipeline.from_pretrained(model, vae=vae)
    
  3. Run your Stable Diffusion tasks with the enhanced VAE, for example as sketched below.
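
A minimal generation example continuing from the pipeline above; the prompt and output filename are placeholders, and moving the pipeline to "cuda" assumes a CUDA-capable GPU is available.

    # Generate an image with the pipeline that now uses the fine-tuned VAE.
    pipe = pipe.to("cuda")
    image = pipe("a photo of an astronaut riding a horse on mars").images[0]
    image.save("astronaut.png")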

For optimal performance, consider using cloud GPUs such as AWS EC2 or Google Cloud's AI Platform.

License

This model is licensed under the MIT License, allowing for broad use and modification.
