Introduction

The SD-VAE-FT-MSE model is a fine-tuned variant of the Stable Diffusion autoencoder designed to improve image reconstruction quality. Its fine-tuned VAE (variational autoencoder) decoder loads through the diffusers library as a drop-in replacement for the original autoencoder, producing smoother reconstructions.

Architecture

The model builds upon the KL-f8 autoencoder architecture used by Stable Diffusion. Two fine-tuned versions, ft-EMA and ft-MSE, are available. Both fine-tune only the decoder, leaving the encoder and latent space untouched, so they remain compatible with existing Stable Diffusion models. The ft-EMA model uses Exponential Moving Average (EMA) weights, while ft-MSE emphasizes Mean Squared Error (MSE) reconstruction for smoother image outputs.
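The "f8" in KL-f8 denotes an 8x spatial downsampling: the encoder maps an RGB image to a 4-channel latent whose height and width are each reduced by a factor of 8. A quick shape calculation illustrates this (pure-Python sketch; `latent_shape` is an illustrative helper, not a diffusers API):

```python
def latent_shape(height, width, downsample_factor=8, latent_channels=4):
    # KL-f8: each spatial dimension shrinks by the downsampling factor,
    # and the latent has 4 channels regardless of the input's channel count
    return (latent_channels, height // downsample_factor, width // downsample_factor)

print(latent_shape(512, 512))  # (4, 64, 64)
```

A 512x512 image therefore becomes a 4x64x64 latent, which is the tensor the diffusion UNet actually operates on.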

Training

The models were trained on a blend of the LAION-Aesthetics and LAION-Humans datasets, with a global batch size of 192 distributed across 16 A100 GPUs (12 per GPU). The ft-EMA version was trained for 313,198 steps with a combined L1 + LPIPS loss, while the ft-MSE version was resumed from ft-EMA and trained for an additional 280,000 steps with an emphasis on MSE reconstruction alongside a down-weighted LPIPS term.
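The two loss configurations can be sketched as simple weighted sums. The helper names and the 0.1 LPIPS weight for ft-MSE below are illustrative assumptions, not values read from the released training configs:

```python
def ft_ema_loss(l1_term, lpips_term):
    # ft-EMA: L1 reconstruction loss plus a perceptual (LPIPS) term (illustrative)
    return l1_term + lpips_term

def ft_mse_loss(mse_term, lpips_term, lpips_weight=0.1):
    # ft-MSE: MSE-dominated loss with a small perceptual term, which is
    # what biases the decoder toward smoother outputs (illustrative weight)
    return mse_term + lpips_weight * lpips_term
```

Down-weighting the perceptual term relative to MSE trades some fine texture for fewer high-frequency artifacts, which is the stated goal of the ft-MSE variant.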

Guide: Running Locally

To use the SD-VAE-FT-MSE model locally, follow these steps:

  1. Install the diffusers library:

    pip install diffusers
    
  2. Load the model with the diffusers library:

    from diffusers.models import AutoencoderKL
    from diffusers import StableDiffusionPipeline
    
    # Swap the fine-tuned VAE into a standard Stable Diffusion v1.4 pipeline
    model = "CompVis/stable-diffusion-v1-4"
    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
    pipe = StableDiffusionPipeline.from_pretrained(model, vae=vae)
    
  3. Run the pipeline in a suitable computing environment. For best performance, especially for large-scale image generation, consider a cloud GPU such as an NVIDIA A100.
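The fine-tuned VAE is a drop-in replacement because only the decoder changed: the latent space, and the scaling factor pipelines apply when passing latents between the UNet and the VAE (0.18215 for the SD v1 autoencoders), stay the same. A pure-Python sketch of that rescaling round trip (helper names are illustrative, not diffusers APIs):

```python
SD_V1_SCALING_FACTOR = 0.18215  # scaling_factor from the SD v1 VAE config

def to_unet_space(latents):
    # Pipelines multiply encoder outputs by the scaling factor before denoising
    return [x * SD_V1_SCALING_FACTOR for x in latents]

def to_vae_space(latents):
    # ...and divide by it again before handing latents to the VAE decoder
    return [x / SD_V1_SCALING_FACTOR for x in latents]

roundtrip = to_vae_space(to_unet_space([1.0, -0.5, 2.0]))
```

Because this bookkeeping is unchanged, any existing SD v1 checkpoint keeps working after the VAE swap shown in step 2.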

License

The SD-VAE-FT-MSE model is licensed under the MIT License, allowing for free use, modification, and distribution of the software.
