Stable Diffusion Image Variations Model Card

Introduction

The Stable Diffusion Image Variations model is fine-tuned from CompVis/stable-diffusion-v1-4-original to accept CLIP image embeddings instead of text embeddings as conditioning. Given an input image, it generates "image variations" similar to those produced by DALL·E 2. The weights have been ported to the 🤗 Diffusers library. Older Diffusers versions require the Lambda Diffusers repository; from Diffusers 0.8.0 onward the pipeline can be imported from the library directly.

Architecture

The model replaces the text encoder of the Stable Diffusion framework with an image encoder. Input images are encoded by a CLIP ViT-L/14 image encoder, including the projection into the CLIP shared embedding space, and the resulting embedding conditions the UNet in place of the text embedding. The model was trained on the LAION improved aesthetics 6plus dataset.

Training

The model was trained in two stages on 8 x A100-40GB GPUs using the AdamW optimizer:

Stage 1: Fine-tuned only the CrossAttention layer weights from the Stable Diffusion v1.4 model for 46,000 steps with a total batch size of 128. The learning rate was warmed up to 1e-5 over 10,000 steps.
Stage 2: Resumed from Stage 1, training the entire UNet for 50,000 steps with a total batch size of 160. The learning rate was warmed up to 1e-5 over 5,000 steps.
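
The numbers above can be sanity-checked with a little arithmetic. The following is a minimal sketch, not the training code: it assumes the warmup is linear and that the learning rate stays constant afterwards (the card only says the rate was "warmed up"), and the names warmup_lr, samples_per_gpu, and images_seen are hypothetical helpers introduced for illustration.

```python
# Stage settings copied from the training description above.
STAGES = {
    "stage1": {"steps": 46_000, "total_batch": 128, "warmup_steps": 10_000},
    "stage2": {"steps": 50_000, "total_batch": 160, "warmup_steps": 5_000},
}
NUM_GPUS = 8
PEAK_LR = 1e-5


def warmup_lr(step, peak_lr=PEAK_LR, warmup_steps=10_000):
    """Learning rate at a given step.

    Assumption: linear ramp from 0 to peak_lr, then constant.
    The card does not state the exact shape of the schedule.
    """
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps


def samples_per_gpu(total_batch, num_gpus=NUM_GPUS):
    """Samples each GPU contributes per optimizer step
    (per-device batch size times gradient accumulation steps)."""
    return total_batch // num_gpus


def images_seen(stage):
    """Total training samples processed in a stage."""
    cfg = STAGES[stage]
    return cfg["steps"] * cfg["total_batch"]


if __name__ == "__main__":
    for name, cfg in STAGES.items():
        print(
            name,
            "per-GPU samples/step:", samples_per_gpu(cfg["total_batch"]),
            "samples seen:", images_seen(name),
            "lr at step 2500:", warmup_lr(2_500, warmup_steps=cfg["warmup_steps"]),
        )
```

Under these assumptions each GPU contributes 16 samples per optimizer step in Stage 1 and 20 in Stage 2; how that splits between per-device batch size and gradient accumulation is not stated above.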

Guide: Running Locally

To run the model locally, ensure you have Diffusers version >=0.8.0. Use the following Python code snippet:

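from diffusers import StableDiffusionImageVariationPipeline
from PIL import Image
import torch
from torchvision import transforms

# Use the first GPU if one is available, otherwise fall back to the CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the v2.0 weights of the image-variations pipeline.
sd_pipe = StableDiffusionImageVariationPipeline.from_pretrained(
    "lambdalabs/sd-image-variations-diffusers",
    revision="v2.0",
)
sd_pipe = sd_pipe.to(device)

# Force three channels so the 3-channel Normalize below also works
# for grayscale or RGBA inputs.
im = Image.open("path/to/image.jpg").convert("RGB")

# Preprocess for the CLIP image encoder: scale to [0, 1], resize to 224x224
# (bicubic, antialiasing disabled), then normalize with the CLIP mean/std.
tform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BICUBIC, antialias=False),
    transforms.Normalize([0.48145466, 0.4578275, 0.40821073], [0.26862954, 0.26130258, 0.27577711]),
])
inp = tform(im).to(device).unsqueeze(0)  # add a batch dimension

# Generate a variation and save the first output image.
out = sd_pipe(inp, guidance_scale=3)
out["images"][0].save("result.jpg")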
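
The preprocessing in the snippet above boils down to simple per-channel arithmetic. As a rough illustration, the sketch below applies the same scaling and normalization to a single 8-bit RGB pixel in plain Python. It ignores the resize step, and normalize_pixel and denormalize_pixel are hypothetical helper names, not part of Diffusers or torchvision.

```python
# Per-channel mean and standard deviation, copied from the Normalize call above.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Map an 8-bit (R, G, B) pixel to normalized floats.

    Mirrors ToTensor (divide by 255) followed by Normalize
    ((x - mean) / std), one channel at a time.
    """
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


def denormalize_pixel(values):
    """Invert normalize_pixel and round back to 8-bit integers."""
    return tuple(
        round((value * std + mean) * 255.0)
        for value, mean, std in zip(values, CLIP_MEAN, CLIP_STD)
    )


if __name__ == "__main__":
    pixel = (12, 200, 255)
    normalized = normalize_pixel(pixel)
    print(normalized)
    print(denormalize_pixel(normalized))
```

Because the mean is subtracted before dividing by the standard deviation, a black pixel maps to negative values and a white pixel to positive ones, so the tensor passed to the pipeline is not confined to [0, 1].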

Inference is much faster on a GPU than on a CPU. If no suitable local GPU is available, cloud GPUs such as those provided by Lambda GPU Cloud can be used instead.

License

The model is released under the CreativeML OpenRAIL-M license and is intended for research purposes only. Misuse, including the generation of harmful or offensive content, is prohibited. The model was not trained to produce factual or true representations of people or events, and it inherits the limitations and biases of its training data.