Stable Diffusion 2 Inpainting

stabilityai

Introduction

The Stable Diffusion v2 inpainting model, developed by Robin Rombach and Patrick Esser, is a diffusion-based model that generates and modifies images guided by text prompts. It builds upon the stable-diffusion-2-base model and was trained with a mask-generation strategy for inpainting tasks.

Architecture

Stable Diffusion v2 is a latent diffusion model that combines an autoencoder with a diffusion model trained in the latent space. It uses a pretrained text encoder (OpenCLIP-ViT/H) to process text prompts, which are integrated into the model via cross-attention. The model uses a reconstruction objective to predict noise added to latent representations during training.
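The reconstruction objective can be illustrated with a toy epsilon-prediction sketch. Everything here is a stand-in: `unet_stub` is a hypothetical placeholder for the real text-conditioned U-Net, and the shapes and schedule value are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

z0 = rng.standard_normal((4, 8, 8))   # clean latent from the autoencoder
eps = rng.standard_normal(z0.shape)   # Gaussian noise the model must predict
alpha_bar = 0.7                       # cumulative noise-schedule term at some timestep t

# Forward diffusion: mix the clean latent with noise.
z_t = np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps

def unet_stub(z, t):
    """Hypothetical noise predictor; the real model is a large U-Net
    conditioned on OpenCLIP text embeddings via cross-attention."""
    return np.zeros_like(z)

# Training loss: mean squared error between true and predicted noise.
loss = float(np.mean((eps - unet_stub(z_t, 0)) ** 2))
```

During training, minimizing this loss teaches the network to recover the noise that was added, which is what lets the sampler denoise latents step by step at inference time.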

Training

The training data for the model is based on the LAION-5B dataset, filtered for explicit content. Various checkpoints of the model have been trained, such as 512-base-ema.ckpt for lower-resolution tasks and 512-inpainting-ema.ckpt for inpainting. The training utilized 32 x 8 A100 GPUs with the AdamW optimizer and a learning rate of 0.0001 after a warmup period.
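The warmup-then-constant learning-rate behaviour mentioned above can be sketched as a small helper. The 10,000-step warmup length here is a hypothetical placeholder, not a value taken from this card; only the 0.0001 target rate is stated above.

```python
def warmup_lr(step: int, warmup_steps: int = 10_000, base_lr: float = 1.0e-4) -> float:
    """Linearly ramp the learning rate up to base_lr, then hold it constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

For example, `warmup_lr(0)` returns a small fraction of `base_lr`, while any step at or beyond `warmup_steps` returns the full `0.0001`.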

Guide: Running Locally

  1. Install Dependencies:

    pip install diffusers transformers accelerate scipy safetensors
    
  2. Load and Run the Model:

    import torch
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-inpainting",
        torch_dtype=torch.float16,
    )
    pipe.to("cuda")
    prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
    # Ensure image and mask_image are PIL images
    image = pipe(prompt=prompt, image=image, mask_image=mask_image).images[0]
    image.save("./yellow_cat_on_park_bench.png")
    
  3. Performance Tips:

    • Install xformers for memory-efficient attention.
    • Use pipe.enable_attention_slicing() for reduced VRAM usage.
  4. Cloud GPUs:

    • Consider using cloud services like AWS or Google Cloud with A100 GPUs for better performance.
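The snippet in step 2 assumes `image` and `mask_image` are already PIL images; for this model, white mask pixels mark the region to repaint. One way to build placeholder inputs for a quick smoke test (the solid colours and mask rectangle here are arbitrary):

```python
from PIL import Image, ImageDraw

# In practice you would load your own files with Image.open(...).
init_image = Image.new("RGB", (512, 512), color=(120, 160, 200))

# Mask convention: black = keep, white = inpaint.
# Paint a white square in the centre as the region to regenerate.
mask_image = Image.new("L", (512, 512), color=0)
ImageDraw.Draw(mask_image).rectangle([192, 192, 320, 320], fill=255)
```

Both inputs should match in size; 512x512 matches the resolution the 512-inpainting-ema.ckpt checkpoint was trained at.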

License

The model is licensed under the CreativeML Open RAIL++-M License, which governs its use and distribution.
