Stable Diffusion v1.5 Inpainting

Introduction

Stable Diffusion Inpainting is a latent text-to-image diffusion model that generates photo-realistic images from text input. It additionally supports inpainting: given an image and a mask, it regenerates the masked region guided by the text prompt.

Architecture

The model is a latent diffusion model that uses a pretrained CLIP ViT-L/14 text encoder to process text prompts. Its UNet carries additional input channels for inpainting: alongside the noisy latents, it receives the VAE-encoded masked image and the downsampled mask.
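
One way to verify this is to inspect the UNet configuration without downloading the full weights. A minimal sketch, assuming the diffusers install from the guide below; the 4 + 4 + 1 channel split (noisy latents, encoded masked image, mask) is the standard inpainting layout:

    from diffusers import UNet2DConditionModel

    # Load only the UNet config (no weights) to inspect its input channels.
    config = UNet2DConditionModel.load_config(
        "runwayml/stable-diffusion-inpainting", subfolder="unet"
    )
    # 9 = 4 (noisy latents) + 4 (VAE-encoded masked image) + 1 (mask)
    print(config["in_channels"])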

Training

The model was initialized from the Stable-Diffusion-v1-2 weights and trained for 595k steps of regular training, followed by 440k steps of inpainting training at a resolution of 512x512 on the "laion-aesthetics v2 5+" dataset. Training ran on 32 × 8 A100 GPUs with the AdamW optimizer and an effective batch size of 2048.
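
For reference, the effective batch size of 2048 factors across nodes, GPUs per node, gradient accumulation, and per-GPU batch. A small sanity-check sketch; the exact factorization follows the upstream Stable Diffusion model card and should be treated as an assumption here:

    # Assumed split of the effective batch size (per the upstream model card).
    nodes = 32           # machines
    gpus_per_node = 8    # A100 GPUs per machine
    grad_accum = 2       # gradient accumulation steps
    per_gpu_batch = 4    # samples per GPU per step
    assert nodes * gpus_per_node * grad_accum * per_gpu_batch == 2048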

Guide: Running Locally

  1. Set Up Environment: Install the diffusers library and its core dependencies using pip:

    pip install diffusers transformers torch
    
  2. Load the Model: Use the StableDiffusionInpaintPipeline class:

    import torch
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting",
        revision="fp16",
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")  # the fp16 weights require a CUDA GPU
    
  3. Generate Images: Prepare a prompt and an image/mask pair for inpainting (a loading sketch for the images follows this list):

    prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
    # `image` and `mask_image` must be PIL images of the same size;
    # white mask pixels are repainted, black pixels are kept.
    image = pipe(prompt=prompt, image=image, mask_image=mask_image).images[0]
    image.save("./yellow_cat_on_park_bench.png")
    
  4. Cloud GPUs: If no local GPU is available, run the model on a cloud GPU, for example on AWS or in Google Colab.
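
Steps 2 and 3 leave `image` and `mask_image` undefined; the pipeline expects them as PIL images of matching size. A minimal loading sketch, where the file names input.png and mask.png are placeholders for your own data:

    from PIL import Image

    # Placeholder file names; substitute your own image and mask.
    # The mask is grayscale: white marks the region to repaint,
    # black marks the region to preserve.
    image = Image.open("input.png").convert("RGB").resize((512, 512))
    mask_image = Image.open("mask.png").convert("L").resize((512, 512))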

License

The model is released under the CreativeML OpenRAIL-M license. The license permits open access and commercial use, but prohibits generating harmful or illegal content, and redistribution must carry the same license terms. Please review the full license text before use.
