Stable Diffusion Image Conditioned (lambdalabs)
Introduction
The Stable Diffusion Image Variations model is a fine-tuned version of the original Stable Diffusion v1-3, designed to generate image variations using CLIP image embeddings. This enables creative uses such as generating art and design variations, similar to DALL·E 2.
Architecture
This model swaps the text encoder of Stable Diffusion for an image encoder: a CLIP ViT-L/14 vision encoder followed by a final projection layer into the CLIP shared embedding space. As a result, generation is conditioned on image embeddings rather than text prompts.
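For illustration, a conditioning embedding of this kind can be computed with the Hugging Face transformers implementation of CLIP. This is a minimal sketch only; the openai/clip-vit-large-patch14 checkpoint, file names, and variable names below are assumptions for the example, not details taken from the original release.

from PIL import Image
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Load a ViT-L/14 vision tower plus its projection head (assumed checkpoint).
encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("input.jpg").convert("RGB")  # any RGB input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# A 768-dimensional embedding in the CLIP shared space; in this model it plays
# the role that the text-encoder output plays in standard Stable Diffusion.
image_embeds = outputs.image_embeds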
Training
The model was trained on subsets of the LAION-2B dataset, using 4 A6000 GPUs provided by Lambda GPU Cloud. The training process involved 87,000 steps with a batch size of 24, using the AdamW optimizer. The learning rate was warmed up to 0.0001 over 1,000 steps and then kept constant. Training employed a modified version of the original Stable Diffusion code.
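The learning-rate schedule described above, a linear warmup to the target value over 1,000 steps followed by a constant rate, can be sketched in PyTorch as follows. This is only an illustrative sketch using the numbers quoted above; the placeholder parameters and names are not taken from the actual training code.

import torch

WARMUP_STEPS = 1_000
TARGET_LR = 1e-4  # the 0.0001 learning rate quoted above

# Placeholder parameter stands in for the fine-tuned diffusion model weights.
params = [torch.nn.Parameter(torch.zeros(10))]
optimizer = torch.optim.AdamW(params, lr=TARGET_LR)

# Scale factor ramps linearly from 0 to 1 over the warmup, then stays at 1,
# so the effective learning rate warms up to TARGET_LR and is held constant.
def lr_lambda(step):
    return min(1.0, (step + 1) / WARMUP_STEPS)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)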
Guide: Running Locally
To run the model locally, follow these steps:
- Clone the repository:
git clone https://github.com/justinpinkney/stable-diffusion.git
cd stable-diffusion

- Download the model checkpoint:
mkdir -p models/ldm/stable-diffusion-v1
wget https://huggingface.co/lambdalabs/stable-diffusion-image-conditioned/resolve/main/sd-clip-vit-l14-img-embed_ema_only.ckpt -O models/ldm/stable-diffusion-v1/sd-clip-vit-l14-img-embed_ema_only.ckpt

- Install the required packages:
pip install -r requirements.txt

- Run the Gradio variations script:
python scripts/gradio_variations.py
For enhanced performance, consider using cloud GPUs such as those offered by Lambda GPU Cloud.
License
The model is distributed under an "other" license. Users should review the specific licensing terms provided in the model repository to ensure compliance.