SD3.5-Large IP-Adapter

InstantX

Introduction

The SD3.5-Large-IP-Adapter is an IP-Adapter developed for the SD3.5-Large model by the InstantX Team. It adds image-prompt conditioning to text-to-image generation and is used through a customized Stable Diffusion 3 pipeline.

Architecture

The IP-Adapter adds new layers to all 38 blocks of the SD3.5-Large transformer. It uses google/siglip-so400m-patch14-384 as the image encoder, chosen for its strong performance, and a TimeResampler to project the encoded image into image tokens; the number of image tokens is set to 64.
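Purely as a rough illustration of the resampling idea (not the InstantX TimeResampler implementation), the sketch below uses 64 learnable query tokens that cross-attend to SigLIP patch embeddings and projects them to the transformer's width; the class name, layer sizes, and the omission of timestep conditioning are simplifying assumptions.

    import torch
    import torch.nn as nn

    class ImageTokenResampler(nn.Module):  # hypothetical, simplified module
        def __init__(self, num_tokens=64, dim=1152, out_dim=2432, heads=8):
            super().__init__()
            # 64 learnable query tokens that will summarize the image
            self.queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
            self.norm_q = nn.LayerNorm(dim)
            self.norm_kv = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.proj_out = nn.Linear(dim, out_dim)

        def forward(self, image_embeds):               # (B, N_patches, dim)
            b = image_embeds.shape[0]
            q = self.norm_q(self.queries).expand(b, -1, -1)
            kv = self.norm_kv(image_embeds)
            tokens, _ = self.attn(q, kv, kv)           # queries attend to patches
            return self.proj_out(tokens)               # (B, 64, out_dim)

    # siglip-so400m-patch14-384 yields 729 patch embeddings of width 1152;
    # 2432 is assumed here as a stand-in for the SD3.5-Large hidden size.
    feats = torch.randn(1, 729, 1152)
    print(ImageTokenResampler()(feats).shape)          # torch.Size([1, 64, 2432])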

Training

Specific training details are not provided. The architecture indicates that the improvements come from the layers added to each transformer block and the SigLIP-based image encoding, which together condition generation on a reference image.

Guide: Running Locally

To run the SD3.5-Large-IP-Adapter locally, follow these steps:

  1. Install Necessary Libraries: Ensure that torch and Pillow (the maintained PIL fork) are installed, along with the pipeline's other dependencies.
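     One possible install command; diffusers, transformers, and sentencepiece are assumptions here (for the pipeline base class, the SigLIP image encoder, and the SD3 text tokenizer), so check the repository for the exact requirements:

    pip install torch pillow diffusers transformers sentencepiece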
  2. Download Model Files: Obtain the base model and the image encoder from the Hugging Face Hub, and the IP-Adapter weights (ip-adapter.bin) from this repository, matching the paths used in the code below.
  3. Load Model: Use the provided code snippet to load and initialize the model:
    import torch
    from PIL import Image
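    # Note: the next two imports resolve to local files shipped alongside this
    # adapter (models/transformer_sd3.py and pipeline_stable_diffusion_3_ipa.py),
    # not to classes from the stock diffusers package.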
    from models.transformer_sd3 import SD3Transformer2DModel
    from pipeline_stable_diffusion_3_ipa import StableDiffusion3Pipeline
    
    model_path = 'stabilityai/stable-diffusion-3.5-large'
    ip_adapter_path = './ip-adapter.bin'
    image_encoder_path = "google/siglip-so400m-patch14-384"
    
    transformer = SD3Transformer2DModel.from_pretrained(
        model_path, subfolder="transformer", torch_dtype=torch.bfloat16
    )
    
    pipe = StableDiffusion3Pipeline.from_pretrained(
        model_path, transformer=transformer, torch_dtype=torch.bfloat16
    ).to("cuda")
    
    pipe.init_ipadapter(
        ip_adapter_path=ip_adapter_path, 
        image_encoder_path=image_encoder_path, 
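        # nb_token matches the 64 image tokens described in the Architecture section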
        nb_token=64, 
    )
    
  4. Generate Images: Use the pipeline to generate an image from the prompt and reference image (a short variant that sweeps ipadapter_scale follows this list):
    ref_img = Image.open('./assets/1.jpg').convert('RGB')
    
    image = pipe(
        width=1024,
        height=1024,
        prompt='a cat',
        negative_prompt="lowres, low quality, worst quality",
        num_inference_steps=24, 
        guidance_scale=5.0,
        generator=torch.Generator("cuda").manual_seed(42),
        clip_image=ref_img,
        ipadapter_scale=0.5,
    ).images[0]
    image.save('./result.jpg')
    
  5. Environment: The snippet above expects a CUDA-capable GPU; a cloud GPU, such as those offered by AWS or Google Cloud, is recommended for good performance.
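
As a usage sketch that reuses the pipe and ref_img objects from steps 3 and 4, sweeping ipadapter_scale shows how strongly the reference image constrains the output: lower values typically weight the text prompt more heavily, higher values the reference image. The specific values below are arbitrary.

    # Reuses `pipe` and `ref_img` from the steps above; scale values are arbitrary.
    for scale in (0.3, 0.5, 0.8):
        image = pipe(
            width=1024,
            height=1024,
            prompt='a cat',
            negative_prompt="lowres, low quality, worst quality",
            num_inference_steps=24,
            guidance_scale=5.0,
            generator=torch.Generator("cuda").manual_seed(42),
            clip_image=ref_img,
            ipadapter_scale=scale,
        ).images[0]
        image.save(f'./result_scale_{scale}.jpg')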

License

The model is released under the stabilityai-ai-community license. For more details, refer to the license link. All rights are reserved.
