Introduction

Kandinsky 3.0 is an open-source text-to-image diffusion model built on the Kandinsky2-x model family. It is designed to generate images with a focus on Russian cultural elements. Compared with its predecessors, it improves text comprehension and visual quality by enlarging the text encoder and the diffusion U-Net.

Architecture

The model architecture comprises three main components:

  • Text Encoder (Flan-UL2): 8.6 billion parameters.
  • Latent Diffusion U-Net: 3 billion parameters.
  • MoVQ Encoder/Decoder: 267 million parameters.
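
As a rough sanity check on the component sizes listed above, the sketch below sums the parameter counts and estimates the memory needed just to hold the weights in half precision. The 2 bytes per parameter figure assumes fp16 storage, as in the fp16 variant used by the examples in this guide. Activations and framework overhead are not counted, and the helper names are illustrative.

```python
# Parameter counts as listed in the Architecture section.
COMPONENT_PARAMS = {
    "text_encoder_flan_ul2": 8.6e9,
    "latent_diffusion_unet": 3.0e9,
    "movq_encoder_decoder": 267e6,
}

BYTES_PER_PARAM_FP16 = 2  # assumption: weights stored as 16-bit floats


def total_params(components):
    """Sum the parameter counts of all components."""
    return sum(components.values())


def weight_memory_gb(components, bytes_per_param=BYTES_PER_PARAM_FP16):
    """Estimate memory for the weights alone, in decimal gigabytes."""
    return total_params(components) * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"Total parameters: {total_params(COMPONENT_PARAMS) / 1e9:.2f}B")
    print(f"fp16 weights: ~{weight_memory_gb(COMPONENT_PARAMS):.1f} GB")
```

Under these assumptions the three components add up to roughly 11.9 billion parameters, or about 23.7 GB of fp16 weights. This is a lower bound on memory use and not a measured requirement.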

Training

Two models are released:

  • Base Model: Trained over 2 million steps using 400 A100 GPUs.
  • Inpainting Model: Initialized from the base model's final checkpoint and further trained for 250,000 steps on 300 A100 GPUs.

Guide: Running Locally

Installation

To run Kandinsky 3.0 locally, install diffusers from source and upgrade transformers and accelerate:

pip install git+https://github.com/huggingface/diffusers.git
pip install --upgrade transformers accelerate
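
After installation, it can be useful to confirm which versions ended up in the environment. The following is a minimal sketch that uses only the standard library. The helper name `installed_versions` is illustrative, and `torch` is included on the assumption that it is installed separately, since the examples in this guide import it.

```python
from importlib import metadata

# Packages used by the examples in this guide (torch is assumed to be
# installed separately; it is not covered by the pip commands above).
REQUIRED_PACKAGES = ("diffusers", "transformers", "accelerate", "torch")


def installed_versions(packages=REQUIRED_PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

A `None` entry means the package was not found in the active environment, which usually indicates that pip installed into a different interpreter.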

Text-to-Image Generation

from diffusers import AutoPipelineForText2Image
import torch

# Load the fp16 variant of the weights to reduce memory use.
pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
)
# Keep submodels on the CPU and move each to the GPU only while it runs
# (requires accelerate).
pipe.enable_model_cpu_offload()

prompt = "A photograph of the inside of a subway train. There are raccoons sitting on the seats. One of them is reading a newspaper. The window shows the city in the background."

# Fix the seed so the result is reproducible.
generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]

Image-to-Image Generation

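from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image
import torch

# Load the fp16 variant of the weights to reduce memory use.
pipe = AutoPipelineForImage2Image.from_pretrained(
    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
)
# Keep submodels on the CPU and move each to the GPU only while it runs
# (requires accelerate).
pipe.enable_model_cpu_offload()

prompt = "A painting of the inside of a subway train with tiny raccoons."
# Reference image that the pipeline uses as its starting point.
image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky3/t2i.png")

# Fix the seed so the result is reproducible.
generator = torch.Generator(device="cpu").manual_seed(0)
# strength controls how much the input image is altered (higher changes it more).
image = pipe(prompt, image=image, strength=0.75, num_inference_steps=25, generator=generator).images[0]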
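
The `strength` argument controls how far the input image is pushed toward noise before denoising. In many diffusers image-to-image pipelines it also shortens the schedule, so that only about `num_inference_steps * strength` denoising steps are actually run. The sketch below reproduces that arithmetic in plain Python. It assumes the Kandinsky 3.0 pipeline follows this common convention, which is worth checking against the installed diffusers version. The function name is illustrative and is not part of the diffusers API.

```python
def effective_denoising_steps(num_inference_steps, strength):
    """Return (steps_to_run, steps_skipped) under the assumed convention."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Assumption: the pipeline truncates the product to an integer and
    # never runs more than the requested number of steps.
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    steps_skipped = max(num_inference_steps - steps_to_run, 0)
    return steps_to_run, steps_skipped


if __name__ == "__main__":
    # Values taken from the image-to-image example above.
    print(effective_denoising_steps(25, 0.75))
```

With the settings from the example (25 steps, strength 0.75), this convention would run 18 denoising steps and skip the first 7. A strength of 1.0 would run all 25 steps and largely ignore the input image.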

Cloud GPUs

If a suitable local GPU is not available, the model can be run on cloud GPU instances from providers such as AWS, Google Cloud, or Azure.

License

Kandinsky 3.0 is licensed under the Apache 2.0 License, allowing for broad use and modification.
