Introduction

DALL·E Mini is a transformer-based text-to-image generation model developed by Boris Dayma and collaborators. It is an open-source effort to replicate OpenAI's DALL·E, generating images from natural-language prompts. The model is intended for research and personal use, such as supporting creative work and exploring what a text-to-image model produces for a given prompt.

Architecture

DALL·E Mini pairs a VQGAN with a sequence-to-sequence BART model. The VQGAN encodes each image into a sequence of discrete tokens, while the BART encoder processes the text description. Conditioned on the encoded text, the BART decoder then predicts the image-token sequence autoregressively, one token at a time. Training minimizes a softmax cross-entropy loss between the decoder's predictions and the actual VQGAN encoding of the image.
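
To make the objective concrete, here is a minimal sketch of that loss in JAX. The callables vqgan_encode, bart_encode_text, and bart_decode are hypothetical stand-ins for the real model components, not the project's actual API.

```python
import jax.numpy as jnp
import optax  # common JAX loss/optimizer library


def image_token_loss(text_ids, image, vqgan_encode, bart_encode_text, bart_decode):
    """Sketch of the DALL·E Mini objective; the three callables are
    hypothetical stand-ins for the VQGAN and BART components."""
    # Target sequence: the image quantized into VQGAN codebook indices.
    image_tokens = vqgan_encode(image)                   # (seq_len,) int ids

    # Condition on the caption via the BART encoder.
    text_hidden = bart_encode_text(text_ids)

    # Teacher forcing: the decoder sees the target shifted right by one
    # position and predicts the token at each step.
    decoder_input = jnp.pad(image_tokens[:-1], (1, 0))   # prepend BOS id 0
    logits = bart_decode(text_hidden, decoder_input)     # (seq_len, vocab)

    # Softmax cross-entropy between predictions and the actual encoding.
    return optax.softmax_cross_entropy_with_integer_labels(
        logits, image_tokens
    ).mean()
```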

Training

DALL·E Mini was trained on datasets including Conceptual Captions, Conceptual 12M, and a filtered subset of YFCC100M. Images were encoded into token sequences with the VQGAN, while descriptions were tokenized for BART. Training ran on a TPU v3-8 for roughly three days, using memory and throughput optimizations such as gradient checkpointing and the Distributed Shampoo optimizer.
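
The actual training code lives in the repository; the snippet below only illustrates the gradient-checkpointing idea on a toy module, using Flax's nn.remat.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class FeedForward(nn.Module):
    features: int

    @nn.compact
    def __call__(self, x):
        x = nn.Dense(self.features)(x)
        return nn.gelu(x)


# nn.remat discards this module's intermediate activations on the forward
# pass and recomputes them during the backward pass, trading extra compute
# for a smaller memory footprint on the accelerator.
CheckpointedFeedForward = nn.remat(FeedForward)

x = jnp.ones((8, 512))
block = CheckpointedFeedForward(features=512)
params = block.init(jax.random.PRNGKey(0), x)
y = block.apply(params, x)
```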

Guide: Running Locally

  1. Setup Environment: Install necessary libraries such as transformers, jax, and flax.
  2. Download Model: Clone the DALL·E Mini repository from GitHub and download the pre-trained weights.
  3. Inference: Use text prompts to generate images using the model's inference script (a minimal sketch follows this list).
  4. Hardware Recommendations: For optimal performance, consider using cloud GPUs like NVIDIA A100 or TPU instances on Google Cloud.
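
The following sketch walks through steps 1–3, following the pattern of the project's published inference notebook. The model references and keyword arguments are assumptions that may differ between releases, so verify them against the current README before use.

```python
# Step 1: pip install dalle-mini  (pulls in jax, flax, and transformers)
import jax

# Imports and model references follow the project's inference notebook;
# treat them as assumptions and verify against the current repository.
from dalle_mini import DalleBart, DalleBartProcessor
from vqgan_jax.modeling_flax_vqgan import VQModel

DALLE_MODEL = "dalle-mini/dalle-mini/mini-1:v0"     # wandb artifact reference
VQGAN_REPO = "dalle-mini/vqgan_imagenet_f16_16384"  # VQGAN weights

# Step 2: download the pre-trained weights.
model, params = DalleBart.from_pretrained(DALLE_MODEL, _do_init=False)
vqgan, vqgan_params = VQModel.from_pretrained(VQGAN_REPO, _do_init=False)
processor = DalleBartProcessor.from_pretrained(DALLE_MODEL)

# Step 3: encode the prompt, sample image tokens, then decode with the VQGAN.
tokenized = processor(["a watercolor painting of a fox"])
encoded = model.generate(**tokenized, prng_key=jax.random.PRNGKey(0), params=params)
images = vqgan.decode_code(encoded.sequences[..., 1:], params=vqgan_params)  # drop BOS
```

The official notebook additionally wraps generation and decoding in jax.pmap to sample several candidate images per prompt in parallel, which is the more typical setup on the hardware recommended in step 4.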

License

DALL·E Mini is distributed under the Apache 2.0 License, which permits use, modification, and redistribution, provided that the license text and required notices are retained.
