Introduction

Emu3 is a state-of-the-art multimodal model suite developed by the Beijing Academy of Artificial Intelligence (BAAI). It handles any-to-any tasks by training a single transformer on multimodal sequences with next-token prediction alone. Emu3 excels at both generation and perception, outperforming several well-established task-specific models without relying on diffusion or compositional architectures.

Architecture

Emu3 tokenizes images, text, and videos into a shared discrete space, enabling a single transformer to be trained from scratch on mixed multimodal sequences. It supports flexible resolutions and styles in image generation and produces coherent text responses without relying on CLIP or a pretrained large language model (LLM). Emu3 also generates video causally, predicting the next token in the sequence rather than using a video diffusion model.
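Discrete visual tokenization is typically done by vector quantization: each continuous patch feature is snapped to the nearest entry of a learned codebook, and that entry's index becomes the token. The toy nearest-neighbour lookup below illustrates the idea only; it is not Emu3's actual vision tokenizer.

```python
def quantize(vec, codebook):
    """Map a continuous feature vector to the id of its nearest codebook entry."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist2(vec, codebook[i]))

# A tiny illustrative codebook of 2-d entries.
codebook = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
]
token_id = quantize([0.9, 0.1], codebook)
print(token_id)  # nearest entry is [1.0, 0.0], so the token id is 1
```

Once every patch is replaced by such an id, images become ordinary token sequences that a transformer can model alongside text.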

Training

The model is trained with a single objective: predicting the next token in a sequence, whether that sequence encodes text, images, or video. This unified training method eliminates the need for more complex, task-specific architectures, allowing Emu3 to perform effectively across a variety of tasks.
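Concretely, next-token prediction reduces to a cross-entropy loss over the vocabulary at each position, applied identically whether the target is a text token or a discretized image token. A minimal pure-Python sketch of that loss:

```python
import math

def next_token_loss(logits, targets):
    """Average cross-entropy of predicting each next token.

    logits: per-position lists of scores over the vocabulary
    targets: ground-truth next-token ids, one per position
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        z = max(scores)  # stabilize the softmax
        log_norm = z + math.log(sum(math.exp(s - z) for s in scores))
        total += log_norm - scores[target]  # -log p(target)
    return total / len(targets)

# In a unified sequence, some positions predict text tokens and others
# predict discretized image tokens; the objective is the same for both.
logits = [
    [2.0, 0.5, 0.1, 0.1],  # position whose target is a text token
    [0.1, 0.2, 3.0, 0.1],  # position whose target is an image token
]
targets = [0, 2]
loss = next_token_loss(logits, targets)
print(loss)
```

Training simply minimizes this quantity over the mixed multimodal corpus, which is why no modality-specific head or diffusion stage is needed.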

Guide: Running Locally

To run Emu3 locally, follow these steps:

  1. Environment Setup:

    • Install necessary libraries such as transformers and torch.
    • Ensure you have access to a GPU for optimal performance.
  2. Model Preparation:

    • Load the model and processors using Hugging Face's transformers library.
    • Use the AutoModelForCausalLM class to load the Emu3 model from the Hugging Face Hub.
  3. Input Preparation:

    • Define a positive prompt describing the desired output and a negative prompt (used for classifier-free guidance).
    • Use the Emu3Processor to process these inputs, specifying parameters like mode, ratio, and image area.
  4. Hyperparameters and Generation:

    • Create a GenerationConfig object to set generation parameters such as max_new_tokens and top_k.
    • Use LogitsProcessorList to apply constraints and guidance during generation.
  5. Output Generation:

    • Call the generate method on the model, passing in the processed inputs and generation configuration.
    • Decode and save the generated images using the processor.
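Taken together, the steps above can be sketched end to end. The snippet below follows the shape of the official Emu3-Gen example, but the repository ids, the processor arguments (mode, ratio, image_area), and the processor helper methods are assumptions to verify against the current model card; running it requires a large GPU and the model weights.

```python
# Hedged sketch of steps 1-5; names marked below are assumptions from the
# BAAI/Emu3-Gen example and may differ in the version you download.
import torch
from transformers import (
    AutoModel, AutoModelForCausalLM, AutoTokenizer, AutoImageProcessor,
    GenerationConfig,
)
from transformers.generation import (
    LogitsProcessorList,
    PrefixConstrainedLogitsProcessor,
    UnbatchedClassifierFreeGuidanceLogitsProcessor,
)
from emu3.mllm.processing_emu3 import Emu3Processor  # ships with the BAAI Emu3 code

# Steps 1-2: load the language model and the vision tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu3-Gen", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu3-Gen", trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(
    "BAAI/Emu3-VisionTokenizer", trust_remote_code=True
)
image_tokenizer = AutoModel.from_pretrained(
    "BAAI/Emu3-VisionTokenizer", trust_remote_code=True
).to("cuda")
processor = Emu3Processor(image_processor, image_tokenizer, tokenizer)

# Step 3: positive and negative prompts; the negative prompt feeds
# classifier-free guidance.
prompt = "a portrait of a young girl"
kwargs = dict(mode="G", ratio="1:1", image_area=518400, return_tensors="pt")
pos_inputs = processor(text=prompt, **kwargs)
neg_inputs = processor(text="", **kwargs)

# Step 4: generation hyperparameters plus guidance/constraint processors.
config = GenerationConfig(
    use_cache=True,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
    max_new_tokens=40960,
    do_sample=True,
    top_k=2048,
)
h, w = pos_inputs.image_size[0]
logits_processor = LogitsProcessorList([
    UnbatchedClassifierFreeGuidanceLogitsProcessor(
        3.0, model, unconditional_ids=neg_inputs.input_ids.to("cuda")
    ),
    PrefixConstrainedLogitsProcessor(
        processor.build_prefix_constrained_fn(h, w), num_beams=1
    ),
])

# Step 5: generate, then decode token ids back into an image.
outputs = model.generate(
    pos_inputs.input_ids.to("cuda"), config, logits_processor=logits_processor
)
for item in processor.decode(outputs[0]):
    if hasattr(item, "save"):  # PIL images come back alongside any text
        item.save("result.png")
```

The two logits processors implement the "constraints and guidance" mentioned in step 4: classifier-free guidance steers sampling away from the negative prompt, and the prefix constraint keeps generated tokens consistent with the requested image dimensions.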

Suggested Cloud GPUs

For better performance, consider using cloud GPU services such as Google Cloud's GPU offerings, Amazon Web Services (AWS), or Microsoft Azure.

License

Emu3 is released under the Apache 2.0 License, allowing for extensive freedom in usage and modification. Please review the full license text for further details.
