Emu3-Gen
BAAI
Introduction
Emu3 is a state-of-the-art multimodal model suite developed by the Beijing Academy of Artificial Intelligence (BAAI). It is designed to handle any-to-any transformations by training a single transformer on multimodal sequences using next-token prediction. Emu3 excels in both generation and perception tasks, outperforming several well-known models without needing diffusion or compositional architectures.
Architecture
Emu3 tokenizes images, text, and videos into a discrete space, enabling the training of a transformer model from scratch. It supports flexible resolutions and styles in image generation and provides coherent text responses without relying on CLIP or pretrained large language models (LLMs). Emu3 also generates video sequences causally, predicting the next token without using video diffusion models.
Training
The model is trained by predicting the next token in a sequence, whether the tokens represent text, images, or video. This single training objective eliminates the need for more complex architectures and allows Emu3 to perform effectively across a variety of tasks.
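As a rough illustration of this unified objective (not Emu3's actual training code), the sketch below computes a next-token cross-entropy loss over a single sequence of discrete token IDs in PyTorch. The vocabulary size, toy sequence, and tiny stand-in model are placeholders; the point is only that text, image, and video tokens share one ID space and one loss.

```python
import torch
import torch.nn.functional as F

# Illustrative only: one shared discrete vocabulary in which text, image,
# and video tokens all live (the size here is a placeholder).
VOCAB_SIZE = 32000

# A toy "multimodal" sequence of token IDs, e.g. text tokens followed by
# vision tokens produced by a discrete vision tokenizer.
tokens = torch.randint(0, VOCAB_SIZE, (1, 16))  # (batch, seq_len)

class ToyCausalLM(torch.nn.Module):
    """Stand-in for the transformer: maps token IDs to per-position logits."""

    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.proj = torch.nn.Linear(dim, vocab_size)

    def forward(self, ids):
        return self.proj(self.embed(ids))  # (batch, seq_len, vocab)

model = ToyCausalLM(VOCAB_SIZE)
logits = model(tokens)

# Next-token prediction: position t is trained to predict token t+1,
# regardless of which modality the token came from.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, VOCAB_SIZE),
    tokens[:, 1:].reshape(-1),
)
print(f"next-token cross-entropy: {loss.item():.3f}")
```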
Guide: Running Locally
To run Emu3 locally, follow these steps (a consolidated code sketch follows the list):

1. Environment Setup:
   - Install the necessary libraries, such as transformers and torch.
   - Ensure you have access to a GPU for optimal performance.

2. Model Preparation:
   - Load the model and processors using Hugging Face's transformers library.
   - Use the AutoModelForCausalLM class to load the Emu3 model from the Hugging Face Hub.

3. Input Preparation:
   - Define positive and negative prompts.
   - Use the Emu3Processor to process these inputs, specifying parameters like mode, ratio, and image area.

4. Hyperparameters and Generation:
   - Create a GenerationConfig object to set generation parameters such as max_new_tokens and top_k.
   - Use a LogitsProcessorList to apply constraints and guidance during generation.

5. Output Generation:
   - Call the generate method on the model, passing in the processed inputs and generation configuration.
   - Decode and save the generated images using the processor.
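Putting the steps together, here is a minimal text-to-image sketch. The generic transformers pieces (AutoModelForCausalLM, AutoTokenizer, AutoImageProcessor, GenerationConfig, LogitsProcessorList, PrefixConstrainedLogitsProcessor, UnbatchedClassifierFreeGuidanceLogitsProcessor) are standard library classes; the Emu3Processor import path, its mode/ratio/image_area arguments, the build_prefix_constrained_fn helper, the decode behavior, and the BAAI/Emu3-Gen and BAAI/Emu3-VisionTokenizer hub IDs are assumptions based on the upstream BAAI/Emu3 repository and should be checked against its current README. You will also need pillow installed alongside transformers and torch.

```python
# Minimal text-to-image sketch, assuming the upstream BAAI/Emu3 code layout.
import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoModel,
    AutoModelForCausalLM,
    AutoTokenizer,
    GenerationConfig,
)
from transformers.generation import (
    LogitsProcessorList,
    PrefixConstrainedLogitsProcessor,
    UnbatchedClassifierFreeGuidanceLogitsProcessor,
)
# Assumption: Emu3Processor is provided by the upstream BAAI/Emu3 repository.
from emu3.mllm.processing_emu3 import Emu3Processor

EMU_HUB = "BAAI/Emu3-Gen"              # language model
VQ_HUB = "BAAI/Emu3-VisionTokenizer"   # discrete vision tokenizer

# Steps 1-2: environment and model preparation.
model = AutoModelForCausalLM.from_pretrained(
    EMU_HUB, torch_dtype=torch.bfloat16, device_map="cuda:0", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(EMU_HUB, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(VQ_HUB, trust_remote_code=True)
image_tokenizer = AutoModel.from_pretrained(
    VQ_HUB, device_map="cuda:0", trust_remote_code=True
).eval()
processor = Emu3Processor(image_processor, image_tokenizer, tokenizer)

# Step 3: input preparation with positive and negative prompts.
# The mode/ratio/image_area keywords follow the upstream repository.
prompt = "a portrait of a young girl, masterpiece, best quality."
negative_prompt = "lowres, bad anatomy, worst quality."
kwargs = dict(mode="G", ratio="1:1", image_area=model.config.image_area, return_tensors="pt")
pos_inputs = processor(text=prompt, **kwargs)
neg_inputs = processor(text=negative_prompt, **kwargs)

# Step 4: sampling hyperparameters plus constrained decoding and
# classifier-free guidance applied through logits processors.
generation_config = GenerationConfig(
    use_cache=True,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
    max_new_tokens=40960,
    do_sample=True,
    top_k=2048,
)
h, w = pos_inputs.image_size[0]
logits_processor = LogitsProcessorList([
    UnbatchedClassifierFreeGuidanceLogitsProcessor(
        3.0, model, unconditional_ids=neg_inputs.input_ids.to(model.device)
    ),
    PrefixConstrainedLogitsProcessor(
        processor.build_prefix_constrained_fn(h, w), num_beams=1
    ),
])

# Step 5: generate vision tokens, then decode them back into images.
outputs = model.generate(
    pos_inputs.input_ids.to(model.device),
    generation_config=generation_config,
    logits_processor=logits_processor,
    attention_mask=pos_inputs.attention_mask.to(model.device),
)
for idx, item in enumerate(processor.decode(outputs[0])):
    if isinstance(item, Image.Image):
        item.save(f"result_{idx}.png")
```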
Suggested Cloud GPUs
For better performance, consider using cloud GPU services such as Google Cloud's GPU offerings, Amazon Web Services (AWS), or Microsoft Azure.
License
Emu3 is released under the Apache 2.0 License, allowing for extensive freedom in usage and modification. Please review the full license text for further details.