Janus-1.3B
deepseek-ai
Introduction
Janus is a novel autoregressive framework for unified multimodal understanding and generation. It addresses the limitations of previous approaches by decoupling visual encoding into separate pathways while retaining a single, unified transformer for processing. This decoupling alleviates the conflict between the visual encoder's roles in understanding and generation and makes the framework more flexible. Janus surpasses previous unified models and matches or exceeds the performance of task-specific models, making it a strong candidate for next-generation unified multimodal models.
Architecture
Janus is built on DeepSeek-LLM-1.3b-base, which was trained on a corpus of approximately 500 billion text tokens. For multimodal understanding, it uses SigLIP-L as the vision encoder, supporting 384 × 384 image input. For image generation, it uses the image tokenizer from LlamaGen with a downsampling rate of 16.
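To make the decoupled design concrete, here is a toy PyTorch sketch. It is purely illustrative: the class name, feature widths, layer counts, and the vanilla encoder stack are placeholders, not Janus's actual implementation. What it shows is the key idea from the paper: a continuous semantic pathway for understanding and a discrete codebook pathway for generation, both feeding one shared transformer.

```python
import torch
import torch.nn as nn

class JanusSketch(nn.Module):
    """Toy illustration of decoupled visual encoding (not the real model)."""

    def __init__(self, d_model=2048, text_vocab=102400, image_vocab=16384):
        super().__init__()
        # Understanding pathway: a semantic vision encoder (SigLIP-L in Janus)
        # yields continuous features; an adaptor maps them into the LLM's
        # embedding space. The 1024 feature width is a placeholder.
        self.understanding_adaptor = nn.Linear(1024, d_model)
        # Generation pathway: discrete codes from a VQ image tokenizer
        # (LlamaGen's, downsample rate 16) are embedded like ordinary tokens.
        self.generation_embed = nn.Embedding(image_vocab, d_model)
        self.text_embed = nn.Embedding(text_vocab, d_model)
        # One unified transformer processes everything (DeepSeek-LLM-1.3b-base
        # in Janus); a small vanilla encoder stack stands in here.
        layer = nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_ids, vision_feats=None, image_codes=None):
        parts = [self.text_embed(text_ids)]
        if vision_feats is not None:   # multimodal-understanding input
            parts.append(self.understanding_adaptor(vision_feats))
        if image_codes is not None:    # image-generation input
            parts.append(self.generation_embed(image_codes))
        return self.transformer(torch.cat(parts, dim=1))

# Example: a text prompt plus continuous vision features (understanding mode).
model = JanusSketch()
out = model(torch.randint(0, 102400, (1, 8)),
            vision_feats=torch.randn(1, 576, 1024))
```

The point of the separation is that each pathway can use the representation best suited to its task, while the transformer itself stays task-agnostic.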
Training
Training starts from the pretrained DeepSeek-LLM-1.3b-base and the SigLIP-L vision encoder described above, and proceeds over a large corpus of text and visual data. Because the visual encoding pathways for understanding and generation are decoupled, each can be optimized for its own task without interfering with the other, enhancing performance on both.
Guide: Running Locally
To run Janus locally, perform the following steps:
- Clone the GitHub repository.
- Set up the required environment and dependencies as listed in the repository.
- Download the model weights and tokenizer configurations.
- Execute the model using the sample scripts provided in the repository; a minimal inference sketch follows this list.
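The following multimodal-understanding sketch is adapted from the usage example in the Janus repository's README (https://github.com/deepseek-ai/Janus). The module paths (`janus.models`, `janus.utils.io`), the `VLChatProcessor` API, and the `prepare_inputs_embeds` helper are taken from that README as of this writing; verify them against the repository before relying on them. `./example.jpg` is a placeholder path.

```python
# Setup, run once in a shell:
#   git clone https://github.com/deepseek-ai/Janus.git
#   cd Janus && pip install -e .
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

model_path = "deepseek-ai/Janus-1.3B"
processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
model: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
model = model.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>\nDescribe this image.",
        "images": ["./example.jpg"],  # placeholder image path
    },
    {"role": "Assistant", "content": ""},
]

# Preprocess the conversation and image into model-ready tensors.
pil_images = load_pil_images(conversation)
inputs = processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(model.device)

# Fuse text and image embeddings, then decode autoregressively.
inputs_embeds = model.prepare_inputs_embeds(**inputs)
outputs = model.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=processor.tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
)
print(processor.tokenizer.decode(outputs[0].cpu().tolist(),
                                 skip_special_tokens=True))
```

The repository also ships an image-generation example that drives the LlamaGen-derived tokenizer; the loading pattern is analogous.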
For optimal performance, run on a GPU; cloud GPUs such as those offered by AWS, Google Cloud, or Azure are recommended.
License
The Janus code repository is licensed under the MIT License. The use of Janus models is subject to the DeepSeek Model License.