Torii Gate v0.3
MinthyIntroduction
ToriiGate-v0.3 is an advanced model for image captioning, particularly designed for anime art. It builds on its predecessor, ToriiGate-v0.2, and the Idefics3 framework. This model excels in understanding a broad array of images, including single or multiple characters, intricate scenes, comics, manga, and culturally rich concepts. It uses booru-tags grounding for detailed and accurate descriptions and handles NSFW content effectively.
Architecture
ToriiGate-v0.3 is based on the Idefics3 model, specifically the HuggingFaceM4/Idefics3-8B-Llama3. It is multimodal, supporting vision and text-to-text transformations. The model provides structured output, which is advantageous for further natural language processing (NLP).
Training
The model is trained on a dataset of 120,000 diverse and balanced anime pictures, captioned and processed with tools like Claude 3.0 Opus, Claude 3.5 Sonet, and GPT-4o. The training focuses on achieving high zero-shot and grounded accuracy, capable of producing structured captions for comics frame-by-frame. It offers three modes of output: brief descriptions, detailed descriptions, and structured JSON-like format.
Guide: Running Locally
-
Environment Setup: Ensure you have Python and a suitable version of PyTorch installed.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 pip install -r requirements.txt
-
Install Dependencies: For enhanced performance on Linux, consider installing Flash Attention-2. Ensure you have a development build of Transformers:
pip install git+https://github.com/huggingface/transformers
-
Inference Setup:
- Download the model using
huggingface_hub
. - Load the model using
AutoProcessor
andAutoModelForVision2Seq
. - Use a GPU for optimal performance; cloud GPUs like AWS, Google Cloud, or Azure can be beneficial.
- Download the model using
-
Example Script: Use the provided Python script to perform captioning on images by replacing the
user_prompt
variable with your desired instruction. -
VLLM Optimization: For faster inference, use VLLM, an optimized LLM serving engine. Install VLLM and use it to perform single or batch inference.
License
ToriiGate-v0.3 is licensed under the Apache-2.0 license, similar to the Idefics3 model. This allows for broad usage and modification, provided that proper credit is given and the same license terms are applied to any derivative works.