Qwen2vl Flux
Djrango/Qwen2vl-Flux
Introduction
Qwen2vl-Flux is a state-of-the-art multimodal image generation model that combines the FLUX framework with Qwen2VL's vision-language understanding capabilities. It excels at generating high-quality images from text prompts and visual references, providing enhanced multimodal comprehension and control.
Architecture
The model integrates Qwen2VL's vision-language capabilities into the FLUX framework, featuring:
- Vision-Language Understanding Module (Qwen2VL)
- Enhanced FLUX Backbone
- Multi-mode Generation Pipeline
- Structural Control Integration
Features
Qwen2vl-Flux utilizes Qwen2VL for superior multimodal comprehension. It supports multiple generation modes, including image variation, img2img, inpainting, and controlnet-guided generation. The model features structural control with depth estimation and line detection, alongside flexible spatial attention mechanisms for focused generation.
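The flexible spatial attention mentioned above can be pictured as weighting a focus region of the image grid more heavily than the background. The sketch below only illustrates that idea; the function name, weight values, and mask shape are assumptions, not part of the model's actual API.

```python
# Illustrative sketch only: builds a 2D weight grid that emphasizes a
# rectangular focus region, as a spatial attention mask might. The
# function and its parameters are hypothetical, not Qwen2vl-Flux's API.

def spatial_attention_mask(width, height, region, focus=1.0, background=0.2):
    """Return a height x width grid weighting `region` (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region
    return [
        [focus if x0 <= x < x1 and y0 <= y < y1 else background
         for x in range(width)]
        for y in range(height)
    ]

# Emphasize the central 4x4 patch of an 8x8 grid
mask = spatial_attention_mask(8, 8, region=(2, 2, 6, 6))
```

In the real pipeline such weights would modulate cross-attention inside the FLUX backbone rather than be applied to raw pixels.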
Guide: Running Locally
- Clone the Repository and Install Dependencies:

  git clone https://github.com/erwold/qwen2vl-flux
  cd qwen2vl-flux
  pip install -r requirements.txt
- Download Model Checkpoints:

  from huggingface_hub import snapshot_download
  snapshot_download("Djrango/Qwen2vl-Flux")
- Basic Example Usage:

  from model import FluxModel

  # Initialize model
  model = FluxModel(device="cuda")

  # Image Variation
  outputs = model.generate(
      input_image_a=input_image,
      prompt="Your text prompt",
      mode="variation"
  )
- Technical Specifications:
- Framework: PyTorch 2.4.1+
- Memory Requirements: 48GB+ VRAM
- Supported Image Sizes: Various aspect ratios up to 1536x1024
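The Basic Example Usage step passes mode="variation"; the other modes listed in the Features section (img2img, inpainting, controlnet) presumably go through the same generate entry point. The helper below sketches how the keyword arguments might be assembled per mode; every argument name other than prompt, mode, and input_image_a is an assumption to be checked against the repository.

```python
# Hedged sketch: per-mode keyword arguments for FluxModel.generate.
# "prompt", "mode", and "input_image_a" appear in the repository's basic
# example; "mask_image" and "control_image" are assumed names.

def generate_kwargs(mode, prompt, image, mask=None, control=None):
    kwargs = {"mode": mode, "prompt": prompt, "input_image_a": image}
    if mode == "inpaint" and mask is not None:
        kwargs["mask_image"] = mask        # region to repaint (assumed name)
    if mode == "controlnet" and control is not None:
        kwargs["control_image"] = control  # depth/line map (assumed name)
    return kwargs

# outputs = model.generate(**generate_kwargs("img2img", "A snowy street", input_image))
```

Centralizing the call this way makes it easy to validate mode-specific inputs before committing GPU memory to a generation run.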
Consider using cloud GPUs, such as those from AWS, Google Cloud, or Azure, to meet the high VRAM requirements.
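Since supported sizes top out around 1536x1024, it can help to clamp a requested resolution to that pixel budget before generation. The helper below is a hedged sketch: the divisible-by-16 constraint is a common latent-diffusion convention, not something the repository states.

```python
# Hedged helper: scale a requested size down to the 1536x1024 pixel
# budget, keeping dimensions divisible by 16 (an assumed constraint).

MAX_PIXELS = 1536 * 1024

def fit_size(width, height, multiple=16):
    """Return (w, h) scaled to fit MAX_PIXELS, preserving aspect ratio."""
    scale = min(1.0, (MAX_PIXELS / (width * height)) ** 0.5)
    w = max(multiple, int(width * scale) // multiple * multiple)
    h = max(multiple, int(height * scale) // multiple * multiple)
    return w, h
```

For example, a 2048x2048 request would be scaled down until its area fits within the 1536x1024 budget.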
License
- The model is a derivative work based on:
- FLUX.1 [dev] (Non-Commercial License)
- Qwen2-VL (Apache 2.0)
- It inherits non-commercial license restrictions from FLUX.1 [dev].
- For commercial use, contact the FLUX team.