Qwen2vl-Flux

Introduction

Qwen2vl-Flux is a multimodal image generation model that combines the FLUX framework with Qwen2VL's vision-language understanding. It generates images from text prompts and visual references, using Qwen2VL's multimodal comprehension to give finer control over the result.

Architecture

The model integrates Qwen2VL's vision-language capabilities into the FLUX framework, featuring:

- Vision-Language Understanding Module (Qwen2VL)
- Enhanced FLUX Backbone
- Multi-mode Generation Pipeline
- Structural Control Integration

Training

Qwen2vl-Flux uses Qwen2VL for multimodal comprehension. It supports several generation modes: image variation, img2img, inpainting, and ControlNet-guided generation. Structural control is provided through depth estimation and line detection, and flexible spatial attention mechanisms allow focused generation.

Guide: Running Locally

Clone the Repository and Install Dependencies:

git clone https://github.com/erwold/qwen2vl-flux
cd qwen2vl-flux
pip install -r requirements.txt

Download Model Checkpoints:

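from huggingface_hub import snapshot_download

# Downloads the checkpoint files to the local Hugging Face cache and returns their path
checkpoint_dir = snapshot_download("Djrango/Qwen2vl-Flux")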

Basic Example Usage:

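from PIL import Image

from model import FluxModel

# Initialize model
model = FluxModel(device="cuda")

# Load the reference image (placeholder path; assumes generate() accepts a PIL image)
input_image = Image.open("path/to/reference.png").convert("RGB")

# Image Variation
outputs = model.generate(
    input_image_a=input_image,
    prompt="Your text prompt",
    mode="variation"
)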

Technical Specifications:
- Framework: PyTorch 2.4.1+
- Memory Requirements: 48GB+ VRAM
- Supported Image Sizes: Various aspect ratios up to 1536x1024
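
To make the size limit concrete, the sketch below scales a requested width and height down until they fit the 1536x1024 limit in either orientation, preserving aspect ratio. It is not part of the repository: the helper name `fit_to_supported_size` is hypothetical, reading the limit as "long side at most 1536, short side at most 1024" is an assumption, and so is rounding down to a multiple of 16.

```python
from fractions import Fraction
from typing import Tuple

MAX_LONG_SIDE = 1536   # from the specification above
MAX_SHORT_SIDE = 1024  # from the specification above
MULTIPLE = 16          # assumption: dimensions are snapped to multiples of 16


def fit_to_supported_size(width: int, height: int) -> Tuple[int, int]:
    """Shrink (width, height) to fit the supported limit, keeping aspect ratio.

    Hypothetical helper for illustration; images that already fit are only
    snapped down to the nearest multiple of MULTIPLE, never enlarged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    long_side, short_side = max(width, height), min(width, height)

    # Exact rational scale factor, capped at 1 so small images are not upscaled.
    scale = min(
        Fraction(1),
        Fraction(MAX_LONG_SIDE, long_side),
        Fraction(MAX_SHORT_SIDE, short_side),
    )

    def snap(value: int) -> int:
        scaled = int(value * scale)  # floor, since all values are positive
        return max(MULTIPLE, scaled // MULTIPLE * MULTIPLE)

    return snap(width), snap(height)


if __name__ == "__main__":
    print(fit_to_supported_size(3072, 2048))
```

A caller could resize the reference image to the returned dimensions before passing it to the model, which avoids requesting an output larger than the documented maximum.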

Consider using cloud GPUs, such as those from AWS, Google Cloud, or Azure, to meet the high VRAM requirements.
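
Before loading the model, it can help to check whether the local GPU meets the memory requirement. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it reads "48GB" as 48 x 10^9 bytes, an assumption chosen because drivers usually report slightly less usable memory than a card's nominal capacity. It queries only the first CUDA device.

```python
from typing import Optional

REQUIRED_VRAM_BYTES = 48 * 10**9  # assumption: "48GB" read as decimal gigabytes


def has_enough_vram(total_bytes: int, required_bytes: int = REQUIRED_VRAM_BYTES) -> bool:
    """Return True if the reported device memory meets the requirement."""
    return total_bytes >= required_bytes


def detect_vram_bytes(device_index: int = 0) -> Optional[int]:
    """Return total memory of a CUDA device in bytes, or None if unavailable."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(device_index).total_memory


if __name__ == "__main__":
    vram = detect_vram_bytes()
    if vram is None:
        print("No CUDA device detected; consider a cloud GPU.")
    elif has_enough_vram(vram):
        print(f"GPU reports {vram / 10**9:.1f} GB; requirement met.")
    else:
        print(f"GPU reports {vram / 10**9:.1f} GB; below the 48GB requirement.")
```

Keeping the threshold comparison separate from the device query makes the check easy to reuse with memory figures taken from a cloud provider's instance listing.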

License

The model is a derivative work based on:
- FLUX.1 [dev] (Non-Commercial License)
- Qwen2-VL (Apache 2.0)

It inherits the non-commercial license restrictions of FLUX.1 [dev]. For commercial use, contact the FLUX team.