Qwen2vl Flux

Djrango

Introduction

Qwen2vl-Flux is a multimodal image generation model that combines the FLUX framework with Qwen2VL's vision-language understanding. It generates images from text prompts and visual references, and relies on Qwen2VL to interpret both kinds of input and to give finer control over the result.

Architecture

The model integrates Qwen2VL's vision-language capabilities into the FLUX framework, featuring:

  • Vision-Language Understanding Module (Qwen2VL)
  • Enhanced FLUX Backbone
  • Multi-mode Generation Pipeline
  • Structural Control Integration

Features

Qwen2vl-Flux relies on Qwen2VL for multimodal comprehension. It supports several generation modes: image variation, img2img, inpainting, and ControlNet-guided generation. Structural control is provided through depth estimation and line detection, and spatial attention control supports focused generation.

Guide: Running Locally

  1. Clone the Repository and Install Dependencies:

    git clone https://github.com/erwold/qwen2vl-flux
    cd qwen2vl-flux
    pip install -r requirements.txt
    
  2. Download Model Checkpoints:

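    from huggingface_hub import snapshot_download

    # Downloads the repository into the local Hugging Face cache
    # and returns the path of the downloaded snapshot
    snapshot_download("Djrango/Qwen2vl-Flux")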
    
  3. Basic Example Usage:

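    from PIL import Image

    # model.py comes from the cloned repository, so run this from its root
    from model import FluxModel
    
    # Initialize model
    model = FluxModel(device="cuda")
    
    # Load the reference image (placeholder path; assumes the model accepts a PIL image)
    input_image = Image.open("path/to/input.jpg").convert("RGB")
    
    # Image Variation
    outputs = model.generate(
        input_image_a=input_image,
        prompt="Your text prompt",
        mode="variation"
    )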
    
  4. Technical Specifications:

    • Framework: PyTorch 2.4.1+
    • Memory Requirements: 48GB+ VRAM
    • Supported Image Sizes: Various aspect ratios up to 1536x1024
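
The size limit above can be checked before calling the model. The following is a minimal sketch, not code from the qwen2vl-flux repository: the helper name `fit_to_limit` is hypothetical, and it assumes that the 1536x1024 limit applies in either orientation and that dimensions should be rounded down to a multiple of 16. Neither assumption is stated in the specifications.

```python
# Hypothetical helper; not part of the qwen2vl-flux repository.
MAX_LONG_SIDE = 1536   # taken from the stated 1536x1024 limit
MAX_SHORT_SIDE = 1024


def fit_to_limit(width, height, multiple=16):
    """Shrink (width, height) to fit the size limit, keeping the aspect ratio.

    Images already inside the limit are not upscaled. Both sides are rounded
    down to a multiple of `multiple` (an assumed constraint).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    long_side, short_side = max(width, height), min(width, height)
    # Never scale up: cap the factor at 1.0
    scale = min(1.0, MAX_LONG_SIDE / long_side, MAX_SHORT_SIDE / short_side)

    def snap(value):
        # Round to the nearest pixel, then down to the nearest multiple
        return max(multiple, round(value * scale) // multiple * multiple)

    return snap(width), snap(height)
```

For example, `fit_to_limit(3072, 2048)` halves both sides, while an input already within the limit is only snapped to the assumed multiple.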

Consider using cloud GPUs, such as those from AWS, Google Cloud, or Azure, to meet the high VRAM requirements.
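
Whether a given machine, local or cloud, meets the VRAM requirement can be checked before loading the model. The sketch below is a rough preflight check under stated assumptions, not part of the repository: the function names are hypothetical, and "48GB" is read as 48 × 10^9 bytes so that cards sold as 48 GB, which typically report slightly less than their nominal capacity, still pass.

```python
# Hypothetical preflight check; not part of the qwen2vl-flux repository.
REQUIRED_VRAM_GB = 48  # from the stated memory requirement


def meets_vram_requirement(total_bytes, required_gb=REQUIRED_VRAM_GB):
    """Return True if total_bytes is at least required_gb (1 GB = 10**9 bytes)."""
    return total_bytes >= required_gb * 10**9


def detect_total_vram_bytes(device_index=0):
    """Return the total memory of a CUDA device in bytes.

    Returns None when PyTorch is not installed or no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(device_index).total_memory


def describe_vram_status():
    """Return a one-line, human-readable summary of the check."""
    total = detect_total_vram_bytes()
    if total is None:
        return "No CUDA device detected"
    verdict = "meets" if meets_vram_requirement(total) else "is below"
    return f"{total / 10**9:.1f} GB detected, which {verdict} the {REQUIRED_VRAM_GB} GB requirement"
```

Calling `print(describe_vram_status())` before constructing the model gives an early warning instead of an out-of-memory error partway through loading.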

License

  • The model is a derivative work based on:
    • FLUX.1 [dev] (Non-Commercial License)
    • Qwen2-VL (Apache 2.0)
  • It inherits non-commercial license restrictions from FLUX.1 [dev].
  • For commercial use, contact the FLUX team.
