Florence 2 Flux Large

gokaygokay

Introduction

Florence-2-Flux-Large is an image-text-to-text model: it takes an image together with a text prompt and generates text. It is used through the transformers library and works with English text. It targets tasks that pair text generation with art, and it relies on custom model code, which is why the loading example in this card passes trust_remote_code=True.

Architecture

The model is built on the base model microsoft/Florence-2-large and runs through the transformers library. It is loaded with the AutoModelForCausalLM class and generates text conditioned on an image input and a textual prompt.
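In the local-run guide in this card, the textual prompt is a task token (`<DESCRIPTION>`) followed directly by an instruction, and the parsed result is a dict keyed by that same token. The following is a minimal sketch of those two steps in plain Python. The helper names `build_prompt` and `extract_answer` are hypothetical, and the angle-bracket check is an assumption about what a task token looks like, not a rule taken from the model's code.

```python
def build_prompt(task_token: str, instruction: str) -> str:
    """Join a task token and an instruction the way this card's example does."""
    # Assumption: task tokens are wrapped in angle brackets, e.g. "<DESCRIPTION>".
    if not (task_token.startswith("<") and task_token.endswith(">")):
        raise ValueError(f"task token should look like '<NAME>', got {task_token!r}")
    # The card's example concatenates with no separator.
    return task_token + instruction


def extract_answer(parsed: dict, task_token: str) -> str:
    """Pull the generated text out of a parsed result keyed by the task token."""
    if task_token not in parsed:
        raise KeyError(f"{task_token!r} not found; available keys: {list(parsed)}")
    # Strip surrounding whitespace so the text is ready to print or store.
    return str(parsed[task_token]).strip()


prompt = build_prompt("<DESCRIPTION>", "Describe this image in great detail.")
print(prompt)
```

Keeping these steps in small functions makes a missing or mistyped task token fail with a clear message instead of a bare lookup error.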

Training

The model was trained on the kadirnar/fluxdev_controlnet_16k dataset, which supports its use for detailed image-to-text transformation.

Guide: Running Locally

To run Florence-2-Flux-Large locally, follow these steps:

  1. Install Dependencies:

    pip install -q transformers torch pillow requests datasets flash_attn timm einops
    
  2. Set Up Model and Processor:

    from transformers import AutoModelForCausalLM, AutoProcessor
    import torch
    
    # Use the GPU when one is available, otherwise fall back to the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # trust_remote_code=True is needed because the model ships its own modeling code.
    model = AutoModelForCausalLM.from_pretrained("gokaygokay/Florence-2-Flux-Large", trust_remote_code=True).to(device).eval()
    processor = AutoProcessor.from_pretrained("gokaygokay/Florence-2-Flux-Large", trust_remote_code=True)
    
  3. Run Example:

    from PIL import Image
    import requests
    
    def run_example(task_prompt, text_input, image):
        # The prompt is the task token followed directly by the instruction text.
        prompt = task_prompt + text_input
        # Convert to RGB so the processor always receives a 3-channel image.
        if image.mode != "RGB":
            image = image.convert("RGB")
        inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
        # Beam search with a mild repetition penalty.
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3,
            repetition_penalty=1.10,
        )
        # Keep special tokens here; post_process_generation parses them.
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
        parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
        return parsed_answer
    
    url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
    response = requests.get(url, stream=True)
    response.raise_for_status()  # stop early if the download failed
    image = Image.open(response.raw)
    answer = run_example("<DESCRIPTION>", "Describe this image in great detail.", image)
    # The parsed result is a dict keyed by the task token.
    final_answer = answer["<DESCRIPTION>"]
    print(final_answer)
    
  4. Consider Using Cloud GPUs: The code above falls back to the CPU when no GPU is found, but generation with a model of this size is much faster on a GPU. If no local GPU is available, consider a cloud GPU service such as AWS, Google Cloud, or Azure.
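Whether a GPU is available affects how the model from step 2 is best loaded. The sketch below shows one way to pick a device and a numeric precision. The helper name `pick_runtime` is hypothetical, and using float16 on CUDA is an assumption for illustration; this model card does not specify a precision.

```python
def pick_runtime(cuda_available: bool) -> dict:
    """Choose a device name and dtype name based on GPU availability."""
    if cuda_available:
        # Assumption: half precision on GPU to reduce memory use.
        return {"device": "cuda", "dtype": "float16"}
    # Full precision on CPU, where half precision is often slow or unsupported.
    return {"device": "cpu", "dtype": "float32"}


# Example with a hard-coded flag; in practice pass torch.cuda.is_available().
runtime = pick_runtime(cuda_available=False)
print(runtime["device"], runtime["dtype"])
```

One possible use, assuming torch is installed: call `pick_runtime(torch.cuda.is_available())`, pass `torch_dtype=getattr(torch, runtime["dtype"])` to `from_pretrained`, and move the model to `runtime["device"]`. If half precision is used, the processor outputs generally need to be cast to the same dtype before calling `generate`.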

License

The Florence-2-Flux-Large model is distributed under the Apache-2.0 license. This allows for both personal and commercial use, modifications, and distribution of the software, provided that the original license terms are met.
