Florence-2-large-PromptGen v2.0

MiaoshouAI

Introduction

Florence-2-large-PromptGen v2.0 is an advanced image-captioning model from MiaoshouAI that builds on its predecessor, PromptGen v1.5. It focuses on improved image captioning and analysis, delivering high-quality results with a memory-efficient footprint.

Architecture

The model uses a lightweight architecture that requires only slightly more than 1GB of VRAM, enabling fast and efficient image captioning. It integrates with the Flux workflow, facilitating simultaneous use of the T5XXL and CLIP_L text encoders and streamlining the caption-generation process.
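
As a rough way to verify that footprint on your own hardware, the sketch below (our own illustration, not from the model card; it assumes the transformers and torch packages are installed) loads the checkpoint in half precision and prints the resulting parameter memory:

    import torch
    from transformers import AutoModelForCausalLM

    # Illustrative only: loading in float16 roughly halves the parameter memory.
    model = AutoModelForCausalLM.from_pretrained(
        "MiaoshouAI/Florence-2-large-PromptGen-v2.0",
        torch_dtype=torch.float16,
        trust_remote_code=True,
    )
    print(f"Parameter memory: {model.get_memory_footprint() / 1e9:.2f} GB")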

Features

Florence-2-large-PromptGen v2.0 introduces several new features (a short usage sketch follows the list):

  • Enhanced caption quality for <GENERATE_TAGS>, <DETAILED_CAPTION>, and <MORE_DETAILED_CAPTION>.
  • A new <ANALYZE> instruction for comprehensive image composition analysis.
  • The addition of a <MIXED_CAPTION> instruction for combined detailed captions and tags.
  • <MIXED_CAPTION_PLUS> for enhanced mixed caption and analysis capabilities.
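
As a minimal sketch of how these instruction tags are used (assuming model, processor, image, and device are set up as in the guide below; the run_task helper name is our own):

    # Hypothetical helper: run any PromptGen instruction tag against one image.
    def run_task(task, model, processor, image, device):
        inputs = processor(text=task, images=image, return_tensors="pt").to(device)
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            do_sample=False,
            num_beams=3,
        )
        text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
        return processor.post_process_generation(text, task=task, image_size=(image.width, image.height))

    for task in ["<GENERATE_TAGS>", "<DETAILED_CAPTION>", "<MIXED_CAPTION>", "<ANALYZE>"]:
        print(run_task(task, model, processor, image, device))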

Guide: Running Locally

  1. Load the Model:
    Install the necessary libraries (transformers, torch, and Pillow), then load the model and processor. The snippet below also defines the device used in later steps:

    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = AutoModelForCausalLM.from_pretrained("MiaoshouAI/Florence-2-large-PromptGen-v2.0", trust_remote_code=True).to(device)
    processor = AutoProcessor.from_pretrained("MiaoshouAI/Florence-2-large-PromptGen-v2.0", trust_remote_code=True)
    
  2. Prepare an Image:
    Load an image using a URL or from a local path:

    from PIL import Image
    import requests
    
    url = "YOUR_IMAGE_URL"
    image = Image.open(requests.get(url, stream=True).raw)
    # Or load from a local path instead:
    # image = Image.open("YOUR_IMAGE_PATH")
    
  3. Generate Captions:
    Use the processor to generate captions:

    prompt = "<MORE_DETAILED_CAPTION>"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=prompt, image_size=(image.width, image.height))
    
    print(parsed_answer)
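
    With the stock Florence-2 post-processing, parsed_answer is a dictionary keyed by the task token, so the caption string itself can be read as parsed_answer["<MORE_DETAILED_CAPTION>"].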
    
  4. Cloud GPUs:
    For optimal performance, consider running the model on cloud platforms that offer GPU support, such as AWS, Google Cloud, or Azure.
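
    Whatever platform you choose, a quick check (generic torch code, not specific to this model) confirms that a GPU is actually visible before loading:

    import torch

    # Generic check: confirm a CUDA GPU is available before loading the model.
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))
    else:
        print("No GPU found; the model will run on CPU, which is considerably slower.")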

License

The Florence-2-large-PromptGen v2.0 model is released under the MIT License, allowing for flexibility in use and modification.
