Florence-2-large

Microsoft
Introduction
Florence-2 is a vision foundation model developed by Microsoft that handles a wide range of vision and vision-language tasks. Using a prompt-based approach, it performs tasks such as captioning, object detection, and segmentation. The model is trained on the FLD-5B dataset, which contains 5.4 billion annotations across 126 million images, giving it strong multi-task learning capabilities. Its sequence-to-sequence architecture supports both zero-shot and fine-tuned settings, making it a competitive choice among vision foundation models.
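To make the prompt-based interface concrete, the snippet below sketches how a task is selected purely by the text prompt passed to the model. The specific task tokens listed follow the examples on the model card and should be verified there before use.

    # Florence-2 selects its task from a special prompt token rather than a
    # free-form instruction. The tokens below follow the model card's examples;
    # check the card for the full list of supported tasks.
    TASK_PROMPTS = {
        "caption": "<CAPTION>",
        "detailed_caption": "<DETAILED_CAPTION>",
        "object_detection": "<OD>",
        "ocr": "<OCR>",
        "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",  # expects extra text after the token
    }

    prompt = TASK_PROMPTS["object_detection"]  # used as the text input to the processor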

Architecture
Florence-2 uses a sequence-to-sequence architecture: an image and a simple text prompt are encoded, and the model generates text output for the requested task. The same architecture serves all supported tasks, in both zero-shot and fine-tuned settings, and is trained on the expansive FLD-5B dataset.

Training
The Florence-2 model is pretrained on the FLD-5B dataset and is available in several configurations, such as Florence-2-base and Florence-2-large, which can be further fine-tuned on specific tasks to improve performance. The continued pretraining uses a 4k context length but covers only 0.1B samples, leaving room for further training and improvement.
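As a rough sketch of what task-specific fine-tuning could look like, the example below runs a single training step on a dummy image-caption pair. The dummy data, learning rate, and the use of the processor's tokenizer to build labels are illustrative assumptions, not settings documented for the model.

    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForCausalLM

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True).to(device)
    processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

    # Dummy image-caption pair purely for illustration; replace with a real task dataset.
    image = Image.new("RGB", (512, 512), color="white")
    prompt, target = "<CAPTION>", "a plain white square"

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    labels = processor.tokenizer(target, return_tensors="pt").input_ids.to(device)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)  # assumed hyperparameter
    model.train()
    outputs = model(input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"], labels=labels)
    outputs.loss.backward()  # one illustrative optimization step
    optimizer.step()
    optimizer.zero_grad()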

Guide: Running Locally

  1. Setup: Ensure you have Python and PyTorch installed. Use a virtual environment for isolation.
  2. Install Transformers: Run pip install transformers to get the necessary library.
  3. Load the Model:
    import torch
    from transformers import AutoProcessor, AutoModelForCausalLM

    # Use a GPU with float16 when available; otherwise fall back to CPU with float32.
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    # trust_remote_code=True is required because Florence-2 ships custom modeling code.
    model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
    processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
    
  4. Run Inference: Pass an image and a task prompt (for example "<OD>" for object detection or "<CAPTION>" for captioning) through the processor and model; see the example sketch after this list.
  5. Cloud GPUs: For optimal performance, use cloud services like AWS, GCP, or Azure with GPU support to handle the intensive computations efficiently.
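
The inference step might look like the following sketch, which follows the pattern in the model card's example code: build inputs from an image and a task prompt, generate, and convert the raw text into structured output with the processor's post-processing helper. The sample image URL is only a placeholder, and the snippet assumes model, processor, device, and torch_dtype from step 3.

    import requests
    from PIL import Image

    # Placeholder image; any RGB image works. Assumes model, processor, device,
    # and torch_dtype from step 3 are already defined.
    url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    prompt = "<OD>"  # object detection; other task prompts include "<CAPTION>" and "<OCR>"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=False,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Parse the generated text into structured results (boxes and labels for "<OD>").
    parsed = processor.post_process_generation(generated_text, task=prompt, image_size=(image.width, image.height))
    print(parsed)

For "<OD>", the returned dictionary contains bounding boxes in pixel coordinates and their labels, keyed by the task prompt.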

License
The Florence-2 model is released under the MIT License. For more details, refer to the license file.
