Florence 2 large
microsoft
Introduction
Florence-2 is a vision foundation model developed by Microsoft that handles a range of vision and vision-language tasks. It is prompt-based: a short text prompt selects the task, such as captioning, object detection, or segmentation. The model is trained on the FLD-5B dataset, which contains 5.4 billion annotations across 126 million images and supports multi-task learning. Its sequence-to-sequence architecture works in both zero-shot and fine-tuned settings.
Architecture
Florence-2 uses a single sequence-to-sequence architecture for all of its tasks. The model takes an image together with a simple text prompt and carries out the vision task the prompt names, so no task-specific head is selected by the caller. The same design serves both zero-shot and fine-tuned use, with FLD-5B as the training data.
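The prompt-based interface can be sketched as a small helper that turns a task token and an optional text input into the prompt string. The task tokens below are taken from the upstream model card rather than from this summary, the list is not exhaustive, and `build_prompt` is a hypothetical helper, not part of any library.

```python
# Task tokens as listed on the upstream model card (assumption: this summary
# does not enumerate them, and the sets below are not exhaustive).
TASKS_WITHOUT_TEXT = {
    "<CAPTION>",
    "<DETAILED_CAPTION>",
    "<MORE_DETAILED_CAPTION>",
    "<OD>",
    "<OCR>",
}
TASKS_WITH_TEXT = {
    "<CAPTION_TO_PHRASE_GROUNDING>",
    "<REFERRING_EXPRESSION_SEGMENTATION>",
}


def build_prompt(task, text_input=None):
    """Return the text prompt for one task (hypothetical helper)."""
    if task in TASKS_WITH_TEXT:
        if not text_input:
            raise ValueError(f"{task} needs an extra text input")
        # The extra text is appended directly after the task token.
        return task + text_input
    if task in TASKS_WITHOUT_TEXT:
        if text_input:
            raise ValueError(f"{task} does not take a text input")
        return task
    raise ValueError(f"unknown task token: {task!r}")
```

For example, `build_prompt("<OD>")` returns the bare task token, while `build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "A green car")` appends the phrase to be grounded.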
Training
Florence-2 is pretrained on the FLD-5B dataset and is available in different configurations, such as Florence-2-base and Florence-2-large. These models can be fine-tuned on specific tasks to improve performance. The continued pretraining uses a context length of 4k but only 0.1B samples, so the model may benefit from further training.
Guide: Running Locally
- Setup: Ensure you have Python and PyTorch installed. Use a virtual environment for isolation.
- Install Transformers: Run
    pip install transformers
  to get the necessary library.
- Load the Model:
    import torch
    from transformers import AutoProcessor, AutoModelForCausalLM

    # Use the GPU with half precision when available, otherwise CPU with float32.
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    # trust_remote_code=True lets Transformers run the model code shipped in the repository.
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Florence-2-large",
        torch_dtype=torch_dtype,
        trust_remote_code=True,
    ).to(device)
    processor = AutoProcessor.from_pretrained(
        "microsoft/Florence-2-large",
        trust_remote_code=True,
    )
- Run Inference: Use the processor and model to run tasks with image inputs and text prompts.
- Cloud GPUs: If no local GPU is available, a GPU instance from a cloud service such as AWS, GCP, or Azure can handle the intensive computation.
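The Run Inference step can be sketched as follows. This is a minimal sketch modeled on the upstream model card's example, not a definitive implementation: `run_task` and `detections` are hypothetical helper names, the generation settings (`num_beams=3`, `max_new_tokens=1024`) are assumptions copied from that example, and the object-detection result layout (`bboxes` and `labels` under the task key) is likewise assumed from the upstream card.

```python
def run_task(model, processor, image, task, text_input=None, max_new_tokens=1024):
    """Run one Florence-2 task on a PIL image and return the parsed result."""
    # The prompt is the task token, optionally followed by extra text.
    prompt = task if text_input is None else task + text_input

    # The processor tokenizes the prompt and preprocesses the image; .to()
    # moves the tensors to the model's device and casts floats to its dtype.
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = inputs.to(model.device, model.dtype)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=max_new_tokens,
        num_beams=3,
        do_sample=False,
    )

    # Special tokens are kept, as in the upstream example, before the
    # processor converts the raw text into a task-specific structure.
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        text, task=task, image_size=(image.width, image.height)
    )


def detections(parsed, task="<OD>"):
    """Pair each label with its bounding box from an object-detection result."""
    result = parsed[task]
    return list(zip(result["labels"], result["bboxes"]))
```

With `model` and `processor` loaded as in the Load the Model step and `image` opened with Pillow, a call such as `detections(run_task(model, processor, image, "<OD>"))` would yield label and box pairs, and `run_task(model, processor, image, "<CAPTION>")` would return a caption keyed by the task token.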
License
The Florence-2 model is released under the MIT License. For more details, refer to the license file.