Phi-3.5-vision-instruct
Introduction
Phi-3.5-vision is a state-of-the-art open multimodal model designed for high-quality reasoning over text and image inputs. It supports a 128K-token context length and belongs to the Phi-3 model family. The model was further enhanced through supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.
Architecture
Phi-3.5-vision consists of 4.2 billion parameters, including an image encoder, connector, projector, and a Phi-3 Mini language model. It's designed for text and image inputs and supports a 128K context length. Training involved 256 A100-80G GPUs over six days on a 500 billion token dataset.
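As a quick sanity check on the figures above, the published configuration can be inspected with transformers. This is a minimal sketch, assuming the standard transformers config attribute name for the context window:

```python
from transformers import AutoConfig

# Minimal sketch: read the context window from the published config.
# The attribute name is assumed from the standard transformers config schema.
config = AutoConfig.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct", trust_remote_code=True
)
print(config.max_position_embeddings)  # expected: 131072 (~128K tokens)
```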
Training
Training data includes public documents, high-quality educational data, synthetic textbook-like data, and chat format supervised data. The model focuses on teaching math, coding, common sense reasoning, and general knowledge. It incorporates a safety post-training approach using SFT and RLHF with human-labeled datasets.
Guide: Running Locally
- Install Required Packages:

  ```text
  flash_attn==2.5.8
  numpy==1.24.4
  Pillow==10.3.0
  Requests==2.31.0
  torch==2.3.0
  torchvision==0.18.0
  transformers==4.43.0
  accelerate==0.30.0
  ```
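  Optionally, you can confirm that the pinned versions are installed before loading the model. This is a hypothetical helper sketch, not part of the original guide:

  ```python
  # Hypothetical helper (not from the model card): check that the pinned
  # package versions listed above are installed.
  from importlib.metadata import version, PackageNotFoundError

  required = {
      "flash_attn": "2.5.8", "numpy": "1.24.4", "Pillow": "10.3.0",
      "requests": "2.31.0", "torch": "2.3.0", "torchvision": "0.18.0",
      "transformers": "4.43.0", "accelerate": "0.30.0",
  }
  for pkg, want in required.items():
      try:
          have = version(pkg)
          status = "OK" if have == want else f"found {have}, expected {want}"
      except PackageNotFoundError:
          status = "missing"
      print(f"{pkg:<12} {status}")
  ```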
- Load the Model:

  ```python
  from PIL import Image
  import requests
  from transformers import AutoModelForCausalLM, AutoProcessor

  model_id = "microsoft/Phi-3.5-vision-instruct"

  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      device_map="cuda",
      trust_remote_code=True,
      torch_dtype="auto",
      _attn_implementation='flash_attention_2'
  )

  processor = AutoProcessor.from_pretrained(
      model_id,
      trust_remote_code=True,
      num_crops=4
  )
  ```
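  Note: per the model card, `_attn_implementation='eager'` can be used instead of flash attention on GPUs that do not support it, and `num_crops=16` is suggested for single-frame input (with `num_crops=4` for multi-frame scenarios).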
- Run Inference:
  - Load images and prepare prompts using the processor.
  - Generate responses with the model's `generate` method, as shown in the sketch below.
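  A minimal end-to-end sketch of this step follows, continuing from the `model` and `processor` created in the previous step. The image URL, prompt text, and generation settings are illustrative and not part of the original guide:

  ```python
  # Continues from the "Load the Model" step: `model` and `processor`
  # are assumed to already be in scope.
  from PIL import Image
  import requests

  # Illustrative example image; any local or remote image works.
  url = "https://www.ilankelman.org/stopsigns/australia.jpg"
  image = Image.open(requests.get(url, stream=True).raw)

  # Images are referenced in the prompt via <|image_N|> placeholders.
  messages = [{"role": "user", "content": "<|image_1|>\nDescribe this image."}]
  prompt = processor.tokenizer.apply_chat_template(
      messages, tokenize=False, add_generation_prompt=True
  )

  inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

  generate_ids = model.generate(
      **inputs,
      max_new_tokens=500,
      eos_token_id=processor.tokenizer.eos_token_id,
  )

  # Drop the prompt tokens so only the newly generated reply is decoded.
  generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
  response = processor.batch_decode(
      generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
  )[0]
  print(response)
  ```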
- Cloud GPU Suggestion: For optimal performance, consider using NVIDIA GPUs such as the A100, A6000, or H100.
License
The model is licensed under the MIT License, allowing broad use and modification. Refer to the license text for full terms.