Phi-3-Vision-128K-Instruct

Introduction
Phi-3-Vision-128K-Instruct is an advanced multimodal model from Microsoft that combines language and vision capabilities. It belongs to the Phi-3 model family, supports a context length of up to 128K tokens, and is optimized for precise instruction adherence and safety. The model is intended for general-purpose AI systems that take both image and text inputs, with a focus on efficient language and multimodal models.

Architecture
The model has 4.2 billion parameters and integrates an image encoder, connector, projector, and the Phi-3 Mini language model. It processes both text and images, is optimized for chat-format prompts, and supports a context length of 128K tokens, making it well suited to memory- and compute-constrained environments.
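
Because the model is tuned for chat-format prompts, images are referenced through numbered placeholder tokens inside the user turn. The authoritative template ships with the checkpoint's tokenizer (apply it via apply_chat_template rather than hand-writing tokens), so treat the layout below as an illustrative sketch; the question text is a made-up example:

    <|user|>
    <|image_1|>
    What is shown in this image?<|end|>
    <|assistant|>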

Training
The model was trained using 512 H100-80G GPUs over 1.5 days, processing 500 billion vision and text tokens. The training data includes publicly available documents, high-quality educational data, synthetic "textbook-like" data, and supervised data in chat format. The training datasets are meticulously filtered to ensure privacy and quality.

Guide: Running Locally

  1. Environment Setup: Ensure you have Python and the necessary libraries installed, including transformers, torch, flash_attn, numpy, Pillow, and requests.
  2. Model Installation: Use pip to install the development version of transformers if not using the official release:
    pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers
    
  3. Run the Model: Load the model with the AutoModelForCausalLM and AutoProcessor classes from transformers.
  4. Sample Code Execution: Process an image and generate text, as in the sketch after this list.
  5. GPU Recommendation: Use cloud GPUs like NVIDIA A100, A6000, or H100 for optimal performance.
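
Putting steps 1-5 together, the following is a minimal sketch of a local run, not the official sample: it assumes the published checkpoint ID microsoft/Phi-3-vision-128k-instruct, a CUDA-capable GPU, and a placeholder image URL (substitute your own). trust_remote_code=True is required because the vision components ship as custom code.

    from PIL import Image
    import requests
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "microsoft/Phi-3-vision-128k-instruct"

    # Swap attn_implementation to "eager" if flash_attn is not installed
    # (assumption: both backends are accepted by this checkpoint's custom code).
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="cuda",
        torch_dtype="auto",
        trust_remote_code=True,
        attn_implementation="flash_attention_2",
    )
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    # The <|image_1|> placeholder marks where the processor splices in image features.
    messages = [{"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"}]

    # Placeholder URL -- replace with any publicly reachable image.
    url = "https://example.com/sample.png"
    image = Image.open(requests.get(url, stream=True).raw)

    prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

    generate_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=500,
        do_sample=False,
    )
    # Strip the prompt tokens so only the newly generated answer is decoded.
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    print(response)

Greedy decoding (do_sample=False) keeps answers deterministic; raise max_new_tokens for longer outputs.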

License
The model is available under the MIT license, allowing for broad use with minimal restrictions. For further details, refer to the license document.
