Phi-3-Mini-128K-Instruct

Microsoft

Introduction

Phi-3-Mini-128K-Instruct is a 3.8-billion-parameter model in the Phi-3 family, designed for text generation tasks. It was trained on a combination of synthetic data and filtered, high-quality public data, with an emphasis on reasoning capabilities. The model delivers strong performance for its size and is intended for memory- and compute-constrained environments and latency-bound scenarios.

Architecture

Phi-3-Mini-128K-Instruct is a dense, decoder-only Transformer model. It supports a context length of 128K tokens and was aligned with human preferences and safety guidelines through supervised fine-tuning (SFT) and direct preference optimization (DPO). The model uses flash attention by default, which requires a sufficiently recent GPU.
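
Because flash attention only runs on newer NVIDIA GPUs, it can help to check the device's compute capability before loading the model. The snippet below is a rough sketch that assumes FlashAttention-2 needs compute capability 8.0 or higher (Ampere-class or newer); the threshold is an assumption to adjust for your hardware, and the resulting value can be passed to from_pretrained() as attn_implementation.

    import torch

    # Rough check: FlashAttention-2 generally requires compute capability >= 8.0
    # (A100/A6000 are 8.x, H100 is 9.0, V100 is only 7.0).
    major, minor = torch.cuda.get_device_capability(0)
    attn_implementation = "flash_attention_2" if major >= 8 else "eager"
    print(f"Compute capability {major}.{minor} -> attn_implementation={attn_implementation}")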

Training

The model was trained over 10 days on 512 H100-80GB GPUs, using a dataset of 4.9 trillion tokens. The training data combined public documents, synthetic data, and high-quality supervised chat-format data. The focus was on improving reasoning ability, with rigorous filtering that prioritizes data enhancing reasoning over raw knowledge.

Guide: Running Locally

To run the Phi-3-Mini-128K-Instruct model locally:

  1. Install Dependencies:

    • Use the development version of the transformers library:
      pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers
      
    • Ensure required packages are installed:
      pip install torch==2.3.1 accelerate==0.31.0 flash_attn==2.5.8
      
  2. Load the Model:

    • Pass trust_remote_code=True to from_pretrained() and load the matching tokenizer:
      from transformers import AutoModelForCausalLM, AutoTokenizer
      model = AutoModelForCausalLM.from_pretrained(
          "microsoft/Phi-3-mini-128k-instruct",
          device_map="cuda",
          torch_dtype="auto",
          trust_remote_code=True,
      )
      tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
      
  3. Run Inference:

    • Example code for generating text with a pipeline, reusing the model and tokenizer loaded above:
      from transformers import pipeline

      # Example prompt in the standard role/content chat format.
      messages = [{"role": "user", "content": "Explain what a 128K context window is."}]
      pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
      output = pipe(messages, max_new_tokens=500, return_full_text=False, temperature=0.0, do_sample=False)
      print(output[0]['generated_text'])
      
  4. Cloud GPUs:

    • For optimal performance, use an NVIDIA A100, A6000, or H100. On older GPUs such as the V100, set the attention implementation to "eager", as sketched below.
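
A minimal sketch of that fallback, reusing the model ID from above and the attn_implementation argument of from_pretrained():

    from transformers import AutoModelForCausalLM

    # On pre-Ampere GPUs (e.g. V100), flash attention is not supported, so
    # request the "eager" attention implementation when loading the model.
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Phi-3-mini-128k-instruct",
        device_map="cuda",
        torch_dtype="auto",
        trust_remote_code=True,
        attn_implementation="eager",
    )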

License

The model is released under the MIT License. For more details, refer to the license document.
