Phi-3-Mini-128K-Instruct
Introduction
The Phi-3-Mini-128K-Instruct is a 3.8 billion-parameter model, part of the Phi-3 family, designed for text generation tasks. It uses a combination of synthetic and high-quality public datasets for training, focusing on reasoning capabilities. This model demonstrates strong performance for its size and is intended for use in constrained environments and latency-bound scenarios.
Architecture
Phi-3-Mini-128K-Instruct is a dense, decoder-only Transformer model. It supports a context length of 128K tokens and is aligned with human preferences and safety guidelines through supervised fine-tuning (SFT) and direct preference optimization (DPO). The model uses flash attention by default, which requires compatible GPU hardware.
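As an illustration of that hardware requirement, the attention implementation can be chosen explicitly when loading the model. This is a minimal sketch assuming the Hugging Face model id `microsoft/Phi-3-mini-128k-instruct` and a CUDA-capable machine; `"flash_attention_2"` requires a recent NVIDIA GPU and the `flash_attn` package, while `"eager"` is the fallback for older cards:

```python
from transformers import AutoModelForCausalLM

# Load the model with an explicit attention implementation.
# "flash_attention_2" needs a recent NVIDIA GPU and the flash_attn package;
# switch to "eager" on older hardware such as the V100.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # or "eager"
)
```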
Training
The model was trained over 10 days on 512 H100-80G GPUs with a dataset of 4.9 trillion tokens. The training data included public documents, synthetic data, and high-quality chat-format data for supervised fine-tuning. The emphasis was on improving reasoning ability, with rigorous filtering that prioritizes data enhancing reasoning over raw knowledge.
Guide: Running Locally
To run the Phi-3-Mini-128K-Instruct model locally:
1. Install dependencies:
   - Use the development version of the `transformers` library:
     ```bash
     pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers
     ```
   - Ensure the other required packages are installed:
     ```bash
     pip install torch==2.3.1 accelerate==0.31.0 flash_attn==2.5.8
     ```
2. Load the model and tokenizer, passing `trust_remote_code=True` to `from_pretrained()`:
   ```python
   from transformers import AutoModelForCausalLM, AutoTokenizer

   model = AutoModelForCausalLM.from_pretrained(
       "microsoft/Phi-3-mini-128k-instruct",
       device_map="cuda",
       torch_dtype="auto",
       trust_remote_code=True,
   )
   tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
   ```
3. Run inference, for example with a text-generation pipeline and chat-format messages:
   ```python
   from transformers import pipeline

   # Example chat-format prompt
   messages = [{"role": "user", "content": "Explain the importance of a long context window."}]

   pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
   output = pipe(messages, max_new_tokens=500, return_full_text=False, temperature=0.0, do_sample=False)
   print(output[0]["generated_text"])
   ```
4. Cloud GPUs: for optimal performance, use an NVIDIA A100, A6000, or H100. On older GPUs such as the V100, set the attention implementation to `"eager"`.
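If finer control over generation is needed than the pipeline offers, the chat template can also be applied manually. The following is a sketch, not taken from the model card, that assumes the `model` and `tokenizer` objects loaded in step 2 above:

```python
import torch

# Build the prompt with the model's chat template and call generate() directly.
# Assumes `model` and `tokenizer` from step 2 are already loaded on a CUDA device.
messages = [{"role": "user", "content": "Summarize the benefits of small language models."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```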
License
The model is released under the MIT License. For more details, refer to the license document.