Llama 3.3 70B Instruct FP8 Dynamic
Introduction
Llama-3.3-70B-Instruct-FP8-Dynamic is a quantized version of Meta's Llama 3.3 model, a multilingual large language model with 70 billion parameters designed for high-quality generative text tasks. The model excels in multilingual dialogue, supporting English and seven other languages, and is reported to outperform many comparable models in quality and safety.
Architecture
The model features an architecture tuned for multilingual instruction-following tasks and delivers strong results across a range of benchmarks, making it suitable for diverse AI applications. Quantization to FP8 with dynamic activation scaling roughly halves the weight memory footprint relative to 16-bit formats and enables higher serving throughput.
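The key idea behind dynamic FP8 quantization is that the scale factor is computed from the tensor's values at runtime rather than fixed ahead of time. The toy sketch below illustrates that dynamic per-tensor scaling step only; it is not a bit-accurate FP8 emulation (the rounding here is coarser than the real E4M3 format's 3-bit mantissa), and the function names are illustrative, not part of any library.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def dynamic_fp8_scale_quant(x):
    """Quantize a tensor with a scale chosen dynamically from its values.

    Returns the quantized values and the per-tensor scale. Illustrative
    only: real FP8 kernels round to the E4M3 grid, not to integers.
    """
    scale = np.abs(x).max() / FP8_E4M3_MAX  # dynamic: derived from this tensor at runtime
    q = np.clip(np.round(x / scale), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

def dequant(q, scale):
    """Map quantized values back to the original range."""
    return q * scale
```

Because the scale tracks each tensor's actual range, activations with very different magnitudes can share the narrow FP8 range without a calibration pass, which is what makes the "dynamic" variant convenient to deploy.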
Training
The model was pretrained and instruction-tuned to achieve strong generative capabilities. The quantized version exhibits an accuracy recovery of 99.67% and maintains high performance across several benchmarks, such as ARC, HellaSwag, and MMLU, in multiple languages.
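Accuracy recovery here means the fraction of the unquantized model's benchmark score that the quantized model retains. A minimal sketch, using hypothetical scores (not the model's actual benchmark numbers):

```python
def accuracy_recovery(quantized_score, baseline_score):
    """Percentage of the unquantized baseline's benchmark score retained
    by the quantized model."""
    return 100.0 * quantized_score / baseline_score

# Hypothetical illustration: a baseline score of 86.0 and a quantized
# score of 85.7 correspond to ~99.65% recovery.
print(round(accuracy_recovery(85.7, 86.0), 2))
```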
Guide: Running Locally
To run the model locally, follow these steps:
- Install vLLM: Ensure you have vLLM installed to manage and serve the model.
- Run the Server: Use the following command to initiate the server:
python -m vllm.entrypoints.openai.api_server --model cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic --max-model-len 9000 --gpu-memory-utilization 0.95
- Access the Model: You can make requests to the model using cURL:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic",
    "prompt": "San Francisco is a"
  }'
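Because vLLM exposes an OpenAI-compatible API, the same request can be issued from Python. A minimal sketch using only the standard library, assuming the server from the previous step is running on localhost:8000 (the `max_tokens` value is an illustrative choice, not required):

```python
import json
import urllib.request

# The same completion request the cURL example sends.
payload = {
    "model": "cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic",
    "prompt": "San Francisco is a",
    "max_tokens": 64,  # illustrative cap on the completion length
}

def completion_request(url="http://localhost:8000/v1/completions"):
    """POST the payload to the vLLM server and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running, `completion_request()["choices"][0]["text"]` yields the generated continuation.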
For optimal performance, consider serving on data-center GPUs such as the NVIDIA H100, which can handle heavy workloads and reach a reported throughput of 1485 tokens per second.
License
The model is released under the Meta Llama 3 License, which should be reviewed before use.