gemma-2-27b-it-FP8-Dynamic

ThatsGroes

Introduction

gemma-2-27b-it-FP8-Dynamic is a quantized version of the original google/gemma-2-27b-it model. It was produced with FP8 dynamic quantization, with support from Arrow and Nvidia through the Danish Data Science Community.

Architecture

The model applies the W8A8 FP8 quantization scheme (8-bit weights, 8-bit activations) to the google/gemma-2-27b-it architecture. A quantization modifier targets the linear layers while ignoring components such as the language model head (lm_head), which remains in its original precision.
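
As a rough illustration, the recipe described above corresponds to a modifier of the following shape in llm-compressor; import paths have moved between llm-compressor releases, so treat this as a sketch rather than the exact script used:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8 dynamic (W8A8) recipe: quantize all Linear layers, leave the
# language model head (lm_head) in its original precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)
```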

Training

The quantization was performed with a script that applies one-shot quantization using a single recipe: the model is loaded with the SparseAutoModelForCausalLM class from llmcompressor.transformers, and a QuantizationModifier configures the FP8 dynamic scheme. Because activation scales are computed dynamically at runtime, no calibration dataset is required.
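
A minimal sketch of such a script, modeled on the standard llm-compressor FP8 example (the SparseAutoModelForCausalLM wrapper and the oneshot import have shifted between llm-compressor versions, so adjust to your installed release):

```python
from transformers import AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

MODEL_ID = "google/gemma-2-27b-it"

# Load the original model and tokenizer.
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic recipe: quantize the Linear layers, skip the lm_head.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# One-shot, data-free quantization: activation scales are computed at
# runtime, so no calibration dataset is passed.
oneshot(model=model, recipe=recipe)

# Save the quantized model and tokenizer.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```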

Guide: Running Locally

  1. Clone the Repository: Download the model files from the Hugging Face repository.
  2. Install Dependencies: Install the required libraries, including transformers and llmcompressor.
  3. Load the Model: Load the model with SparseAutoModelForCausalLM.from_pretrained.
  4. Apply Quantization: Configure and apply the quantization settings as in the Training section above.
  5. Save the Model: Save the quantized model and tokenizer to a directory (a serving sketch follows this list).
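
To actually run the saved (or downloaded) checkpoint, an FP8-capable runtime such as vLLM can load compressed-tensors checkpoints directly. A minimal sketch, assuming the checkpoint sits in a local directory named as in the Training script above (the path is an assumption, not part of this repository):

```python
from vllm import LLM, SamplingParams

# Hypothetical local path to the quantized checkpoint (an assumption;
# substitute the directory you saved or cloned the model into).
llm = LLM(model="./gemma-2-27b-it-FP8-Dynamic")

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(
    ["Summarize FP8 dynamic quantization in one sentence."], params
)
print(outputs[0].outputs[0].text)
```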

For optimal performance when working with a 27B-parameter model like this one, it is recommended to use cloud GPUs, such as those provided by AWS or Google Cloud.

License

The model is distributed under the Gemma license. Ensure compliance with all terms and conditions outlined in that license.
