Nera_Noctis-12B-GGUF
bartowski

Introduction
Nera_Noctis-12B-GGUF is a text-generation model for English-language applications. It is a quantized version of the original Nera_Noctis-12B model, optimized for efficient inference using the llama.cpp quantization framework.
Architecture
The model leverages various quantization techniques to compress the original Nera_Noctis-12B model, allowing it to run efficiently on different hardware configurations. The quantization process uses the llama.cpp framework, specifically release b4404, and offers multiple quantization levels such as Q8_0, Q6_K, and Q4_K, each tailored for different use-case scenarios and system capabilities.
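To make the size/quality trade-off concrete, the sketch below picks the largest quantization level whose file fits into available memory. The file sizes are rough estimates for a 12B-parameter model, not official figures for this repository (check the actual file list on the model page), and pick_quant is an illustrative helper, not part of any library.

    # Illustrative sketch: choose a quant level that fits in memory.
    # Sizes below are rough estimates for a 12B model, NOT official
    # numbers for Nera_Noctis-12B-GGUF -- check the repo's file list.
    APPROX_SIZES_GB = {
        "Q8_0": 13.0,   # near-lossless, largest
        "Q6_K": 10.1,   # very high quality
        "Q4_K_M": 7.5,  # good quality/size trade-off
    }

    def pick_quant(available_gb, headroom_gb=1.5):
        """Return the largest quant whose file fits, leaving headroom
        for the KV cache and runtime overhead."""
        for name, size in sorted(APPROX_SIZES_GB.items(),
                                 key=lambda kv: kv[1], reverse=True):
            if size + headroom_gb <= available_gb:
                return name
        return None  # nothing fits; consider a smaller quant

    print(pick_quant(available_gb=12.0))  # -> "Q6_K" with these estimates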
Training
The quantized files are not retrained models; they were produced with llama.cpp's imatrix option, which uses a dedicated calibration dataset to estimate which weights matter most and allocate precision accordingly. The process focuses on balancing model size against performance, preserving high-quality output even at the smaller sizes.
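For context, the imatrix flow is a two-step process: first compute an importance matrix over calibration text, then quantize with it. The sketch below drives llama.cpp's llama-imatrix and llama-quantize tools from Python; the binary names match recent llama.cpp builds, and the model and calibration file names are placeholders, not artifacts shipped with this repository.

    import subprocess

    # Sketch of the llama.cpp imatrix quantization flow. Assumes the
    # llama-imatrix and llama-quantize binaries (from a recent
    # llama.cpp build) are on PATH; file names are placeholders.

    # 1. Collect an importance matrix over a calibration dataset.
    subprocess.run([
        "llama-imatrix",
        "-m", "Nera_Noctis-12B-f16.gguf",  # full-precision source model
        "-f", "calibration.txt",           # calibration text dataset
        "-o", "imatrix.dat",
    ], check=True)

    # 2. Quantize, using the matrix to guide precision allocation.
    subprocess.run([
        "llama-quantize",
        "--imatrix", "imatrix.dat",
        "Nera_Noctis-12B-f16.gguf",
        "Nera_Noctis-12B-Q4_K_M.gguf",
        "Q4_K_M",
    ], check=True)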
Guide: Running Locally
- Setup Environment
  - Install the huggingface_hub CLI tool:
    pip install -U "huggingface_hub[cli]"
- Download Model
  - Choose the desired quantized file based on your system's RAM and VRAM capacity. For high performance, select a model size that fits entirely within your VRAM.
  - Use the huggingface-cli to download the specific model file (a Python alternative is sketched after this list):
    huggingface-cli download bartowski/Nera_Noctis-12B-GGUF --include "Nera_Noctis-12B-Q4_K_M.gguf" --local-dir ./
- Run Model
  - Execute the model in an inference environment that supports llama.cpp quantizations, such as LM Studio, or programmatically (see the llama-cpp-python sketch after this list).
- Hardware Considerations
  - For enhanced performance, run the model on a GPU through llama.cpp's NVIDIA cuBLAS or AMD rocBLAS backends. I-quants are recommended on these GPUs because they offer better performance for their size.
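If you prefer to script the download step, the same file can be fetched with the huggingface_hub Python API. This is a minimal sketch assuming the huggingface_hub package from the setup step and enough disk space for the chosen quant.

    from huggingface_hub import hf_hub_download

    # Download one specific quantized file rather than the whole repo.
    path = hf_hub_download(
        repo_id="bartowski/Nera_Noctis-12B-GGUF",
        filename="Nera_Noctis-12B-Q4_K_M.gguf",
        local_dir="./",
    )
    print(path)  # local path to the downloaded GGUF file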
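Beyond LM Studio, the downloaded file can also be loaded programmatically. The sketch below assumes the third-party llama-cpp-python package (pip install llama-cpp-python) built with GPU support; it is one possible inference setup, not the only supported one.

    from llama_cpp import Llama

    # Load the quantized model; n_gpu_layers=-1 offloads all layers to
    # the GPU when llama-cpp-python is built with cuBLAS/rocBLAS.
    llm = Llama(
        model_path="./Nera_Noctis-12B-Q4_K_M.gguf",
        n_ctx=4096,
        n_gpu_layers=-1,
    )

    out = llm("Write a two-line poem about the night sky.", max_tokens=64)
    print(out["choices"][0]["text"])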
License
The model is distributed under an "other" license, and users should refer to the specific terms and conditions outlined by the model creator when using it.