CodeLlama-13B-MORepair-GGUF
Maintained by bartowski
Introduction
The CodeLlama-13B-MORepair-GGUF model is a quantized version of the CodeLlama-13B-MORepair model, produced with the llama.cpp framework. It is designed for text generation tasks and is compatible with various inference endpoints.
Architecture
The repository provides quantizations generated with llama.cpp's imatrix option, offering a range of file sizes and quality levels to suit different hardware capabilities. Some of these quantization formats repack the model's weights to optimize inference on specific hardware, such as ARM chips or AVX2/AVX512 CPUs.
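For context, the following is a minimal sketch of how an imatrix quantization is typically produced with llama.cpp's own tools; the binary names reflect recent llama.cpp builds, and the f16 source file and calibration file names are hypothetical rather than taken from this repository:
# 1. Compute an importance matrix from a calibration dataset
llama-imatrix -m CodeLlama-13B-MORepair-f16.gguf -f calibration.txt -o imatrix.dat
# 2. Quantize the f16 model to Q4_K_M, weighting the quantization with the importance matrix
llama-quantize --imatrix imatrix.dat CodeLlama-13B-MORepair-f16.gguf CodeLlama-13B-MORepair-Q4_K_M.gguf Q4_K_M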
Training
The quantizations are derived from the original CodeLlama-13B-MORepair model on Hugging Face and use a custom calibration dataset for the importance-matrix (imatrix) computation. The quantization process aims to preserve quality while reducing the model size, making it more efficient to deploy on a variety of hardware configurations.
Guide: Running Locally
- Installation: Ensure huggingface-cli is installed via pip: pip install -U "huggingface_hub[cli]"
- Download the Model: Use the CLI to download the desired quantized model file:
huggingface-cli download bartowski/CodeLlama-13B-MORepair-GGUF --include "CodeLlama-13B-MORepair-Q4_K_M.gguf" --local-dir ./
- Select the Right File: Choose a quantization file that fits your hardware's RAM or VRAM capacity. K-quants are generally recommended for most users, while I-quants offer better performance on specific setups like cuBLAS or rocBLAS.
- Run in LM Studio: Use LM Studio for execution, which supports various hardware configurations, including ARM and AVX; a command-line alternative using llama.cpp is sketched after this list.
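For users who prefer a command-line workflow over LM Studio, the downloaded file can also be run directly with llama.cpp. A minimal sketch, assuming a recent llama.cpp build (the llama-cli binary), with a placeholder prompt and GPU layer count:
# Run the quantized model; -ngl offloads layers to the GPU when one is available
llama-cli -m ./CodeLlama-13B-MORepair-Q4_K_M.gguf -p "Fix the bug in this function:" -n 256 -ngl 35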
Cloud GPUs: Consider using cloud services with NVIDIA GPUs to leverage cuBLAS optimizations for improved performance.
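As a sketch, llama.cpp can be built with CUDA acceleration enabled on such an instance before loading the model; the GGML_CUDA flag applies to current llama.cpp releases (older releases used LLAMA_CUBLAS):
# Build llama.cpp with CUDA (cuBLAS) support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release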
License
The model is released under the Llama 2 license, which governs its use and distribution. Users should review the license terms to ensure compliance with its conditions.