Wizard-Vicuna-30B-Uncensored-GPTQ
by TheBloke

Introduction
Wizard-Vicuna-30B-Uncensored-GPTQ is a GPTQ-quantized version of Eric Hartford's Wizard-Vicuna-30B-Uncensored language model. It is designed to provide helpful, detailed, and polite responses in a conversational format, and it is published in several quantized variants to suit different hardware configurations.
Architecture
The model is built on the Llama architecture and is offered in a range of quantization options, including 2, 3, 4, 5, 6, and 8-bit precision. These options cater to different VRAM budgets and inference quality needs: lower-bit variants use less memory at some cost in output quality, while higher-bit variants preserve more accuracy.
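Each quantized variant lives on its own branch of the Hugging Face repository, so a specific variant can be selected at load time with the `revision` argument. The sketch below assumes a typical branch name for a 4-bit variant; list the repository's branches to see which variants are actually published.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ"

# Example branch name; check the repository for the branches
# that are actually available.
branch = "gptq-4bit-32g-actorder_True"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=branch,    # selects the quantization variant
    device_map="auto",  # place layers on available GPUs automatically
)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
```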
Training
The model was created by applying GPTQ quantization to the original Wizard-Vicuna-30B-Uncensored model. The wikitext dataset served as calibration data for quantization and is distinct from the dataset used to train the original model. The quantization process is designed to preserve accuracy while reducing computational and memory demands.
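For illustration only, a quantization of this kind could be reproduced with the GPTQConfig integration in the transformers library, using wikitext2 as the calibration set. This is a minimal sketch, not TheBloke's actual pipeline; the base repository id is assumed, and quantizing a 30B model this way requires substantial GPU memory and time.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Assumed id of the original full-precision model.
base_model = "ehartford/Wizard-Vicuna-30B-Uncensored"

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)

# 4-bit GPTQ quantization calibrated on wikitext2.
gptq_config = GPTQConfig(bits=4, dataset="wikitext2", tokenizer=tokenizer)

quantized = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=gptq_config,  # triggers GPTQ quantization at load time
    device_map="auto",
)

# Save the quantized weights and tokenizer for later reuse.
quantized.save_pretrained("Wizard-Vicuna-30B-Uncensored-GPTQ")
tokenizer.save_pretrained("Wizard-Vicuna-30B-Uncensored-GPTQ")
```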
Guide: Running Locally
Basic Steps
- Install Prerequisites: Ensure you have Python installed along with the transformers, optimum, and auto-gptq packages.

  ```
  pip3 install "transformers>=4.32.0" "optimum>=1.12.0"
  pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
  ```
- Model Download: Use the text-generation-webui for easy setup, or clone the repository directly, choosing the branch for the desired quantization.

  ```
  git clone --single-branch --branch main https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ
  ```
- Load the Model: Use Python to load the model and tokenizer.

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model = AutoModelForCausalLM.from_pretrained(
      "TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ",
      device_map="auto",
  )
  tokenizer = AutoTokenizer.from_pretrained(
      "TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ",
      use_fast=True,
  )
  ```
- Generate Responses: Use the model to generate text (a fuller generation sketch follows this list).

  ```python
  prompt = "Tell me about AI"
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
  output = model.generate(inputs=input_ids)
  print(tokenizer.decode(output[0]))
  ```
Cloud GPUs
For optimal performance, cloud services that provide GPUs, such as AWS, Google Cloud, or Azure, are recommended, especially for larger models or higher-precision quantizations, which require more VRAM.
License
This model is released under a custom license; refer to the original repository for the specific terms and conditions. Using an uncensored model carries responsibilities akin to handling any potentially hazardous tool: users are accountable for its deployment and for any content it generates.