Hermes 3 Llama 3.1 8B GGUF
NousResearch
Introduction
Hermes 3 is the latest iteration in the Hermes series of large language models (LLMs) developed by Nous Research. It is designed as a generalist language model with significant enhancements over its predecessor, Hermes 2. The model offers features such as advanced agentic capabilities, improved roleplaying, multi-turn conversation, and long context coherence. Hermes 3 emphasizes user alignment, providing users with powerful control and steering capabilities.
Architecture
Hermes 3 is based on the Llama-3 architecture and is available in an 8B GGUF quantized version. It builds upon the Meta-Llama-3.1-8B base model and is designed for use with the llama.cpp framework. The model supports structured output capabilities, reliable function calling, and improved code generation skills.
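As an illustration of how an application might consume function-calling output, here is a minimal sketch. It assumes the model wraps each call in <tool_call>...</tool_call> tags around a JSON object with "name" and "arguments" keys; that tag convention, and the helper name extract_tool_calls, are assumptions for this example and are not stated in the text above.

```python
import json
import re

# Assumed convention: each call is a JSON object wrapped in <tool_call> tags.
TOOL_CALL_PATTERN = re.compile(
    r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL
)


def extract_tool_calls(model_output: str) -> list:
    """Return every well-formed tool call found in the model output."""
    calls = []
    for match in TOOL_CALL_PATTERN.finditer(model_output):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of raising
        # Keep only objects that name a function to call.
        if isinstance(payload, dict) and "name" in payload:
            calls.append(payload)
    return calls
```

Calling extract_tool_calls on raw generated text returns a list of dictionaries that the application can dispatch to real functions. Malformed calls are dropped silently here; a production system would probably log them or ask the model to retry.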
Training
The Hermes 3 model was trained and fine-tuned to enhance its general capabilities, including roleplaying and conversational abilities. It leverages synthetic data and distillation techniques to improve its performance. The training process incorporates various prompt formats, such as ChatML, to enable structured multi-turn dialogue and function calling.
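To make the ChatML layout concrete, the sketch below assembles a multi-turn prompt by hand. The helper name to_chatml and the message dictionaries are illustrative; in practice the tokenizer's apply_chat_template method in transformers would normally build this string from the model's own template.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        # Each turn: start tag, role, newline, content, end tag.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize ChatML in one sentence."},
]
print(to_chatml(conversation))
```

Each turn is delimited by <|im_start|> and <|im_end|>, and the trailing open assistant turn is what prompts the model to respond rather than continue the user's text.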
Guide: Running Locally
To run Hermes 3 locally, follow these steps:
- Install Required Packages: Ensure you have pytorch, transformers, bitsandbytes, sentencepiece, protobuf, and flash-attn installed.
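- Load the Model:

    import torch
    from transformers import AutoTokenizer, LlamaForCausalLM

    tokenizer = AutoTokenizer.from_pretrained('NousResearch/Hermes-3-Llama-3.1-8B', trust_remote_code=True)

    # Load the weights in 4-bit precision (via bitsandbytes) with FlashAttention 2.
    # Note: recent transformers releases deprecate the bare load_in_4bit and
    # use_flash_attention_2 arguments in favor of quantization_config and
    # attn_implementation; adjust to match your installed version.
    model = LlamaForCausalLM.from_pretrained(
        "NousResearch/Hermes-3-Llama-3.1-8B",
        torch_dtype=torch.float16,
        device_map="auto",
        load_in_8bit=False,
        load_in_4bit=True,
        use_flash_attention_2=True
    )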
- Generate Responses:

    # ChatML prompt; the role name is followed by a newline.
    prompts = ["<|im_start|>system\nYou are a sentient, superintelligent artificial general intelligence...<|im_end|>"]

    input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids.to("cuda")
    generated_ids = model.generate(input_ids, max_new_tokens=750, temperature=0.8, repetition_penalty=1.1, do_sample=True)

    # Decode only the newly generated tokens, skipping the prompt.
    response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
    print(response)
- Alternative Setup with vLLM: Install vllm and run the model with:

    vllm serve NousResearch/Hermes-3-Llama-3.1-8B
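Once a vllm serve process is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below uses only the Python standard library. It assumes the server is listening on the default local address http://localhost:8000; the helper names build_chat_request and send_chat_request are hypothetical, and the sampling values simply mirror the generation settings used in the steps above.

```python
import json
import urllib.request

MODEL_ID = "NousResearch/Hermes-3-Llama-3.1-8B"
BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port


def build_chat_request(user_message, system_message=None,
                       max_tokens=750, temperature=0.8):
    """Build the JSON payload for a chat completion request."""
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})
    return {
        "model": MODEL_ID,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def send_chat_request(payload, base_url=BASE_URL):
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, send_chat_request(build_chat_request("Hello")) would return the generated reply as a string. The server applies the chat template itself, so the client sends plain role and content pairs rather than raw ChatML.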
Cloud GPUs
For better performance, consider using cloud-based GPUs from providers like AWS, Google Cloud, or Azure to handle the computational demands of running large models like Hermes 3.
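When choosing a GPU, a rough estimate of the memory needed for the weights alone can help. The sketch below assumes a round 8 billion parameters (taken from the model name) and counts only the weights; activations, the KV cache, and framework overhead add to this, so treat the figures as a lower bound. The function name is illustrative.

```python
def estimate_weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for model weights only, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


NUM_PARAMS = 8e9  # assumption: round figure taken from the "8B" in the model name

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = estimate_weight_memory_gib(NUM_PARAMS, bits)
    print(f"{label}: about {gib:.1f} GiB for weights")
```

The comparison shows why the 4-bit setting in the loading step matters: it cuts the weight footprint to roughly a quarter of the float16 figure.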
License
Hermes 3 is licensed under the llama3 license. Please refer to the official documentation and license agreement for detailed terms and conditions.