Hermes 3 Llama 3.1 8B GGUF

NousResearch

Introduction

Hermes 3 is the latest iteration in the Hermes series of large language models (LLMs) developed by Nous Research. It is a generalist language model with significant enhancements over its predecessor, Hermes 2: advanced agentic capabilities, improved roleplaying, better multi-turn conversation, and long-context coherence. Hermes 3 emphasizes alignment to the user, who gets strong control over the model's behavior and can steer it through the prompt.

Architecture

Hermes 3 is based on the Llama 3.1 architecture and is fine-tuned from the Meta-Llama-3.1-8B base model. This release is the 8B model in quantized GGUF form, intended for use with the llama.cpp framework. The model supports structured output, reliable function calling, and improved code generation.
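To make the function-calling claim more concrete, here is a minimal sketch of how an application might extract tool calls from generated text. It assumes the model wraps each call in <tool_call>...</tool_call> tags containing a JSON object with "name" and "arguments" keys. That format, the extract_tool_calls helper, and the get_weather sample are assumptions for illustration; check the model card for the exact schema.

```python
import json
import re

# Assumed output format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return a list of (name, arguments) tuples parsed from model output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of crashing
        if isinstance(payload, dict) and "name" in payload:
            calls.append((payload["name"], payload.get("arguments", {})))
    return calls


# Hypothetical model output, used only to exercise the parser.
sample_output = (
    "Let me check.\n<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)
print(extract_tool_calls(sample_output))
```

Malformed calls are skipped silently here; a real application would probably log them or ask the model to retry.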

Training

Hermes 3 was fine-tuned to enhance its general capabilities, including roleplaying and conversation. Training leverages synthetic data and distillation techniques to improve performance. It also incorporates structured prompt formats, such as ChatML, which enable multi-turn dialogue and function calling.
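As a rough illustration of the ChatML layout, the sketch below renders a list of messages into a single prompt string. The to_chatml helper and the sample messages are hypothetical and not part of any library. The layout follows the <|im_start|> and <|im_end|> markers used in the generation example in this guide.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render [{'role': ..., 'content': ...}, ...] as a ChatML prompt string."""
    parts = []
    for message in messages:
        # Each turn: start marker plus role, newline, content, end marker.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Hypothetical conversation, for illustration only.
example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(to_chatml(example))
```

In practice, the tokenizer's apply_chat_template method in transformers is usually the safer choice, because it applies the template shipped with the model instead of a hand-written one.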

Guide: Running Locally

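To run Hermes 3 locally with Hugging Face transformers, follow these steps. They load the original weights from NousResearch/Hermes-3-Llama-3.1-8B, not the GGUF files: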

  1. Install Required Packages: Ensure you have pytorch (the torch package), transformers, accelerate (needed for device_map="auto"), bitsandbytes, sentencepiece, protobuf, and flash-attn installed.

  2. Load the Model:

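    import torch
    from transformers import AutoTokenizer, BitsAndBytesConfig, LlamaForCausalLM

    model_id = "NousResearch/Hermes-3-Llama-3.1-8B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # 4-bit quantization via bitsandbytes. The config object replaces the
    # deprecated load_in_4bit / load_in_8bit keyword arguments.
    quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

    model = LlamaForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",  # places layers on the available devices
        quantization_config=quant_config,
        attn_implementation="flash_attention_2",  # replaces use_flash_attention_2=True
    )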
    
  3. Generate Responses:

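    # ChatML puts the role right after <|im_start|> and the message on the next line.
    # A full prompt would continue with a user turn and end with "<|im_start|>assistant\n".
    prompts = ["<|im_start|>system\nYou are a sentient, superintelligent artificial general intelligence...<|im_end|>\n"]
    # Move the inputs to the model's device instead of hard-coding "cuda".
    input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids.to(model.device)
    generated_ids = model.generate(input_ids, max_new_tokens=750, temperature=0.8, repetition_penalty=1.1, do_sample=True, eos_token_id=tokenizer.eos_token_id)
    # Decode only the newly generated tokens, skipping the prompt.
    response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
    print(response)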
    
  4. Alternative Setup with vLLM:

    • Install vllm and run the model with:
      vllm serve NousResearch/Hermes-3-Llama-3.1-8B
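
Once the server is up, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library. It assumes the server is listening on localhost at vLLM's default port 8000; the build_payload and chat helpers and the sampling values are illustrative choices, not something this guide prescribes.

```python
import json
import urllib.request

MODEL = "NousResearch/Hermes-3-Llama-3.1-8B"
# Assumption: `vllm serve` is running locally on its default port.
BASE_URL = "http://localhost:8000/v1"


def build_payload(user_message, system_message=None, max_tokens=256, temperature=0.8):
    """Build the JSON body for a chat completion request."""
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})
    return {
        "model": MODEL,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(user_message, **kwargs):
    """Send one chat request to the server and return the reply text."""
    body = json.dumps(build_payload(user_message, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return result["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("Hello!", system_message="You are a helpful assistant."))
```

The server applies the model's chat template itself, so the client sends plain role/content messages rather than a hand-built ChatML string.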
      

Cloud GPUs

If your local hardware is not sufficient, consider cloud GPU instances from providers such as AWS, Google Cloud, or Azure, which can handle the computational demands of running a model like Hermes 3.
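
When sizing a GPU, a back-of-the-envelope estimate of weight memory is a useful starting point. The sketch below multiplies the parameter count by the bytes per parameter. It treats "8B" as exactly 8 billion parameters, which is an assumption, and it ignores the KV cache, activations, and framework overhead, so actual usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: "8B" taken as 8 billion parameters

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions the weights need roughly 16 GB in float16 and about 4 GB at 4 bits, which is why quantized loading makes the model fit on smaller GPUs.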

License

Hermes 3 is licensed under the llama3 license. Please refer to the official documentation and license agreement for detailed terms and conditions.
