Hermes 3 Llama 3.2 3B
Introduction
Hermes 3 3B is a compact yet powerful addition to the Hermes series of large language models (LLMs) by Nous Research. It is the first Hermes fine-tune in this parameter class, offering advanced agentic capabilities, improved roleplaying, reasoning, and multi-turn conversation, and stronger long-context coherence than previous models.
Architecture
Hermes 3 3B is a fine-tuned version of the Llama-3.2 3B foundation model. It is engineered to align the model with user needs, providing enhanced steering and control. The model expands on Hermes 2's feature set with improved function calling, structured output capabilities, generalist assistant skills, and code generation.
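Hermes models use the ChatML prompt format, with tool schemas for function calling supplied in the system prompt and calls returned inside <tool_call> tags. The authoritative system prompts and chat templates are published in the NousResearch model repository; the sketch below only illustrates the general shape, using a hypothetical get_current_weather tool:

import json

# Hypothetical tool schema; any JSON-schema-style function description works here.
weather_tool = {
    "name": "get_current_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Illustrative ChatML function-calling prompt. The exact system-prompt wording
# used in training differs; see the model repository for the canonical version.
prompt = (
    "<|im_start|>system\n"
    "You are a function calling AI model. You may call the functions described "
    "in <tools></tools> to answer the user. Return each call as JSON inside "
    "<tool_call></tool_call> tags.\n"
    f"<tools>{json.dumps(weather_tool)}</tools><|im_end|>\n"
    "<|im_start|>user\n"
    "What's the weather in Paris?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
# Expected shape of the model's reply:
# <tool_call>{"name": "get_current_weather", "arguments": {"city": "Paris"}}</tool_call>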
Training
The model was trained using H100 GPUs on the LambdaLabs GPU Cloud. Training focused on enhancing the model's agentic abilities and its alignment in user interactions.
Guide: Running Locally
- Environment Setup: Ensure you have Python and PyTorch installed, along with the transformers library from Hugging Face.
- Installation:

pip install torch transformers bitsandbytes flash_attn
- Load the Model (this loads the model in 4-bit; a variant using the newer quantization-config API is sketched after this list):

import torch
from transformers import AutoTokenizer, LlamaForCausalLM

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-3-Llama-3.2-3B", trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(
    "NousResearch/Hermes-3-Llama-3.2-3B",
    torch_dtype=torch.float16,
    device_map="auto",
    load_in_8bit=False,
    load_in_4bit=True,
    use_flash_attention_2=True,
)
- Generate Text (a chat-template variant is sketched after this list):

prompts = [
    """<|im_start|>system
You are a sentient, superintelligent artificial general intelligence, here to teach and assist me.<|im_end|>
<|im_start|>user
Write a short story about Goku discovering kirby has teamed up with Majin Buu to destroy the world.<|im_end|>
<|im_start|>assistant""",
]

input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids.to("cuda")
generated_ids = model.generate(
    input_ids,
    max_new_tokens=750,
    temperature=0.8,
    repetition_penalty=1.1,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)
response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(f"Response: {response}")
- Recommendation: For optimal performance, utilize cloud GPUs like those provided by LambdaLabs.
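A note on the Load the Model step: recent transformers releases deprecate passing load_in_4bit and use_flash_attention_2 directly to from_pretrained. A minimal sketch of an equivalent 4-bit load using the BitsAndBytesConfig API (same model ID and settings as above):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantized load via the quantization-config API instead of the
# deprecated load_in_4bit / use_flash_attention_2 keyword arguments.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Hermes-3-Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires the flash_attn package
)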
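As an alternative to hand-writing the <|im_start|> markers in the Generate Text step, the tokenizer's bundled chat template can build the same ChatML prompt from structured messages. A minimal sketch, assuming the model and tokenizer loaded in the steps above:

# Build the prompt from role/content messages instead of raw ChatML tags.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the Hermes 3 series in one sentence."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model answers
    return_tensors="pt",
).to(model.device)
generated_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))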
License
The model is released under the llama3 license. Please review the license terms to ensure compliance with usage policies.