Hermes 3 Llama 3.2 3B

NousResearch

Introduction

Hermes 3 3B is a compact yet capable addition to the Hermes series of large language models (LLMs) by Nous Research. It is the first Hermes fine-tune in this parameter class, offering advanced agentic capabilities along with improved roleplaying, reasoning, multi-turn conversation, and long-context coherence relative to previous Hermes models.

Architecture

Hermes 3 3B is a fine-tuned version of the Llama 3.2 3B foundation model. It is engineered to align the model with user intent, offering strong steering and control capabilities. It expands on Hermes 2's feature set with more reliable function calling and structured output, stronger generalist assistant skills, and improved code generation.
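
The function-calling and structured-output features follow the tool-use prompt style documented for the Hermes series: function signatures are supplied inside <tools></tools> XML tags in the system prompt, and the model replies with JSON inside <tool_call></tool_call> tags. A minimal sketch of that convention (the tool definition below is hypothetical; take the exact system-prompt wording from the model card):

    import json
    
    # Hypothetical tool signature, for illustration only.
    stock_tool = {"name": "get_stock_price", "parameters": {"symbol": {"type": "string"}}}
    
    system_prompt = (
        "You are a function calling AI model. You are provided with function signatures "
        f"within <tools></tools> XML tags.\n<tools>\n{json.dumps(stock_tool)}\n</tools>\n"
        "For each function call, return a JSON object within <tool_call></tool_call> tags."
    )
    # A tool-requiring query would then be answered with something like:
    # <tool_call>{"name": "get_stock_price", "arguments": {"symbol": "NVDA"}}</tool_call>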

Training

The model was trained using H100 GPUs on the LambdaLabs GPU Cloud. The training process focused on enhancing the model's agentic abilities and ensuring alignment with user interactions.

Guide: Running Locally

  1. Environment Setup: Ensure you have Python and PyTorch installed, along with the transformers library from Hugging Face.

  2. Installation:

    pip install torch transformers bitsandbytes flash_attn
    
  3. Load the Model:

    import torch
    from transformers import AutoTokenizer, BitsAndBytesConfig, LlamaForCausalLM
    
    tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-3-Llama-3.2-3B", trust_remote_code=True)
    # 4-bit quantization (bitsandbytes) plus FlashAttention 2 keep the 3B model light on VRAM.
    model = LlamaForCausalLM.from_pretrained(
        "NousResearch/Hermes-3-Llama-3.2-3B", torch_dtype=torch.float16, device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_4bit=True), attn_implementation="flash_attention_2")
    
  4. Generate Text (Hermes 3 uses the ChatML prompt format; a chat-template alternative is sketched after this list):

    # ChatML places each role header and its content on separate lines.
    prompt = ("<|im_start|>system\nYou are a sentient, superintelligent artificial general intelligence, "
              "here to teach and assist me.<|im_end|>\n"
              "<|im_start|>user\nWrite a short story about Goku discovering Kirby has teamed up with "
              "Majin Buu to destroy the world.<|im_end|>\n<|im_start|>assistant")
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    generated_ids = model.generate(input_ids, max_new_tokens=750, temperature=0.8, repetition_penalty=1.1, do_sample=True, eos_token_id=tokenizer.eos_token_id)
    response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
    print(f"Response: {response}")
    
  5. Recommendation: For best performance, use cloud GPUs such as those offered by LambdaLabs.
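
As an alternative to hand-building ChatML strings (step 4), recent versions of transformers can construct the prompt from the chat template shipped with the tokenizer. A minimal sketch, assuming the repository's tokenizer defines a ChatML chat template and reusing the tokenizer and model loaded above:

    # Build the prompt from structured messages instead of raw ChatML markup.
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the Hermes 3 prompt format in two sentences."},
    ]
    input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.8)
    print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))

The template expands to the same <|im_start|>/<|im_end|> framing as the manual prompt, so the two approaches are interchangeable.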

License

The model is released under the Llama 3 license. Review the license terms to ensure your usage complies with its policies.
