Hermes 3 Llama 3.1 405 B

NousResearch

Hermes-3-Llama-3.1-405B

Introduction

Hermes 3 405B is the latest flagship model in the Hermes series by Nous Research. It is a full parameter finetuned version of the Llama-3.1 405B foundation model, designed to provide advanced language model capabilities, including improved roleplaying, reasoning, multi-turn conversation, and structured output generation. This model focuses on user alignment, offering powerful steering capabilities and control to the end user.

Architecture

Hermes 3 builds on the capabilities of Hermes 2, with enhancements in function calling, structured output, generalist assistant capabilities, and code generation. It uses ChatML for prompt formatting, enabling structured multi-turn chat dialogues. The model is designed to align closely with user intents through system prompts and structured interaction formats.

Training

The training process for Hermes 3 involved full parameter finetuning, leveraging LambdaLabs' 1-Click Cluster for efficient training. The model is trained to support various structured prompts, including function calling and JSON mode outputs. It uses NeuralMagic's FP8 quantization to reduce VRAM requirements from over 800GB to around 430GB, compatible with the VLLM inference engine.

Guide: Running Locally

To run Hermes-3-Llama-3.1-405B locally, follow these steps:

  1. Install the required packages: PyTorch, Transformers, bitsandbytes, sentencepiece, protobuf, and flash-attn.
  2. Load the model using the AutoTokenizer and LlamaForCausalLM classes from Transformers.
  3. Configure the model for 4-bit or 8-bit loading if necessary:
    tokenizer = AutoTokenizer.from_pretrained('NousResearch/Hermes-3-Llama-3.1-405B', trust_remote_code=True)
    model = LlamaForCausalLM.from_pretrained(
        "NousResearch/Hermes-3-Llama-3.1-405B",
        torch_dtype=torch.float16,
        device_map="auto",
        load_in_8bit=False,
        load_in_4bit=True,
        use_flash_attention_2=True
    )
    
  4. Prepare and tokenize prompts for generation.
  5. Generate responses using the model's generate method.

For optimal performance, use cloud GPUs capable of supporting large models, such as those provided by AWS or Google Cloud.

License

The Hermes-3-Llama-3.1-405B model is licensed under the llama3 license, which governs its usage and distribution.

More Related APIs in Text Generation