Llama-3.1-Nemotron-70B-Instruct


Introduction

Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries. It achieves top scores on several alignment benchmarks and ranks highly on the Chatbot Arena leaderboard.

Architecture

The model builds on the Transformer-based Llama 3.1 70B architecture. It accepts inputs of up to 128k tokens and can generate outputs of up to 4k tokens.
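For quick experimentation outside the NeMo stack, a sketch like the following can be used with the Transformers library. It assumes the converted checkpoint published as nvidia/Llama-3.1-Nemotron-70B-Instruct-HF and enough GPU memory to hold the 70B weights in bfloat16 (roughly 140 GB across devices); these details may differ from your setup.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumed repository name for the Transformers-format checkpoint;
    # the .nemo checkpoint used in the guide below is loaded differently.
    model_name = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # requires the accelerate package
    )

    messages = [{"role": "user", "content": "How many r in strawberry?"}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    # Stay well under the 4k-token output limit for a quick test.
    output_ids = model.generate(input_ids, max_new_tokens=256)
    print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))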

Training

The model was trained using the REINFORCE algorithm within the NVIDIA NeMo Aligner framework. Training involved the HelpSteer2-Preference dataset, which contains 21,362 prompt-response pairs aimed at aligning model outputs with human preferences for helpfulness, factual correctness, and coherence, with customizable complexity and verbosity. The dataset comprises both human-annotated and synthetic data, with 20,324 pairs used for training and 1,038 held out for validation.
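To inspect the preference data itself, a minimal sketch with the datasets library follows. It assumes the data is published on the Hugging Face Hub as nvidia/HelpSteer2 with prompt and response columns; the exact repository name and schema are assumptions worth verifying.

    from datasets import load_dataset

    # Assumed Hub location of the HelpSteer2 data used for alignment.
    ds = load_dataset("nvidia/HelpSteer2")
    print(ds)  # expect roughly 20,324 train and 1,038 validation rows

    example = ds["train"][0]
    print(example["prompt"][:200])
    print(example["response"][:200])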

Guide: Running Locally

To run the model locally, follow these steps:

  1. Prerequisites: Ensure you have a machine with at least four 40 GB NVIDIA GPUs (or two 80 GB GPUs) and 150 GB of free disk space.
  2. Sign Up: Register for access to the NVIDIA NeMo Framework container on the NVIDIA Developer site.
  3. API Key: Obtain an NVIDIA NGC API key by signing into your NVIDIA NGC account.
  4. Docker Login: Log into nvcr.io using your NVIDIA NGC credentials.
    docker login nvcr.io
    
  5. Pull Docker Container: Download the NeMo container.
    docker pull nvcr.io/nvidia/nemo:24.05.llama3.1
    
  6. Clone Checkpoint: Use Git LFS to clone the model checkpoint.
    git lfs install
    git clone https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct
    
  7. Run Docker Container: Execute the container with the necessary configurations.
    docker run --gpus all -it --rm --shm-size=150g -p 8000:8000 \
      -v ${PWD}/Llama-3.1-Nemotron-70B-Instruct:/opt/checkpoints/Llama-3.1-Nemotron-70B-Instruct \
      -v ${HF_HOME}:/hf_home \
      -w /opt/NeMo nvcr.io/nvidia/nemo:24.05.llama3.1
    
  8. Start Server: Deploy the model within the container.
    HF_HOME=/hf_home python scripts/deploy/nlp/deploy_inframework_triton.py \
      --nemo_checkpoint /opt/checkpoints/Llama-3.1-Nemotron-70B-Instruct \
      --model_type="llama" --triton_model_name nemotron \
      --triton_http_address 0.0.0.0 --triton_port 8000 \
      --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
    
  9. Launch Client: Once the server is ready (a readiness-probe sketch follows this list), use client code to query the model.
    python scripts/deploy/nlp/query_inframework.py -mn nemotron -p "How many r in strawberry?" -mol 1024
    
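Before launching the client in step 9, it helps to wait until the server reports ready. The sketch below is a simple readiness probe, assuming the deployment exposes Triton's standard HTTP health endpoint on the port mapped above (8000); adjust the URL if your configuration differs.

    import time

    import requests  # third-party: pip install requests

    READY_URL = "http://localhost:8000/v2/health/ready"  # standard Triton endpoint

    for _ in range(60):  # poll for up to ~10 minutes
        try:
            if requests.get(READY_URL, timeout=5).status_code == 200:
                print("Server is ready; launch the query client.")
                break
        except requests.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(10)
    else:
        print("Server never became ready; check the container logs.")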

Cloud GPUs such as NVIDIA's A100 or H100 are recommended for optimal performance.

License

By using this model, users agree to the Llama 3.1 terms and conditions, the acceptable use policy, and Meta's privacy policy, all of which are referenced in the Llama 3.1 license documentation.
