Llama-3.1-Nemotron-70B-Instruct-HF


Introduction

Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries. It performs strongly on automatic alignment benchmarks such as Arena Hard, AlpacaEval 2 LC, and MT-Bench, outperforming frontier models such as GPT-4o and Claude 3.5 Sonnet on those evaluations.

Architecture

  • Architecture Type: Transformer
  • Network Architecture: Llama 3.1
  • Input: Text (String format, max 128k tokens; see the configuration check sketched after this list)
  • Output: Text (String format, max 4k tokens)
  • Supported Hardware: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Turing
  • Operating System: Linux
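
The quoted context window can be read straight off the checkpoint's configuration without downloading the weights. A minimal sketch, assuming the Hugging Face Hub checkpoint nvidia/Llama-3.1-Nemotron-70B-Instruct-HF is reachable from your environment:

    from transformers import AutoConfig
    
    # Fetch only the model configuration (a small JSON file), not the 70B weights
    config = AutoConfig.from_pretrained("nvidia/Llama-3.1-Nemotron-70B-Instruct-HF")
    
    print(config.model_type)               # "llama"
    print(config.max_position_embeddings)  # 131072, i.e. the 128k-token context window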

Training

The model was trained using the REINFORCE algorithm within the NeMo Aligner framework, on data from HelpSteer2: 21,362 prompt-response pairs collected to improve alignment with human preferences, split into 20,324 training entries and 1,038 entries held out for validation.
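
The HelpSteer2 data behind these numbers is published on the Hugging Face Hub, so the split sizes can be checked directly. A minimal sketch, assuming the datasets library is installed and the dataset remains available under the name nvidia/HelpSteer2:

    from datasets import load_dataset
    
    # HelpSteer2 ships with train/validation splits that should match the counts above
    helpsteer2 = load_dataset("nvidia/HelpSteer2")
    print(helpsteer2["train"].num_rows)        # expected: 20324
    print(helpsteer2["validation"].num_rows)   # expected: 1038
    print(helpsteer2["train"].column_names)    # prompt, response, and per-attribute ratings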

Guide: Running Locally

  • Prerequisites: Requires 2 or more 80GB GPUs (NVIDIA Ampere or newer) and at least 150GB of free disk space.
  • Software Requirements: tested with the Transformers library v4.44.0 and PyTorch v2.4.0.
  • Installation and Execution:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    # Load the 70B checkpoint in bfloat16; device_map="auto" shards it across the available GPUs
    model_name = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    prompt = "How many r in strawberry?"
    messages = [{"role": "user", "content": prompt}]
    
    # Apply the Llama 3.1 chat template, then generate on the GPU
    tokenized_message = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True)
    response_token_ids = model.generate(tokenized_message['input_ids'].cuda(), attention_mask=tokenized_message['attention_mask'].cuda(), max_new_tokens=4096, pad_token_id=tokenizer.eos_token_id)
    
    # Drop the prompt tokens and decode only the newly generated response
    generated_tokens = response_token_ids[:, len(tokenized_message['input_ids'][0]):]
    generated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_text)
    
  • Cloud GPU Recommendation: If suitable local hardware is unavailable, consider cloud instances on AWS, GCP, or Azure with 80GB-class NVIDIA GPUs (such as A100 or H100). An alternative pipeline-based invocation is sketched below.
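
For quick experiments, the same chat flow can also be driven through the high-level text-generation pipeline, which handles chat templating, tokenization, and decoding internally. A minimal sketch under the same hardware assumptions (max_new_tokens is shortened here purely for illustration):

    import torch
    from transformers import pipeline
    
    # Build a chat-capable text-generation pipeline, sharding the model across available GPUs
    generator = pipeline(
        "text-generation",
        model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    
    messages = [{"role": "user", "content": "How many r in strawberry?"}]
    result = generator(messages, max_new_tokens=256)
    
    # With chat-style input, generated_text holds the full conversation; the last entry is the model's reply
    print(result[0]["generated_text"][-1]["content"])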

License

By accessing this model, users agree to the Llama 3.1 terms and conditions, the acceptable use policy, and Meta’s privacy policy. The applicable license is the Llama 3.1 Community License Agreement.
