Llama-3.1-Nemotron-70B-Instruct-HF
Introduction
Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries. It leads several alignment benchmarks, including Arena Hard, AlpacaEval 2 LC, and MT-Bench, outperforming frontier models such as GPT-4o and Claude 3.5 Sonnet.
Architecture
- Architecture Type: Transformer
- Network Architecture: Llama 3.1
- Input: Text (String format, max 128k tokens)
- Output: Text (String format, max 4k tokens)
- Supported Hardware: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Turing
- Operating System: Linux
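As a quick sanity check of these specifications, the model's config can be inspected without downloading the weights. The field names below are those of standard Llama configurations in the Transformers library, and the expected values in the comments are assumptions based on the Llama 3.1 70B architecture:

```python
from transformers import AutoConfig

# Fetches only the small config.json, not the ~140GB of weights.
config = AutoConfig.from_pretrained("nvidia/Llama-3.1-Nemotron-70B-Instruct-HF")

print(config.model_type)               # expected: "llama" (Llama 3.1 backbone)
print(config.max_position_embeddings)  # expected: 131072, i.e. the 128k-token context
print(config.num_hidden_layers)        # expected: 80 layers for the 70B variant
```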
Training
The model was trained with the REINFORCE algorithm in NVIDIA's NeMo Aligner framework, using the HelpSteer2 dataset of 21,362 prompt-response pairs collected to improve alignment with human preferences. Of these, 20,324 pairs were used for training and 1,038 were held out for validation.
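For intuition, the core of REINFORCE is a policy-gradient update that scales each sampled response's log-probability by its baseline-adjusted reward. The following is a minimal sketch of that estimator, not the NeMo Aligner implementation; the batch-mean baseline and the toy values are assumptions:

```python
import torch

def reinforce_loss(logprobs, rewards, baseline):
    # REINFORCE estimator: maximize E[(r - b) * log pi(y|x)].
    # logprobs: summed log-probability of each sampled response, shape (batch,)
    # rewards:  scalar reward per response from a reward model, shape (batch,)
    # baseline: variance-reducing baseline, e.g. the batch-mean reward
    advantage = rewards - baseline
    # Negate because optimizers minimize; detach the advantage so gradients
    # flow only through the policy's log-probabilities.
    return -(advantage.detach() * logprobs).mean()

# Toy usage with made-up numbers:
logprobs = torch.tensor([-42.0, -37.5, -50.2], requires_grad=True)
rewards = torch.tensor([1.2, 0.4, -0.3])
loss = reinforce_loss(logprobs, rewards, rewards.mean())
loss.backward()
```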
Guide: Running Locally
- Prerequisites: Requires 2 or more 80GB GPUs (NVIDIA Ampere or newer) and at least 150GB of free disk space (a reduced-memory alternative is sketched after this list).
- Software Requirements: Transformers library v4.44.0, PyTorch v2.4.0.
- Installation and Execution:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"

# Load the weights in bfloat16 and shard them across all visible GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r in strawberry?"
messages = [{"role": "user", "content": prompt}]

# Apply the Llama 3.1 chat template and append the generation prompt.
tokenized_message = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
)

response_token_ids = model.generate(
    tokenized_message["input_ids"].cuda(),
    attention_mask=tokenized_message["attention_mask"].cuda(),
    max_new_tokens=4096,
    pad_token_id=tokenizer.eos_token_id,
)

# Strip the prompt tokens so only the newly generated text is decoded.
generated_tokens = response_token_ids[:, len(tokenized_message["input_ids"][0]):]
generated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(generated_text)
```
- Cloud GPU Recommendation: Consider using cloud services like AWS, GCP, or Azure with compatible NVIDIA GPUs for efficient model execution.
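For setups without two 80GB GPUs, 4-bit quantization via bitsandbytes is a commonly used fallback that cuts memory to roughly a quarter at some cost in output quality. This is a sketch rather than part of NVIDIA's official instructions, and it assumes the bitsandbytes package is installed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"

# NF4 4-bit quantization with bf16 compute; needs `pip install bitsandbytes`.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

Once loaded, generation proceeds exactly as in the snippet above; NF4 with bfloat16 compute is a common default for this trade-off.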
License
By accessing this model, users agree to the Llama 3.1 terms and conditions, the acceptable use policy, and Meta's privacy policy. The model is released under the Llama 3.1 Community License Agreement.