Nemotron-Mini-4B-Instruct
Introduction
Nemotron-Mini-4B-Instruct is a compact language model developed by NVIDIA, optimized for roleplaying, retrieval-augmented generation (RAG), and function calling. It is a fine-tuned version of nvidia/Minitron-4B-Base, designed for speed and on-device deployment, and was obtained through distillation, pruning, and quantization. The model supports a context length of 4,096 tokens, is ready for commercial use, and is intended primarily for English.
Architecture
The Nemotron-Mini-4B-Instruct model features:
- Model Embedding Size: 3072
- Attention Heads: 32
- MLP Intermediate Dimension: 9216
- Grouped-Query Attention (GQA)
- Rotary Position Embeddings (RoPE)
- Architecture Type: Transformer Decoder (auto-regressive language model)
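These values can be cross-checked against the published configuration. Below is a quick sketch using the Transformers `AutoConfig` API; the attribute names follow common Hugging Face conventions and are assumptions here, since they depend on the config class registered for this model:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

# Attribute names are the usual Hugging Face conventions and may vary
# with the config class shipped for this architecture.
print(config.hidden_size)          # model embedding size: expected 3072
print(config.num_attention_heads)  # attention heads: expected 32
print(config.intermediate_size)    # MLP intermediate dimension: expected 9216
print(config.num_key_value_heads)  # fewer KV heads than attention heads implies GQA
```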
Training
The model was trained between February 2024 and August 2024, using NVIDIA's LLM compression techniques to distill and prune the larger Nemotron-4 15B model. This compression makes the model efficient enough for on-device use while preserving quality on its target tasks, such as roleplay and RAG QA.
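For intuition only: logit distillation commonly minimizes the KL divergence between softened teacher and student token distributions. The sketch below shows that generic objective in PyTorch; it is not NVIDIA's actual compression pipeline, which also involves structured pruning and retraining:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Generic logit-distillation loss: KL between softened distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Toy example: two positions over a 10-token vocabulary.
loss = distillation_loss(torch.randn(2, 10), torch.randn(2, 10))
```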
Guide: Running Locally
Basic Steps
- Install the Transformers library: Ensure you have the `transformers` library installed in your Python environment (an install command is sketched below).
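If it is missing, it can be installed from PyPI; a reasonably recent release is assumed, since support for newer NVIDIA architectures arrived in later versions:

```bash
pip install transformers
```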
- Load the Model and Tokenizer:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
```
- Apply Prompt Template: Use the recommended prompt format for best results. With recent `transformers` releases, this can be done through the tokenizer's chat template, as sketched below.
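A minimal sketch of building the `tokenized_chat` input used in the next step, assuming the model's tokenizer ships a chat template (the example message is illustrative):

```python
# Build a conversation and render it with the model's chat template.
messages = [{"role": "user", "content": "Write a short greeting."}]
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # append the assistant turn marker for generation
    return_tensors="pt",
)
```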
- Generate Responses:
```python
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```
Cloud GPUs
For faster inference, consider cloud GPU resources from providers such as AWS, Google Cloud, or Azure, all of which offer instances optimized for machine learning workloads.
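As a sketch, on a GPU instance the model can be loaded directly onto the accelerator in half precision. Note that `device_map="auto"` requires the `accelerate` package, and bfloat16 support depends on the GPU generation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Nemotron-Mini-4B-Instruct",
    torch_dtype=torch.bfloat16,  # half precision roughly halves weight memory
    device_map="auto",           # requires `accelerate`; places weights on available devices
)
```

Inputs passed to `generate()` should then live on the same device, e.g. `tokenized_chat.to(model.device)`.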
License
The model is released under the NVIDIA Community Model License; refer to the full license text for detailed terms.