Nemotron-Mini-4B-Instruct
Introduction
Nemotron-Mini-4B-Instruct is a compact language model developed by NVIDIA, optimized for roleplaying, retrieval-augmented generation (RAG), and function calling. It is a fine-tuned version of nvidia/Minitron-4B-Base, designed for speed and on-device deployment, and was obtained through distillation, pruning, and quantization. The model supports a context length of 4,096 tokens, is ready for commercial use, and is intended primarily for English.
Architecture
The Nemotron-Mini-4B-Instruct model features:
- Model Embedding Size: 3072
- Attention Heads: 32
- MLP Intermediate Dimension: 9216
- Grouped-Query Attention (GQA)
- Rotary Position Embeddings (RoPE)
- Architecture Type: Transformer Decoder (auto-regressive language model)
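These values can be cross-checked against the published configuration. Below is a quick sketch using the Transformers `AutoConfig` API; the attribute names follow common Hugging Face conventions and are assumptions here, since they depend on the config class registered for this model:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")

# Attribute names are the usual Hugging Face conventions and may vary
# with the config class shipped for this architecture.
print(config.hidden_size)          # model embedding size: expected 3072
print(config.num_attention_heads)  # attention heads: expected 32
print(config.intermediate_size)    # MLP intermediate dimension: expected 9216
print(config.num_key_value_heads)  # fewer KV heads than attention heads implies GQA
```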
Training
The model was trained between February 2024 and August 2024, using NVIDIA's LLM compression techniques to distill and prune the larger Nemotron-4 15B model. This compression makes the model efficient enough for on-device use while preserving quality on its target tasks, such as roleplay and RAG QA.
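For intuition only: logit distillation commonly minimizes the KL divergence between softened teacher and student token distributions. The sketch below shows that generic objective in PyTorch; it is not NVIDIA's actual compression pipeline, which also involves structured pruning and retraining:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Generic logit-distillation loss: KL between softened distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Toy example: two positions over a 10-token vocabulary.
loss = distillation_loss(torch.randn(2, 10), torch.randn(2, 10))
```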
Guide: Running Locally
Basic Steps
- Install the Transformers library: Ensure you have the `transformers` library installed in your Python environment (an install command is sketched below).
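If it is missing, it can be installed from PyPI; a reasonably recent release is assumed, since support for newer NVIDIA architectures arrived in later versions:

```bash
pip install transformers
```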
- Load the Model and Tokenizer:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
```
- Apply Prompt Template: Use the recommended prompt format for best results. With recent `transformers` releases, this can be done through the tokenizer's chat template, as sketched below.
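A minimal sketch of building the `tokenized_chat` input used in the next step, assuming the model's tokenizer ships a chat template (the example message is illustrative):

```python
# Build a conversation and render it with the model's chat template.
messages = [{"role": "user", "content": "Write a short greeting."}]
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # append the assistant turn marker for generation
    return_tensors="pt",
)
```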
- Generate Responses:
```python
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```
Cloud GPUs
For faster inference, consider cloud GPU resources from providers such as AWS, Google Cloud, or Azure, all of which offer instances optimized for machine learning workloads.
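As a sketch, on a GPU instance the model can be loaded directly onto the accelerator in half precision. Note that `device_map="auto"` requires the `accelerate` package, and bfloat16 support depends on the GPU generation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Mini-4B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Nemotron-Mini-4B-Instruct",
    torch_dtype=torch.bfloat16,  # half precision roughly halves weight memory
    device_map="auto",           # requires `accelerate`; places weights on available devices
)
```

Inputs passed to `generate()` should then live on the same device, e.g. `tokenized_chat.to(model.device)`.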
License
The model is released under the NVIDIA Community Model License; refer to the full license text for detailed terms.