SmolLM2-360M-GGUF

QuantFactory

Introduction

SmolLM2-360M-GGUF is a quantized version of the SmolLM2-360M model, produced with llama.cpp. The SmolLM2 family comprises compact language models designed to handle a wide range of tasks while remaining efficient enough to run on-device. The 360M model offers notable improvements in instruction following, knowledge, and reasoning.
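
Because the quantized weights ship in GGUF format, they can also be run without transformers, for example through the llama-cpp-python bindings. The snippet below is a minimal sketch; the repo id and the quantized file name pattern are assumptions, so check the QuantFactory repository for the exact files available.

    from llama_cpp import Llama  # pip install llama-cpp-python

    # Repo id and filename pattern are assumed; verify them against the
    # actual repository listing before use.
    llm = Llama.from_pretrained(
        repo_id="QuantFactory/SmolLM2-360M-GGUF",
        filename="*Q4_K_M.gguf",  # glob pattern selecting a Q4_K_M quant
    )

    output = llm("Gravity is", max_tokens=50)
    print(output["choices"][0]["text"])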

Architecture

The SmolLM2 models use a transformer decoder architecture. The 360M model was pretrained on 4 trillion tokens in bfloat16 precision, drawing on a diverse mix of datasets including FineWeb-Edu, DCLM, and The Stack. Instruction-following capabilities were then strengthened through supervised fine-tuning and Direct Preference Optimization with UltraFeedback.

Training

  • Model: Transformer decoder

  • Pretraining Tokens: 4 trillion

  • Precision: Bfloat16

  • Hardware: 64 H100 GPUs

  • Software: nanotron training framework

Guide: Running Locally

  1. Installation: Install the necessary libraries using pip.

    pip install transformers accelerate
    
  2. Setup: Use the following Python code to load and run the model.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    checkpoint = "HuggingFaceTB/SmolLM2-360M"
    device = "cuda"  # use "cpu" for CPU-only machines
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
    
    # Tokenize a prompt, generate a continuation, and decode it back to text
    inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
    outputs = model.generate(inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0]))
    
  3. Multi-GPU Setup: For multi-GPU usage, make sure accelerate is installed and pass device_map="auto" when loading the model, as sketched below.
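
  A minimal sketch of that setup, assuming accelerate is installed and more than one GPU is visible; apart from device placement it mirrors the single-device code above.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    checkpoint = "HuggingFaceTB/SmolLM2-360M"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # device_map="auto" lets accelerate shard the model across visible GPUs
    model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")
    
    inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(model.device)
    outputs = model.generate(inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0]))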

  4. Cloud GPU Suggestion: If local hardware is limited, consider cloud platforms with GPU instances, such as AWS EC2, Google Cloud, or Azure.

License

SmolLM2-360M-GGUF is licensed under the Apache 2.0 License, which allows for free use, modification, and distribution of the software.
