SmolLM2-360M-GGUF
Introduction
SmolLM2-360M-GGUF is a quantized version of the SmolLM2-360M model, created with llama.cpp. The SmolLM2 family comprises compact language models designed to handle a wide range of tasks while remaining efficient enough to run on-device. The 360M model is notable for its improvements in instruction following, knowledge, and reasoning.
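Because the repository ships GGUF files, the model can also be run directly with llama.cpp rather than through transformers. A minimal sketch, assuming a quantized file named SmolLM2-360M.Q4_K_M.gguf (the actual filename depends on which quantization variant you download):

```bash
# Run the GGUF file with llama.cpp's CLI; the model filename is an assumption,
# so substitute the variant you actually downloaded.
./llama-cli -m SmolLM2-360M.Q4_K_M.gguf -p "Gravity is" -n 128
```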
Architecture
The SmolLM2 models use a transformer decoder architecture. The 360M model was pretrained on 4 trillion tokens in bfloat16 precision, drawing on a diverse mix of datasets including FineWeb-Edu, DCLM, and The Stack. Instruction-following capabilities are strengthened through supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) with UltraFeedback.
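To make the instruction tuning concrete, here is a minimal sketch of chatting with the instruction-tuned variant through its chat template; the checkpoint name HuggingFaceTB/SmolLM2-360M-Instruct and the example prompt are assumptions, since the guide below uses the base checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed instruct checkpoint; the guide below uses the base model instead.
checkpoint = "HuggingFaceTB/SmolLM2-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Format the conversation with the tokenizer's built-in chat template.
messages = [{"role": "user", "content": "What is gravity?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```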
Training
- Model: Transformer decoder
- Pretraining Tokens: 4 trillion
- Precision: bfloat16
- Hardware: 64 H100 GPUs
- Software: nanotron training framework
Guide: Running Locally
- Installation: Install the necessary libraries using pip.

```bash
pip install transformers accelerate
```
- Setup: Use the following Python code to load and run the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-360M"
device = "cuda"  # Use "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
- Multi-GPU Setup: For multi-GPU usage, ensure `accelerate` is installed and pass the `device_map="auto"` option to `from_pretrained`; see the sketch after this list.
- Cloud GPU Suggestion: Consider using cloud platforms with GPU support, such as AWS EC2, Google Cloud, or Azure, for more efficient processing.
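As referenced in the multi-GPU note above, here is a minimal sketch of sharding the model across available GPUs with `device_map="auto"`; the prompt and generation settings are illustrative, not taken from the model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-360M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# device_map="auto" lets accelerate place model shards across all visible GPUs,
# falling back to CPU for layers that do not fit.
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# model.device reports the device holding the first shard, which is where
# the input tensors should live.
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```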
License
SmolLM2-360M-GGUF is licensed under the Apache 2.0 License, which allows for free use, modification, and distribution of the software.