SmolLM2-360M-GGUF

QuantFactory

Introduction

SmolLM2-360M-GGUF is a quantized version of the SmolLM2-360M model, produced with llama.cpp. The SmolLM2 family comprises compact language models designed to handle a wide range of tasks while remaining efficient enough to run on-device. The 360M model offers notable improvements in instruction following, knowledge, and reasoning.
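
Because the quantized weights ship in GGUF format, they can also be run without transformers, for example through the llama-cpp-python bindings. The snippet below is a minimal sketch; the repo id and the quantized file name pattern are assumptions, so check the QuantFactory repository for the exact files available.

    from llama_cpp import Llama  # pip install llama-cpp-python

    # Repo id and filename pattern are assumed; verify them against the
    # actual repository listing before use.
    llm = Llama.from_pretrained(
        repo_id="QuantFactory/SmolLM2-360M-GGUF",
        filename="*Q4_K_M.gguf",  # glob pattern selecting a Q4_K_M quant
    )

    output = llm("Gravity is", max_tokens=50)
    print(output["choices"][0]["text"])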

Architecture

The SmolLM2 models use a transformer decoder architecture. The 360M model was pretrained on 4 trillion tokens in bfloat16 precision, drawing on a diverse mix of datasets including FineWeb-Edu, DCLM, and The Stack. Instruction-following capabilities were then strengthened through supervised fine-tuning and Direct Preference Optimization with UltraFeedback.

Training

  • Model: Transformer decoder

  • Pretraining Tokens: 4 trillion

  • Precision: Bfloat16

  • Hardware: 64 H100 GPUs

  • Software: nanotron training framework

Guide: Running Locally

  1. Installation: Install the necessary libraries using pip.

    pip install transformers accelerate
    
  2. Setup: Use the following Python code to load and run the model.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    checkpoint = "HuggingFaceTB/SmolLM2-360M"
    device = "cuda"  # use "cpu" for CPU-only machines
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
    
    # Tokenize a prompt, generate a continuation, and decode it back to text
    inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
    outputs = model.generate(inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0]))
    
  3. Multi-GPU Setup: For multi-GPU usage, make sure accelerate is installed and pass device_map="auto" when loading the model, as sketched below.
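
  A minimal sketch of that setup, assuming accelerate is installed and more than one GPU is visible; apart from device placement it mirrors the single-device code above.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    checkpoint = "HuggingFaceTB/SmolLM2-360M"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # device_map="auto" lets accelerate shard the model across visible GPUs
    model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")
    
    inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(model.device)
    outputs = model.generate(inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0]))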

  4. Cloud GPU Suggestion: If local hardware is limited, consider cloud platforms with GPU instances, such as AWS EC2, Google Cloud, or Azure.

License

SmolLM2-360M-GGUF is licensed under the Apache 2.0 License, which allows for free use, modification, and distribution of the software.
