llama-3.2-MEDIT-3B-o1

mkurman

Introduction

mkurman/llama-3.2-MEDIT-3B-o1 is a fine-tuned variant of the Llama 3.2 3B Instruct model designed for reasoning tasks. It uses a tagging system (<Thought> and <Output>) for chain-of-thought text generation and is intended to be run with deterministic decoding settings.

Architecture

  • Model Name: mkurman/llama-3.2-MEDIT-3B-o1
  • Type: Small Language Model (SLM)
  • Base Model: MedIT Solutions Llama 3.2 3B Instruct
  • Parameters: 3 billion
  • License: llama3.2

Intended Use Cases

  • General question answering
  • Instruction-based generation
  • Reasoning and chain-of-thought exploration

Not Recommended For

  • Sensitive medical diagnoses without expert verification
  • Highly domain-specific or regulated fields outside the model's training scope

Training

The model incorporates <Thought> and <Output> tags for structured text generation: reasoning is written inside the <Thought> section and the final answer inside the <Output> section. Fine-tuning targeted reasoning tasks where exact, reproducible answers matter, so deterministic decoding settings such as do_sample=False or temperature=0.0 are recommended for consistent output.
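
As a rough illustration (the wording below is invented and the explicit </Thought> closing tag is an assumption; only the overall layout follows the tag description above), a completion is expected to look something like:

    <Thought>
    Step-by-step reasoning about the question...
    </Thought>
    <Output>
    The final, user-facing answer.
    </Output>

Setting the stop sequence to </Output> therefore ends generation once the answer section is complete.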

Guide: Running Locally

In Ollama or LM Studio

  1. Load the GGUF File: Follow the specific instructions from Ollama or LM Studio to set up the model.
  2. Run the Model: Use the CLI command:
    ollama run hf.co/mkurman/llama-3.2-MEDIT-3B-o1
    
  3. Set Stop Sequences: Ensure the stop sequence is set to </Output>, as shown in the sketch below.
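
As one way to apply step 3 programmatically, the sketch below sends a request to a local Ollama server over its REST API with the stop sequence and deterministic settings. It assumes Ollama is running on the default port (11434) and that the request fields match your installed Ollama version; the prompt is only an example.

    import requests

    # Minimal sketch: ask a local Ollama server for a completion and stop at </Output>.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "hf.co/mkurman/llama-3.2-MEDIT-3B-o1",
            "prompt": "Write a short instagram post about hypertension in children. Finish with 3 hashtags",
            "stream": False,
            "options": {
                "temperature": 0.0,      # deterministic output, as recommended
                "stop": ["</Output>"],   # stop sequence from step 3
            },
        },
        timeout=300,
    )
    print(resp.json()["response"])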

In a Jupyter Notebook or Python Script (Transformers)

  1. Load Tokenizer and Model:
    from transformers import AutoTokenizer, AutoModelForCausalLM
    tokenizer = AutoTokenizer.from_pretrained("mkurman/llama-3.2-MEDIT-3B-o1")
    model = AutoModelForCausalLM.from_pretrained("mkurman/llama-3.2-MEDIT-3B-o1")
    
  2. Define and Encode Prompt:
    prompt = [{'role': 'user', 'content': 'Write a short instagram post about hypertension in children. Finish with 3 hashtags'}]
    # Apply the chat template, then prime the response with the <Thought> tag
    text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True) + '<Thought>\n\n'
    input_ids = tokenizer(text, return_tensors='pt', add_special_tokens=False).input_ids  # the chat template already adds the BOS token
    
  3. Generate Response:
    # Greedy (deterministic) decoding, as recommended for consistent answers
    output = model.generate(input_ids=input_ids, max_new_tokens=256, do_sample=False)
    
  4. Decode Output:
    decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)
    print(decoded_output)
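
Because the decoded string still contains the prompt and the <Thought> section, you will usually want just the final answer. A minimal post-processing sketch, assuming the answer is wrapped in <Output> and </Output> tags as described in the Training section:

    import re

    # Decode only the newly generated tokens (everything after the prompt)...
    generated = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    # ...then keep the text between <Output> and </Output> (fall back to the full text).
    match = re.search(r"<Output>(.*?)(?:</Output>|$)", generated, flags=re.DOTALL)
    answer = match.group(1).strip() if match else generated.strip()
    print(answer)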
    

Suggested Cloud GPUs

To enhance performance and manage computational demands, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
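
On a GPU instance, the model can be loaded in half precision and placed on the accelerator automatically. A minimal sketch, assuming PyTorch with a CUDA device and the accelerate package installed (required for device_map):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("mkurman/llama-3.2-MEDIT-3B-o1")
    model = AutoModelForCausalLM.from_pretrained(
        "mkurman/llama-3.2-MEDIT-3B-o1",
        torch_dtype=torch.bfloat16,  # half precision to reduce memory use
        device_map="auto",           # place layers on the GPU automatically (needs accelerate)
    )
    # Move inputs to the model's device before calling generate:
    # input_ids = input_ids.to(model.device)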

License

Refer to the Llama 3.2 Community License Agreement for details on usage rights. For citation, use the following format:

@misc{mkurman2025llama3medit3bo1,
  title={{mkurman/llama-3.2-MEDIT-3B-o1}: A fine-tuned Llama 3.2 3B Instruct model for reasoning tasks},
  author={Kurman, Mariusz},
  year={2025},
  howpublished={\url{https://huggingface.co/mkurman/llama-3.2-MEDIT-3B-o1}}
}
