OpenCoder-8B-Instruct
Introduction
OpenCoder is a family of open and reproducible code large language models (LLMs), available in 1.5B and 8B base and chat variants. It supports both English and Chinese, having been pre-trained on 2.5 trillion tokens (90% raw code, 10% code-related web data) and fine-tuned on over 4.5 million high-quality supervised fine-tuning (SFT) examples. OpenCoder aims to match the performance of top-tier code LLMs and is fully open-source, releasing model weights, inference code, reproducible training data, and training protocols so researchers can build on and extend it.
Architecture
OpenCoder offers full transparency by releasing model weights, inference code, data-cleaning code, intermediate checkpoints, and 4.5 million SFT entries. Ablation studies on the data-cleaning and training process validate each stage of the pipeline, and a robust synthetic data generation process contributes to the model's strong performance across language model benchmarks.
Training
OpenCoder models are pre-trained on datasets such as the 148 GB fineweb-code-corpus and the 10 GB fineweb-math-corpus. Post-training uses opencoder-sft-stage1 (4.21 million entries) followed by opencoder-sft-stage2 (375,000 entries). Extensive data cleaning and validation throughout the pipeline help ensure high-quality model outputs.
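For readers who want to inspect these corpora, the sketch below loads one of them with the Hugging Face `datasets` library. The repo ID is an assumption (the exact Hub name may differ from the informal dataset name above), and streaming mode avoids downloading the full 148 GB corpus:

```python
# Sketch: inspect a pre-training corpus without downloading it in full.
# The repo ID below is an assumption; check the OpenCoder release for the exact name.
from datasets import load_dataset

code_corpus = load_dataset(
    "OpenCoder-LLM/fineweb-code-corpus",  # assumed Hub repo ID
    split="train",
    streaming=True,  # iterate lazily instead of fetching all 148 GB
)
print(next(iter(code_corpus)))  # show the first sample
```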
Guide: Running Locally
To run OpenCoder locally, follow these steps:
- Install Dependencies:
  Make sure you have the `transformers` library installed, along with `torch` and `accelerate` (required for `device_map="auto"`). You can do this via pip:

  ```bash
  pip install transformers torch accelerate
  ```
- Set Up Model and Tokenizer:

  ```python
  import torch
  from transformers import AutoTokenizer, AutoModelForCausalLM

  model_name = "infly/OpenCoder-8B-Instruct"
  model = AutoModelForCausalLM.from_pretrained(
      model_name,
      torch_dtype=torch.bfloat16,
      device_map="auto",
      trust_remote_code=True,
  )
  tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
  ```
- Generate Text:

  ```python
  messages = [{"role": "user", "content": "write a quick sort algorithm in python."}]
  # Move the tokenized prompt to the same device as the model.
  inputs = tokenizer.apply_chat_template(
      messages, add_generation_prompt=True, return_tensors="pt"
  ).to(model.device)
  outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)
  # Decode only the newly generated tokens, skipping the echoed prompt.
  result = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
  print(result)
  ```
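  With `do_sample=False`, decoding is greedy and deterministic, so the same prompt returns the same completion every run; set `do_sample=True` with a `temperature` if you want varied outputs.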
- Consider using cloud GPUs for faster inference when working with larger models like OpenCoder-8B: in bfloat16, the 8B model's weights alone occupy roughly 16 GB of GPU memory.
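If a large cloud GPU is not available, one workaround (not part of the original guide) is 4-bit quantized loading via `bitsandbytes`, which brings the weight footprint down to roughly 5-6 GB. A minimal sketch, assuming `bitsandbytes` is installed alongside `transformers` and `accelerate`:

```python
# Sketch: 4-bit quantized loading to fit OpenCoder-8B on a smaller GPU.
# Assumes `pip install bitsandbytes` in addition to the dependencies above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "infly/OpenCoder-8B-Instruct"
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # store weights in 4 bit, compute in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
```

Expect some quality loss relative to full bfloat16 inference; quantization trades accuracy for memory.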
License
The OpenCoder series supports commercial applications under a permissive license. Details can be found in the license file in the model repository.