CodeGen 350M Mono
Salesforce
Introduction
CodeGen is a series of autoregressive language models designed for program synthesis, as described in the paper "A Conversational Paradigm for Program Synthesis." The models are available in different pre-training data variants (NL, Multi, Mono) and model sizes (350M, 2B, 6B, 16B). CodeGen-Mono 350M, specifically, is pre-trained on Python programming data.
Architecture
CodeGen-Mono 350M is initialized from CodeGen-Multi 350M and further pre-trained on a Python programming language dataset. It contains 350 million trainable parameters and is designed to generate code from natural language prompts.
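As a quick sanity check of the model size, the parameter count can be inspected after loading the published checkpoint. The snippet below is a minimal sketch using the Hugging Face Transformers API; the rough "350M" figure is the expected order of magnitude, not an exact count.
from transformers import AutoModelForCausalLM

# Load the published checkpoint and count trainable parameters
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{num_params / 1e6:.0f}M trainable parameters")  # roughly 350M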
Training
This model was pre-trained on the BigPython dataset, which consists of 71.7 billion tokens of Python code. Training used a cross-entropy loss to maximize the likelihood of the sequential inputs, and ran on Google's TPU-v4-512 hardware with both data and model parallelism.
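To make the objective concrete, the sketch below illustrates next-token (causal) cross-entropy training in plain PyTorch. It is an illustrative assumption, not the actual CodeGen training code, and it omits the TPU data and model parallelism used in practice.
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, input_ids):
    # Shift so each position predicts the next token in the sequence
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    # Cross-entropy over the vocabulary maximizes the likelihood of the data
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )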
Guide: Running Locally
To use CodeGen-Mono 350M locally:
- Install Transformers: Ensure you have the Hugging Face Transformers library installed.
- Load the Model: Use the AutoTokenizer and AutoModelForCausalLM classes to load the model.
- Generate Code: Input a prompt in the form of a comment string and use the generate method to produce code, as shown below.
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

# Prompt the model with the start of a function definition
text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Generate up to 128 tokens and decode the completed code
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
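For less deterministic completions, generate also supports sampling. The values below are illustrative defaults rather than settings recommended by the model card.
# Sample with temperature and nucleus filtering instead of greedy decoding
generated_ids = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    max_length=128,
)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))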
Cloud GPUs
Consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure for efficient model inference, especially for larger models.
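On a CUDA-capable instance, moving the model and inputs to the GPU (optionally loading weights in half precision) is usually enough for this model size. The snippet below is a minimal sketch that builds on the earlier example and assumes PyTorch with CUDA is available.
import torch

# Load in float16 to reduce memory, then move the model and inputs to the GPU
model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/codegen-350M-mono", torch_dtype=torch.float16
).to("cuda")
input_ids = tokenizer("def hello_world():", return_tensors="pt").input_ids.to("cuda")
generated_ids = model.generate(input_ids, max_length=128)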
License
The CodeGen-Mono 350M model is licensed under the BSD 3-Clause License.