CodeGen-Multi 350M
Salesforce

Introduction
CodeGen-Multi 350M is an autoregressive language model for program synthesis, developed by Salesforce and detailed in the paper "A Conversational Paradigm for Program Synthesis." The model is trained to generate executable code from English descriptions given as comment strings.
Architecture
CodeGen-Multi 350M is part of a family of models that vary in size and training data. This variant has 350 million trainable parameters and was pre-trained on a dataset spanning multiple programming languages. As an autoregressive, decoder-only transformer, it generates code snippets and completes partially written code from natural-language inputs.
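As a quick sanity check, the advertised parameter count can be read directly from the checkpoint. The snippet below is a minimal sketch, assuming the transformers and torch packages from the guide further down are installed.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-multi")

# Sum the element counts of all trainable tensors; the result should
# land near the advertised 350M figure.
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params / 1e6:.0f}M trainable parameters")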
Training
The model was initialized from CodeGen-NL 350M and further pre-trained on BigQuery, a dataset of open-source code from GitHub repositories covering multiple programming languages, including C, C++, Go, Java, JavaScript, and Python; the dataset comprises 119.2 billion tokens. Training minimized cross-entropy loss on Google TPU-v4-512 hardware, using both data and model parallelism.
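For intuition, the objective is the standard next-token cross-entropy of a causal language model. The sketch below is illustrative rather than the original training code; in the transformers API, passing the inputs as labels yields exactly this loss.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-multi")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-multi")

batch = tokenizer("def add(a, b):\n    return a + b", return_tensors="pt")

# transformers shifts the labels internally and averages the
# next-token cross-entropy over all positions.
outputs = model(**batch, labels=batch["input_ids"])
print(f"cross-entropy loss: {outputs.loss.item():.3f}")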
Guide: Running Locally
To use CodeGen-Multi 350M locally, follow these steps:
- Install Dependencies:

pip install transformers torch
- Load the Model:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-multi")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-multi")
- Generate Code:

text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
- Consider using cloud GPUs from providers such as Google Cloud, AWS, or Azure for better performance, especially when working with larger models or datasets; a device-placement sketch follows this list.
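When a GPU is available, the model and inputs can be moved onto it explicitly. This is a minimal sketch assuming a CUDA-capable device and the tokenizer and model created in the steps above.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

input_ids = tokenizer("def fibonacci(n):", return_tensors="pt").input_ids.to(device)
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))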
License
The CodeGen-Multi 350M model is distributed under the BSD-3-Clause license, a permissive open-source license that requires retention of the copyright notice and disclaims warranty and liability.