CodeGen 2B Multi
Salesforce
Introduction
CodeGen is a family of autoregressive language models for program synthesis, introduced in the paper "A Conversational Paradigm for Program Synthesis" by Erik Nijkamp et al. The models are designed to generate executable code from English-language prompts, making them a practical tool for software development and code generation.
Architecture
The CodeGen architecture is available in multiple configurations, distinguished by pre-training data variant (NL, Multi, Mono) and model size (350M, 2B, 6B, 16B). The featured model, CodeGen-Multi 2B, is initialized from CodeGen-NL 2B and further pre-trained on a dataset spanning multiple programming languages. It comprises 2 billion trainable parameters.
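As a quick sanity check, the checkpoint's configuration can be inspected without downloading the full weights. The snippet below is a minimal sketch; the specific field names (n_layer, n_head, n_embd, vocab_size) are assumed from the standard CodeGen configuration class in Transformers.

from transformers import AutoConfig

# Fetch only the configuration file, not the model weights.
config = AutoConfig.from_pretrained("Salesforce/codegen-2B-multi")

# Field names assumed from the CodeGen config class in Transformers.
print(config.n_layer, config.n_head, config.n_embd, config.vocab_size)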
Training
The CodeGen-Multi 2B model was initialized from CodeGen-NL 2B and subsequently pre-trained on a BigQuery dataset of GitHub repositories spanning several programming languages, including C, C++, Go, Java, JavaScript, and Python. Training processed 119.2 billion tokens with a cross-entropy loss that maximizes the likelihood of the input sequences. The models were trained on Google’s TPU-v4-512 hardware using data and model parallelism, as detailed in the referenced paper.
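To make the objective concrete, the sketch below shows how the same next-token cross-entropy loss can be evaluated with the released checkpoint in PyTorch. It is an illustration of the likelihood objective only, not a reproduction of the TPU training setup described in the paper.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-multi")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-multi")

sample = "def add(a, b):\n    return a + b"
batch = tokenizer(sample, return_tensors="pt")

# Passing the input ids as labels makes the model compute the shifted
# next-token cross-entropy, i.e. the negative log-likelihood of the sequence.
with torch.no_grad():
    outputs = model(input_ids=batch.input_ids, labels=batch.input_ids)
print(outputs.loss.item())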
Guide: Running Locally
To run the CodeGen model locally, follow these steps:
- Install the Hugging Face Transformers library:
pip install transformers
- Load the model and tokenizer in your Python script:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-multi")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-multi")
- Prepare your input text and generate code:
text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
For optimal performance, consider using cloud-based GPU resources from providers like AWS, Google Cloud, or Azure.
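If a CUDA-capable GPU is available, loading the checkpoint in half precision roughly halves its memory footprint. The snippet below is a minimal sketch of this setup; the memory saving noted in the comment is an approximation rather than an official requirement.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-multi")

# float16 weights take roughly half the memory of float32 (approximate figure).
model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/codegen-2B-multi", torch_dtype=torch.float16
).to("cuda")

input_ids = tokenizer("def hello_world():", return_tensors="pt").input_ids.to("cuda")
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))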
License
The CodeGen model is released under the BSD-3-Clause License, permitting use, distribution, and modification with certain conditions.