CodeT5-large
Salesforce

Introduction
CodeT5 is a family of encoder-decoder language models for code understanding and generation, introduced in the paper "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation" by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. CodeT5-large is the 770-million-parameter checkpoint in this family; it was introduced in the follow-up paper "CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning" by Hung Le et al.
Architecture
CodeT5-large is a Transformer encoder-decoder model. The CodeT5 family is described as identifier-aware because its pretraining is designed to exploit the identifiers (variable and function names) that carry much of a program's semantics, rather than treating source code as plain text. A single unified model handles multiple programming languages and both code understanding and code generation tasks.
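As a quick sanity check of the architecture, the published checkpoint can be inspected with transformers. This is a minimal sketch; the layer counts are simply read from the model's released config:

    from transformers import T5ForConditionalGeneration

    model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")

    # Total parameter count; should come out at roughly 770 million.
    n_params = sum(p.numel() for p in model.parameters())
    print(f"parameters: {n_params / 1e6:.0f}M")

    # Encoder and decoder depths from the model config.
    print("encoder layers:", model.config.num_layers)
    print("decoder layers:", model.config.num_decoder_layers)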
Training
CodeT5-large was pretrained using the CodeSearchNet dataset, which includes code in six programming languages: Ruby, JavaScript, Go, Python, Java, and PHP. The pretraining involved a masked span prediction objective over 150 epochs. The model's effectiveness was validated using the CodeXGLUE benchmark.
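To make the objective concrete, here is a sketch of the T5-style masked span prediction format: spans in the source are replaced with sentinel tokens such as <extra_id_0>, and the target reconstructs the hidden spans. The code line below is invented for illustration and is not taken from the actual pretraining data:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")

    # Hypothetical original line: "def add(a, b): return a + b"
    # The span "return" is masked out with a sentinel token:
    corrupted_input = "def add(a, b): <extra_id_0> a + b"
    # The target replays each sentinel followed by the span it hid:
    target = "<extra_id_0> return <extra_id_1>"

    # Sentinel tokens are part of the tokenizer's vocabulary.
    print(tokenizer.tokenize(corrupted_input))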
Guide: Running Locally
To use CodeT5-large locally, follow these steps:
- Install the transformers library from Hugging Face:

    pip install transformers
- Load the model and tokenizer:
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
    model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")
- Prepare input text and generate a sequence:
text = "def greet(user): print(f'hello <extra_id_0>!')" input_ids = tokenizer(text, return_tensors="pt").input_ids generated_ids = model.generate(input_ids, max_length=8) print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
For performance optimization, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
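For example, if a CUDA GPU is available, moving the model and its inputs onto it is usually enough to speed up generation noticeably. A minimal sketch reusing the example above:

    import torch
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
    model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large").to(device)

    text = "def greet(user): print(f'hello <extra_id_0>!')"
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

    generated_ids = model.generate(input_ids, max_length=8)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))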
License
CodeT5-large is distributed under the BSD-3-Clause license, permitting use, modification, and distribution of the software with attribution.