CodeT5-base
Salesforce
Introduction
CodeT5 is a pre-trained encoder-decoder Transformer model for code understanding and generation, developed by Salesforce. It exploits the code semantics carried by developer-assigned identifiers and is trained in a multi-task setting that covers both understanding and generation. The model performs well on tasks such as defect detection, clone detection, and code translation.
Architecture
CodeT5 uses a unified encoder-decoder framework that handles both code understanding and generation tasks. Pre-training includes a novel identifier-aware objective, which teaches the model to distinguish which code tokens are identifiers and to recover them when they are masked. A bimodal dual generation task further strengthens the alignment between natural language and programming language (NL-PL).
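As a rough illustration (not actual training code; the sentinel tokens follow the T5-style denoising convention), an identifier-masked pair for this objective might look like the following:

  # Hypothetical identifier-masked pair: the function name is replaced by a sentinel token
  source = "def <extra_id_0>(user): print(f'hello {user}!')"
  # The model is trained to recover the masked identifier
  target = "<extra_id_0> greet"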
Training
The model was pre-trained on the CodeSearchNet dataset, supplemented by additional C/C# data collected from BigQuery, for a total of approximately 8.35 million instances. CodeT5 uses a code-specific BPE tokenizer built with the Hugging Face Tokenizers library. Pre-training focuses on improving the model's ability to capture semantic information from code.
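As a quick sketch of what the code-specific tokenizer produces (assuming the transformers library and the Salesforce/codet5-base checkpoint used in the guide below are available):

  from transformers import RobertaTokenizer

  tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
  # The code-specific BPE vocabulary keeps common code constructs as compact token sequences
  print(tokenizer.tokenize("def greet(user): print(user)"))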
Guide: Running Locally
- Install the Transformers library:
  pip install transformers
- Load the tokenizer and model:
  from transformers import RobertaTokenizer, T5ForConditionalGeneration
  tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
  model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
- Prepare your input:
  text = "def greet(user): print(f'hello <extra_id_0>!')"
  input_ids = tokenizer(text, return_tensors="pt").input_ids
- Generate output:
  generated_ids = model.generate(input_ids, max_length=8)
  print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
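  # The decoded text fills the masked <extra_id_0> span with a plausible completion
  # (for example, an expression referencing `user`); the exact output may vary
  # with the library version and decoding settings.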
- Cloud GPU Suggestion: For best performance, consider using cloud GPUs such as those available on Google Cloud Platform, AWS, or Azure; a device-placement sketch of the snippet above follows this list.
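Putting the steps together on a GPU, a minimal end-to-end sketch (assuming PyTorch with a CUDA-capable device; it falls back to CPU otherwise):

  import torch
  from transformers import RobertaTokenizer, T5ForConditionalGeneration

  # Pick a GPU if one is available, otherwise fall back to CPU
  device = "cuda" if torch.cuda.is_available() else "cpu"

  tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
  model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base').to(device)

  # Same masked-span example as above, run on the selected device
  text = "def greet(user): print(f'hello <extra_id_0>!')"
  input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
  generated_ids = model.generate(input_ids, max_length=8)
  print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))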
License
CodeT5 is released under the Apache 2.0 license.