CodeT5-small
Salesforce

Introduction
CodeT5 is a small pre-trained encoder-decoder Transformer model for code understanding and generation. It is identifier-aware: it leverages the code semantics conveyed by developer-assigned identifiers to improve performance on tasks such as code summarization, generation, translation, refinement, defect detection, and clone detection.
Architecture
CodeT5 employs a unified framework that supports both code understanding and code generation tasks. It introduces a novel identifier-aware pre-training task that trains the model to distinguish identifier tokens and to recover them when they are masked. Additionally, a bimodal dual generation task (natural language to code and code to natural language) aligns the two modalities more effectively.
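As a minimal illustrative sketch, a T5-style identifier-masked training pair could look like the strings below; the exact masking procedure and special tokens are defined in the CodeT5 paper, and this example is an assumption made purely for illustration.

```python
# Illustrative only: a masked-span pair where the masked span is an
# identifier. Sentinel tokens (<extra_id_N>) mark each masked span,
# mirroring the inference format used in the guide below.

source = "def add(a, b): return a + b"

# Encoder input: the function name is replaced by a sentinel token.
masked_input = "def <extra_id_0>(a, b): return a + b"

# Decoder target: each sentinel introduces the span it replaced.
target = "<extra_id_0> add <extra_id_1>"
```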
Training
The model was pre-trained on the CodeSearchNet dataset, supplemented with C and C# code collected via BigQuery, for a total of approximately 8.35 million pre-training instances. Inputs are preprocessed with RobertaTokenizer, backed by a code-specific BPE (byte-pair encoding) vocabulary.
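As a quick check of the tokenizer, the snippet below loads it from the Hugging Face Hub and tokenizes a small function; the exact subword split it prints depends on the shipped BPE vocabulary, so treat the output as indicative rather than fixed.

```python
from transformers import RobertaTokenizer

# The checkpoint ships with the code-specific BPE vocabulary,
# so loading it from the Hub is sufficient.
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-small')

# Inspect how a small piece of code is split into subword tokens.
tokens = tokenizer.tokenize("def greet(user): return f'hello {user}'")
print(tokens)
```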
Guide: Running Locally
- Install the Transformers library:

```bash
pip install transformers
```
- Load the model and tokenizer:

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-small')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-small')
```
- Input code and generate: prepare a code input containing a masked span and generate a completion.

```python
text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Generate a short sequence to fill the masked span.
generated_ids = model.generate(input_ids, max_length=10)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```
- Consider using cloud GPUs: for more intensive workloads, a cloud GPU (such as those offered by AWS, Google Cloud, or Azure) will substantially speed up inference and fine-tuning; see the sketch after this list.
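The following is a minimal sketch of the GPU path mentioned above, assuming PyTorch is installed with CUDA support; it reuses the model and tokenizer from the guide and falls back to CPU when no GPU is present.

```python
import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Use a CUDA device when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-small')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-small').to(device)

# Inputs must live on the same device as the model.
text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

generated_ids = model.generate(input_ids, max_length=10)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```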
License
CodeT5 is released under the Apache License 2.0, which permits broad use, modification, and redistribution.