Introduction

CodeT5 is a pre-trained encoder-decoder Transformer model for code understanding and generation; Salesforce/codet5-small is its small variant. The model is identifier-aware: it leverages the code semantics conveyed by developer-assigned identifiers to improve performance across tasks such as code summarization, code generation, translation, refinement, defect detection, and clone detection.

Architecture

CodeT5 employs a unified framework that supports both understanding and generation tasks on code. It introduces a novel identifier-aware pre-training objective that trains the model to distinguish which code tokens are identifiers and to recover them when they are masked. In addition, a bimodal dual generation task (generating code from natural language and natural language from code) aligns the two modalities more effectively.
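
To make the identifier-aware objective concrete, below is a rough sketch of what a masked training pair might look like. This is an illustration, not the exact pipeline from the paper: the sentinel-token format follows the standard T5 convention, every occurrence of the same identifier shares one sentinel, and the target lists the original identifiers after their sentinels.

    # Illustrative only: identifiers in the source are replaced with T5-style
    # sentinel tokens, and the model learns to recover them in the target.
    original = "def greet(user): print(f'hello {user}!')"
    masked = "def <extra_id_0>(<extra_id_1>): print(f'hello {<extra_id_1>}!')"
    target = "<extra_id_0> greet <extra_id_1> user <extra_id_2>"
    print(masked, "->", target)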

Training

The model was pre-trained on the CodeSearchNet dataset, supplemented with additional C and C# data collected via BigQuery, for a total of approximately 8.35 million instances. Preprocessing uses a code-specific BPE vocabulary, which is loaded through the Hugging Face RobertaTokenizer class.
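
As a quick check of the tokenizer, the snippet below tokenizes a short function; the exact subword splits depend on the learned BPE vocabulary, so treat the output as illustrative.

    from transformers import RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-small')

    # Inspect how the code-specific BPE vocabulary splits a code snippet
    tokens = tokenizer.tokenize("def greet(user): print(user)")
    print(tokens)
    print(tokenizer.convert_tokens_to_ids(tokens))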

Guide: Running Locally

  1. Install Transformers Library: Ensure you have the transformers library installed.

    pip install transformers
    
  2. Load the Model and Tokenizer: Use the following Python code to load the tokenizer and model.

    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    # The checkpoint ships a code-specific BPE vocabulary, loaded via RobertaTokenizer
    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-small')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-small')
    
  3. Input Code and Generate: Mark the span to fill in with the <extra_id_0> sentinel token, then generate a completion for it.

    # <extra_id_0> is a T5 sentinel token marking the span to predict
    text = "def greet(user): print(f'hello <extra_id_0>!')"
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    # Generate a short completion for the masked span (at most 10 tokens)
    generated_ids = model.generate(input_ids, max_length=10)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
    
  4. Consider Using Cloud GPUs: For more intensive workloads, consider cloud-based GPUs such as those offered by AWS, Google Cloud, or Azure; a minimal GPU-inference sketch follows below.
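
A minimal sketch of GPU inference, assuming PyTorch is installed with CUDA support (the device handling below is standard PyTorch, not specific to CodeT5); it falls back to CPU when no GPU is detected:

    import torch
    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    # Use a CUDA device when available (e.g. on a cloud GPU instance)
    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-small')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-small').to(device)

    text = "def greet(user): print(f'hello <extra_id_0>!')"
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    generated_ids = model.generate(input_ids, max_length=10)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))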

License

CodeT5 is released under the Apache License 2.0, allowing for wide use and modification.
