Introduction

CodeT5 is a pre-trained encoder-decoder Transformer model for code understanding and generation, developed by Salesforce. It exploits the code semantics conveyed by developer-assigned identifiers and supports multi-task learning. The model is particularly effective on tasks such as code defect detection, code clone detection, and code translation.

Architecture

CodeT5 employs a unified framework that handles both code understanding and code generation tasks. It introduces an identifier-aware pre-training task, which trains the model to distinguish which code tokens are identifiers and to recover them when they are masked. It also uses a bimodal dual generation task to improve the alignment between natural language and programming language (NL-PL).
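
The idea of masking identifiers and recovering them can be illustrated with a small, self-contained sketch. This is not the CodeT5 training code: the helper name `mask_identifiers` is hypothetical, the identifier list is supplied by hand (a real pipeline would derive it from a parser), and the convention that every occurrence of one identifier shares a single `<extra_id_N>` sentinel is an assumption of this sketch.

```python
import re


def mask_identifiers(code, identifiers):
    """Replace identifiers with sentinel tokens and build the recovery target.

    Assumption of this sketch: all occurrences of the same identifier share
    one sentinel, numbered in order of first appearance in `code`.
    """
    if not identifiers:
        return code, ""

    # Match any of the given names as whole words only.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in identifiers) + r")\b"
    )
    sentinels = {}  # identifier -> sentinel token

    def replace(match):
        name = match.group(1)
        if name not in sentinels:
            sentinels[name] = f"<extra_id_{len(sentinels)}>"
        return sentinels[name]

    masked = pattern.sub(replace, code)
    # The target pairs each sentinel with the identifier it hides.
    target = " ".join(f"{token} {name}" for name, token in sentinels.items())
    return masked, target


masked, target = mask_identifiers("def add(a, b): return a + b", ["add", "a", "b"])
print(masked)
print(target)
```

In this toy setup the model would receive the masked string as input and be trained to emit the target string, which is why the sentinel-to-name pairing is kept in a stable order.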

Training

The model was pre-trained on the CodeSearchNet dataset, supplemented by additional C and C# datasets collected from BigQuery. In total, approximately 8.35 million instances were used for pre-training. CodeT5 uses a code-specific BPE tokenizer built with the Hugging Face Tokenizers library. Training focuses on the model's ability to capture semantic information from code.

Guide: Running Locally

  1. Install the Transformers library, along with PyTorch, which the steps below need because they request PyTorch tensors:

    pip install transformers torch
    
  2. Load the tokenizer and model:

    from transformers import RobertaTokenizer, T5ForConditionalGeneration
    
    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
    
  3. Prepare your input:

    text = "def greet(user): print(f'hello <extra_id_0>!')"
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    
  4. Generate output:

    generated_ids = model.generate(input_ids, max_length=8)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
    
  5. Cloud GPU suggestion: for faster inference on larger workloads, consider running the model on a cloud GPU instance from a provider such as Google Cloud Platform, AWS, or Azure.
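
The steps above can be collected into two small helpers. This is a minimal sketch, not part of the Transformers API: the function names `load_codet5` and `fill_span` and the optional `device` parameter are illustrative assumptions. The library import sits inside `load_codet5` so that the helpers can be defined, and `fill_span` exercised with stand-in objects, without the library installed.

```python
def load_codet5(checkpoint="Salesforce/codet5-base"):
    """Load the tokenizer and model from step 2 (downloads weights on first use)."""
    # Imported here so that defining the helpers does not require the library.
    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    tokenizer = RobertaTokenizer.from_pretrained(checkpoint)
    model = T5ForConditionalGeneration.from_pretrained(checkpoint)
    return tokenizer, model


def fill_span(text, tokenizer, model, max_length=8, device=None):
    """Run steps 3 and 4: encode `text`, generate, and decode the prediction."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    if device is not None:
        # Hypothetical convenience: move model and inputs to, e.g., "cuda".
        model.to(device)
        input_ids = input_ids.to(device)
    generated_ids = model.generate(input_ids, max_length=max_length)
    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)
```

A typical call would be `tokenizer, model = load_codet5()` followed by `print(fill_span("def greet(user): print(f'hello <extra_id_0>!')", tokenizer, model))`. Passing `device="cuda"` assumes a CUDA-capable GPU and a matching PyTorch build.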

License

CodeT5 is released under the Apache 2.0 license.
