Introduction

CodeT5 is a family of encoder-decoder language models designed for code understanding and generation, introduced in the paper "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation" by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. CodeT5-large is a checkpoint in this family with 770 million parameters; it was introduced and further developed in the paper "CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning" by Hung Le et al.

Architecture

CodeT5-large is an identifier-aware encoder-decoder model: its pretraining takes explicit account of identifiers in source code rather than treating code as plain text. The unified encoder-decoder architecture lets a single model handle both code understanding and code generation tasks across multiple programming languages.
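
As a quick way to inspect the architecture, the sketch below loads the checkpoint's configuration from the Hugging Face Hub and prints its size fields. This is a minimal sketch; it assumes only the transformers library and the public "Salesforce/codet5-large" checkpoint, and the commented meanings follow the standard T5Config fields.

    from transformers import AutoConfig
    
    # Fetch the configuration for the CodeT5-large checkpoint
    config = AutoConfig.from_pretrained("Salesforce/codet5-large")
    
    print(config.num_layers)          # encoder depth
    print(config.num_decoder_layers)  # decoder depth
    print(config.d_model)             # hidden size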

Training

CodeT5-large was pretrained on the CodeSearchNet dataset, which covers six programming languages: Ruby, JavaScript, Go, Python, Java, and PHP. Pretraining used a masked span prediction objective for 150 epochs, and the model's effectiveness was validated on the CodeXGLUE benchmark.
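
To make the masked span prediction objective concrete, the example below shows the T5-style data format it relies on: contiguous spans of the input are replaced with sentinel tokens (<extra_id_0>, <extra_id_1>, ...), and the target lists each sentinel followed by the span it replaced. This is an illustrative sketch of the format, not the actual pretraining pipeline.

    # Original code snippet
    source = "def add(a, b): return a + b"
    
    # Input: two spans masked out with sentinel tokens
    masked_input = "def <extra_id_0>(a, b): return a <extra_id_1> b"
    
    # Target: each sentinel followed by the span it replaced,
    # with a final sentinel marking the end of the sequence
    target = "<extra_id_0> add <extra_id_1> + <extra_id_2>"

The model is trained to generate the target from the masked input, which teaches it to reconstruct missing code from its surrounding context.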

Guide: Running Locally

To use CodeT5-large locally, follow these steps:

  1. Install the transformers library from Hugging Face:
    pip install transformers
    
  2. Load the model and tokenizer:
    from transformers import AutoTokenizer, T5ForConditionalGeneration
    
    # Download the pretrained tokenizer and model weights from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
    model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")
    
  3. Prepare input text and generate a sequence (a reusable helper built on this snippet follows the list):
    # Mask a span with the <extra_id_0> sentinel token
    text = "def greet(user): print(f'hello <extra_id_0>!')"
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    
    # Generate a short completion for the masked span and decode it
    generated_ids = model.generate(input_ids, max_length=8)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
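
Building on step 3, the sketch below wraps span filling in a small reusable function. The name fill_span is hypothetical, introduced here for illustration; it reuses the tokenizer and model objects loaded in step 2.

    def fill_span(code, max_length=8):
        """Predict the content of the <extra_id_0> span in `code` (hypothetical helper)."""
        input_ids = tokenizer(code, return_tensors="pt").input_ids
        generated_ids = model.generate(input_ids, max_length=max_length)
        return tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    
    # Example: fill in the masked span from step 3
    print(fill_span("def greet(user): print(f'hello <extra_id_0>!')"))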
    

For performance optimization, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
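
If a GPU is available, inference can be moved onto it as in the minimal sketch below; it assumes a CUDA-capable device and uses only standard PyTorch and transformers calls.

    import torch
    
    # Pick a CUDA device when available, otherwise fall back to CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    # Optionally halve GPU memory use with fp16 weights: model.half()
    
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    generated_ids = model.generate(input_ids, max_length=8)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))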

License

CodeT5-large is distributed under the BSD-3-Clause license, permitting use, modification, and distribution of the software with attribution.
