CodeBERTa-small-v1
Introduction
CodeBERTa-small-v1 is a RoBERTa-like Transformer model designed for code representation. It is trained on the CodeSearchNet dataset, which spans six programming languages: Go, Java, JavaScript, PHP, Python, and Ruby. The model uses a byte-level BPE tokenizer trained on code, which encodes code sequences efficiently.
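As a quick illustration of the tokenizer, the sketch below loads it through the standard Transformers API and encodes a short, made-up Python function; the sample code is only an example.

  from transformers import AutoTokenizer

  # Load CodeBERTa's byte-level BPE tokenizer from the Hugging Face Hub
  tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")

  # A made-up code snippet to encode
  code = "def add(a, b):\n    return a + b"

  encoding = tokenizer(code)
  print(encoding["input_ids"])                                   # token ids, including special tokens
  print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # the corresponding tokens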
Architecture
CodeBERTa-small-v1 is structured as a 6-layer Transformer model with 84 million parameters, similar to DistilBERT. It employs the default initialization settings and is trained from scratch on approximately 2 million functions for 5 epochs.
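These figures can be checked directly from the published checkpoint; the snippet below is a minimal sketch that reads the model configuration and counts parameters with the standard Transformers API.

  from transformers import AutoConfig, AutoModelForMaskedLM

  config = AutoConfig.from_pretrained("huggingface/CodeBERTa-small-v1")
  print(config.num_hidden_layers)  # expected: 6

  model = AutoModelForMaskedLM.from_pretrained("huggingface/CodeBERTa-small-v1")
  print(f"{model.num_parameters():,} parameters")  # roughly 84 million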
Training
The model was trained using the CodeSearchNet dataset from GitHub. Because the dataset and tokenizer are built specifically for code, encoded sequences come out 33% to 50% shorter than those produced by natural-language tokenizers such as GPT-2's or RoBERTa's. The training process is documented with TensorBoard.
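To see this effect on sequence length, one can compare the number of tokens produced by CodeBERTa's tokenizer against a natural-language tokenizer such as GPT-2's. The sketch below uses a made-up code sample; the exact ratio varies with the snippet being encoded.

  from transformers import AutoTokenizer

  code_tok = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")
  gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

  # Any code snippet works here; this one is a made-up example
  sample = "public function get($key) { return $this->values[$key]; }"

  print(len(code_tok(sample)["input_ids"]))  # typically fewer tokens on code
  print(len(gpt2_tok(sample)["input_ids"]))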
Guide: Running Locally
To run CodeBERTa-small-v1 locally:
- Install the Transformers library:

  pip install transformers
- Load the model and tokenizer:

  from transformers import pipeline

  fill_mask = pipeline(
      "fill-mask",
      model="huggingface/CodeBERTa-small-v1",
      tokenizer="huggingface/CodeBERTa-small-v1",
  )
- Use the model to perform tasks like masked language modeling (a sketch of inspecting the output follows this list):

  PHP_CODE = """
  public static <mask> set(string $key, $value) {
      if (!in_array($key, self::$allowedKeys)) {
          throw new \InvalidArgumentException('Invalid key given');
      }
      self::$storedValues[$key] = $value;
  }
  """.lstrip()

  fill_mask(PHP_CODE)
- Consider using cloud GPUs: for training or heavier workloads, cloud GPU services such as Google Cloud, AWS, or Azure can provide better performance.
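The fill-mask pipeline returns a list of candidate completions for the <mask> token. Assuming the fill_mask pipeline and PHP_CODE defined in the steps above, a minimal way to inspect the predictions is:

  for prediction in fill_mask(PHP_CODE):
      # Each entry contains the predicted token string and its score
      print(prediction["token_str"], round(prediction["score"], 3))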
License
The terms of use for CodeBERTa-small-v1 follow the licensing terms listed on its Hugging Face model page, as with other models and datasets hosted on the platform. Review these terms before using the model in personal or commercial projects.