codebert base finetuned detect insecure code
mrm8488
Introduction
This document describes a version of CodeBERT fine-tuned to detect insecure code. The model is fine-tuned on the CodeXGLUE defect detection dataset to identify vulnerabilities in source code.
Architecture
CodeBERT is a bimodal pre-trained model for programming language and natural language tasks. It uses a Transformer-based neural architecture and is trained with a hybrid objective function that combines masked language modeling with replaced token detection. The model supports applications such as natural language code search and code documentation generation, on which its authors report state-of-the-art performance.
Training
The model is fine-tuned on the CodeXGLUE dataset for defect detection, treating the task as binary classification: given a piece of source code, predict whether it is insecure. The dataset is split into 80% for training, 10% for development, and 10% for testing. The fine-tuned model reaches an accuracy of 65.30% on the test set, which is reported as higher than previous models on this task.
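The split and the accuracy metric described above can be illustrated with a small, self-contained sketch. This is not the procedure used to build the CodeXGLUE splits. The helper names `split_dataset` and `accuracy`, the fixed seed, and the toy data are all assumptions made for illustration, as is the convention that label 1 means insecure.

```python
import random

def split_dataset(examples, seed=0):
    """Shuffle and split examples into 80% train, 10% dev, 10% test."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test

def accuracy(predictions, labels):
    """Fraction of predictions that match the 0/1 gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical toy data: (snippet, label) pairs, assuming 1 = insecure, 0 = secure
toy = [(f"snippet_{i}", i % 2) for i in range(100)]
train, dev, test = split_dataset(toy)
print(len(train), len(dev), len(test))  # 80 10 10
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

Under this definition, the reported 65.30% means that roughly 65 of every 100 test snippets receive the correct secure/insecure label.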
Guide: Running Locally
- Install Dependencies: Ensure you have Python and PyTorch installed. Install the transformers library via pip:

    pip install transformers torch
- Load the Model: Use the following Python code to load and run the model:

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch
    import numpy as np

    model_id = 'mrm8488/codebert-base-finetuned-detect-insecure-code'
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)

    # Tokenize the snippet, padding or truncating it to the model's maximum length
    inputs = tokenizer("your code here", return_tensors="pt", truncation=True, padding='max_length')
    # A label is only needed to compute the loss; it can be omitted for plain inference
    labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
    outputs = model(**inputs, labels=labels)
    loss = outputs.loss
    logits = outputs.logits

    # Print the index of the predicted class
    print(np.argmax(logits.detach().numpy()))
- Cloud GPUs: To speed up inference, consider running the model on a cloud GPU service such as AWS EC2, Google Cloud, or Azure.
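The Load the Model step prints only a class index. The following sketch shows one way to turn a pair of raw logits into a readable label and a confidence score using softmax. The helper `interpret_logits` and the `LABELS` mapping are hypothetical. The mapping assumes the CodeXGLUE defect detection convention (0 = secure, 1 = insecure) and should be checked against the model's own configuration before relying on it.

```python
import math

# Assumed label mapping (CodeXGLUE defect detection convention); verify before use
LABELS = {0: "secure", 1: "insecure"}

def interpret_logits(logits):
    """Convert a list of raw logits into (label, probability) via softmax."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)  # most probable class
    return LABELS[idx], probs[idx]

# Hypothetical logits; with the real model, pass outputs.logits[0].tolist()
label, prob = interpret_logits([-0.3, 1.2])
print(label, round(prob, 3))
```

Given the reported test accuracy of 65.30%, the resulting probability is better read as a rough ranking signal than as a calibrated estimate.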
License
The model and associated documentation are available under the MIT License. This allows for both personal and commercial use, modification, and distribution.