codebert-base-finetuned-detect-insecure-code

mrm8488

Introduction

This document describes a version of CodeBERT fine-tuned to detect insecure code. The model was fine-tuned on the CodeXGLUE defect detection dataset to identify vulnerabilities in source code.

Architecture

CodeBERT is a bimodal pre-trained model for programming language and natural language. It uses a Transformer-based neural architecture and is trained with a hybrid objective function that includes replaced token detection. It supports downstream applications such as natural language code search and code documentation generation, for which its authors report state-of-the-art performance.
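To make the replaced token detection idea concrete, here is a minimal, illustrative sketch of how the per-token targets for that objective can be derived. It is not CodeBERT's training code: the function name `rtd_labels` and the toy token lists are hypothetical, and a real setup would use a generator model to propose the replacements.

```python
def rtd_labels(original_tokens, corrupted_tokens):
    """Return 1 for each position whose token differs from the original, else 0.

    A discriminator trained with replaced token detection learns to predict
    these labels from the corrupted sequence alone.
    """
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


# Toy example: the "+" operator has been swapped for "-" (hypothetical data).
original = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
corrupted = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "-", "b"]

labels = rtd_labels(original, corrupted)
print(labels)
```

Only the swapped position is labeled 1. A position where the proposed replacement happens to equal the original token stays labeled 0, which the comparison above handles naturally.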

Training

The model was fine-tuned on the CodeXGLUE defect detection dataset. The task is framed as binary classification: given a piece of source code, predict whether it is insecure. The dataset is split into 80% for training, 10% for development, and 10% for testing. The model reaches an accuracy of 65.30% on the test set, outperforming previously reported models.
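The split proportions and the accuracy metric can be sketched as follows. This is only an illustration of the arithmetic: CodeXGLUE ships its own predefined split, and the helper names `split_80_10_10` and `accuracy`, the seed, and the toy data are assumptions made for this example.

```python
import random


def split_80_10_10(examples, seed=0):
    """Shuffle the examples and cut them into 80% / 10% / 10% partitions."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


train, dev, test = split_80_10_10(range(1000))
print(len(train), len(dev), len(test))

# Toy predictions versus gold labels (hypothetical values).
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

On a balanced binary task, always guessing one class gives roughly 50% accuracy, which is a useful reference point when reading the reported figure.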

Guide: Running Locally

  1. Install Dependencies: Make sure Python is installed, then install the transformers and torch (PyTorch) packages via pip:

    pip install transformers torch
    
  2. Load the Model: Use the following Python code to load and run the model:

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch
    
    model_id = 'mrm8488/codebert-base-finetuned-detect-insecure-code'
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    
    # CodeBERT accepts at most 512 tokens, so longer inputs are truncated
    inputs = tokenizer("your code here", return_tensors="pt", truncation=True, max_length=512)
    
    # No labels are needed for prediction; skip gradient tracking
    with torch.no_grad():
        logits = model(**inputs).logits
    
    # Predicted class index: in CodeXGLUE defect detection, 1 = insecure, 0 = secure
    print(logits.argmax(dim=-1).item())
    
  3. Cloud GPUs: To speed up inference or further fine-tuning, consider running the model on a GPU instance from a cloud provider such as AWS EC2, Google Cloud, or Azure.
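The code in step 2 prints only the index of the largest logit. A small sketch of how the two raw logits could be turned into a readable label with a confidence score is shown below. The label mapping assumes the CodeXGLUE convention (0 = secure, 1 = insecure); the helper names and the example logits are hypothetical.

```python
import math

# Assumed mapping, following the CodeXGLUE defect detection labels.
LABELS = {0: "secure", 1: "insecure"}


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret(logits):
    """Return the predicted label name and its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]


# Hypothetical logits for a single input; with the real model these would
# come from something like `logits[0].tolist()`.
label, confidence = interpret([-0.4, 1.3])
print(label, round(confidence, 3))
```

Given the modest test accuracy reported above, the confidence value is best read as a rough ranking signal for manual review rather than a calibrated probability.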

License

The model and associated documentation are available under the MIT License. This allows for both personal and commercial use, modification, and distribution.
