SecureBERT

ehsanaghaei

Introduction

SecureBERT is a domain-specific language model based on RoBERTa, designed to understand and represent cybersecurity text. Trained on a large corpus of in-domain text collected from online resources, it outperforms existing models such as RoBERTa, SciBERT, and SecBERT on tasks such as masked word prediction and general English language understanding.

Architecture

SecureBERT is built on the RoBERTa architecture and further pretrained on cybersecurity-specific data. This makes it an effective base model for various downstream tasks, including text classification, named entity recognition (NER), sequence-to-sequence tasks, and question answering (QA).
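
As a minimal sketch of using the checkpoint as a base for one such downstream task, the snippet below loads SecureBERT with a new, randomly initialized classification head; the num_labels value and example sentence are illustrative assumptions rather than anything from the model card.

    from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

    # Load the SecureBERT encoder with an untrained classification head.
    # num_labels=2 is an illustrative assumption, e.g. benign vs. malicious text.
    tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT")
    model = RobertaForSequenceClassification.from_pretrained(
        "ehsanaghaei/SecureBERT", num_labels=2
    )

    # The head must be fine-tuned on labeled data before its outputs are meaningful.
    inputs = tokenizer("Suspicious PowerShell activity detected on host.", return_tensors="pt")
    logits = model(**inputs).logits  # shape (1, num_labels)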

Training

The model is trained using a large corpus of cybersecurity-related text, enhancing its ability to predict masked words and understand context within cybersecurity domains. This training process focuses on improving its performance in both cybersecurity-specific tasks and general language understanding.
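
One quick way to see this masked word prediction objective in action is the Hugging Face fill-mask pipeline, sketched below; the example sentence is an illustrative assumption.

    from transformers import pipeline

    # Predict the most likely tokens for the "<mask>" position.
    fill_mask = pipeline("fill-mask", model="ehsanaghaei/SecureBERT")

    # RoBERTa-style models use "<mask>" as the mask token.
    for prediction in fill_mask("The attacker exfiltrated sensitive <mask> from the server."):
        print(prediction["token_str"], prediction["score"])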

Guide: Running Locally

To use SecureBERT locally, follow these steps:

  1. Install Dependencies:

    pip install transformers torch tokenizers
    
  2. Load the Model:

    from transformers import RobertaTokenizer, RobertaModel
    import torch
    
    # Load the pretrained tokenizer and encoder from the Hugging Face Hub.
    tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT")
    model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT")
    
    # Tokenize the input and run a forward pass.
    inputs = tokenizer("This is SecureBERT!", return_tensors="pt")
    outputs = model(**inputs)
    
    # Contextual embeddings for each token, shape (batch, seq_len, hidden_size).
    # See the sentence-embedding sketch after these steps for one way to pool them.
    last_hidden_states = outputs.last_hidden_state
    
  3. Fill Mask Example:

    from transformers import RobertaTokenizerFast, RobertaForMaskedLM
    import torch
    
    tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT")
    model = RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT")
    
    def predict_mask(sent, tokenizer, model, topk=10):
        # Encode the sentence and locate every <mask> token.
        token_ids = tokenizer.encode(sent, return_tensors='pt')
        masked_positions = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
        masked_pos = [mask.item() for mask in masked_positions]
        words = []
        with torch.no_grad():
            output = model(token_ids)
        # output[0] holds the vocabulary logits, shape (seq_len, vocab_size) after squeeze.
        logits = output[0].squeeze()
    
        for mask_index in masked_pos:
            # Take the top-k token ids by logit at this mask position.
            idx = torch.topk(logits[mask_index], k=topk, dim=0)[1]
            words = [tokenizer.decode(i.item()).strip() for i in idx]
            print("Mask Predictions:", words)
        return words
    
    # Interactive loop; the input text must contain the "<mask>" token.
    while True:
        sent = input("Text here: \t")
        print("SecureBERT: ")
        predict_mask(sent, tokenizer, model)
    

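Building on step 2, the sketch below turns the per-token embeddings into a single sentence vector via mean pooling; the pooling choice and example sentence are illustrative assumptions, not something the model card prescribes.

    import torch
    from transformers import RobertaTokenizer, RobertaModel

    tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT")
    model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT")

    inputs = tokenizer("Phishing emails often spoof trusted domains.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool over real tokens only, using the attention mask to skip padding.
    mask = inputs["attention_mask"].unsqueeze(-1)             # (batch, seq_len, 1)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)    # (batch, hidden_size)
    sentence_embedding = summed / mask.sum(dim=1)             # (batch, hidden_size)
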
For optimal performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
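
On such a machine, moving the model and tensors to the GPU uses standard PyTorch calls, sketched below with the model and inputs from step 2; nothing here is specific to SecureBERT.

    import torch

    # Use a GPU when one is available; fall back to CPU otherwise.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    outputs = model(**inputs)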

License

SecureBERT is released under the BigScience OpenRAIL-M license, which permits open access, use, and redistribution of the model subject to the license's use-based restrictions.
