SecureBERT
Introduction
SecureBERT is a domain-specific language model based on RoBERTa, designed to understand and represent cybersecurity text. It was trained on a large corpus of in-domain text collected from online resources, and it outperforms existing models such as RoBERTa, SciBERT, and SecBERT on masked word prediction while retaining strong general English language understanding.
Architecture
SecureBERT is built on the RoBERTa architecture, fine-tuned with cybersecurity-specific data. This enables it to effectively handle various downstream tasks, including text classification, named entity recognition (NER), sequence-to-sequence tasks, and question answering (QA).
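As an illustration, the checkpoint can be loaded into the standard task heads provided by the transformers library. The sketch below assumes a hypothetical binary text-classification setup; the label count and example sentence are illustrative, not part of the SecureBERT release, and the classification head is randomly initialized, so it must be fine-tuned before its predictions are meaningful.

import torch
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

# Hypothetical two-label setup (e.g., benign vs. malicious text);
# the classification head is newly initialized and needs fine-tuning.
tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT")
model = RobertaForSequenceClassification.from_pretrained(
    "ehsanaghaei/SecureBERT", num_labels=2
)

inputs = tokenizer("The attacker escalated privileges via a kernel exploit.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # predicted label index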
Training
The model is trained on a large corpus of cybersecurity-related text using the masked language modeling objective, sharpening its ability to predict masked words and understand context within the cybersecurity domain. The training process targets strong performance on cybersecurity-specific tasks without sacrificing general language understanding.
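As a rough sketch of that masked language modeling setup, the snippet below runs one pass of continued pretraining from a RoBERTa checkpoint. The two-sentence corpus and the training arguments are placeholders, not the actual SecureBERT training configuration.

from transformers import (RobertaTokenizerFast, RobertaForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Placeholder in-domain corpus; the real model used a large
# cybersecurity corpus gathered from online resources.
texts = ["The malware exfiltrates credentials over DNS.",
         "Patch the server to mitigate the vulnerability."]
encodings = tokenizer(texts, truncation=True, padding=True)
dataset = [{"input_ids": ids, "attention_mask": mask}
           for ids, mask in zip(encodings["input_ids"],
                                encodings["attention_mask"])]

# Randomly mask 15% of tokens: the standard MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="securebert-mlm",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()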
Guide: Running Locally
To use SecureBERT locally, follow these steps:
- Install Dependencies:

pip install transformers torch tokenizers
- Load the Model:

from transformers import RobertaTokenizer, RobertaModel
import torch

tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT")

# Tokenize an example sentence and extract contextual token embeddings
inputs = tokenizer("This is SecureBERT!", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
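The last_hidden_states tensor above holds one contextual vector per token. If a single fixed-size vector per input is needed, for example for clustering or similarity search, one common recipe is attention-masked mean pooling. The snippet below is a minimal sketch continuing from the variables above, not an official SecureBERT utility.

# Mean-pool token embeddings, ignoring padding positions via the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1)       # shape: (batch, seq_len, 1)
summed = (last_hidden_states * mask).sum(dim=1)     # sum real-token vectors only
sentence_embedding = summed / mask.sum(dim=1)       # shape: (batch, hidden_size)
print(sentence_embedding.shape)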
- Fill Mask Example:

import torch
from transformers import RobertaTokenizerFast, RobertaForMaskedLM

tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT")
model = RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT")

def predict_mask(sent, tokenizer, model, topk=10):
    # Encode the sentence and locate every mask token position
    token_ids = tokenizer.encode(sent, return_tensors='pt')
    masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
    masked_pos = [mask.item() for mask in masked_position]

    words = []
    with torch.no_grad():
        output = model(token_ids)
    last_hidden_state = output[0].squeeze()

    for mask_index in masked_pos:
        # Decode the top-k highest-scoring vocabulary entries for this mask
        mask_hidden_state = last_hidden_state[mask_index]
        idx = torch.topk(mask_hidden_state, k=topk, dim=0)[1]
        words = [tokenizer.decode(i.item()).strip() for i in idx]
        print("Mask Predictions:", words)
    return words

while True:
    sent = input("Text here: \t")
    print("SecureBERT: ")
    predict_mask(sent, tokenizer, model)
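Note that predict_mask expects the input to contain the tokenizer's mask token, which is <mask> for RoBERTa-style tokenizers. For example, entering a prompt such as "Attackers can <mask> sensitive data from the network." will print the model's top-10 candidate tokens for the masked position.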
For optimal performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
License
SecureBERT is released under the BigScience OpenRAIL-M license, which permits open access and use of the model subject to the license's use-based restrictions.