PrivBERT
Introduction
PrivBERT is a language model specifically designed for privacy policies. It was pre-trained on approximately 1 million privacy policies, using the RoBERTa model as a starting point. The data used for training is accessible at the PrivaSeer website.
Architecture
PrivBERT retains the RoBERTa model architecture unchanged; its adaptation to privacy policies comes from continued pre-training on in-domain text rather than from architectural modifications. This lets the model inherit the general language understanding of the transformer-based RoBERTa framework while specializing in the language of privacy policies.
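As a quick sanity check, you can confirm the underlying architecture from the published configuration. A minimal sketch, assuming network access to the Hugging Face Hub and using the model identifier mukund/privbert from the loading example later in this guide:

    from transformers import AutoConfig

    # Fetch the model configuration from the Hugging Face Hub.
    config = AutoConfig.from_pretrained("mukund/privbert")

    # PrivBERT reuses the RoBERTa architecture, so this prints "roberta".
    print(config.model_type)
    print(config.num_hidden_layers, config.hidden_size)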
Training
The model was pre-trained on a large corpus of roughly one million privacy policies, giving it a robust grasp of privacy-related language. Because pre-training started from an existing RoBERTa checkpoint rather than from scratch, PrivBERT combines general language understanding with domain proficiency in privacy policy documents.
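Since RoBERTa (and therefore PrivBERT) is pre-trained with a masked-language-modeling objective, one way to probe what the model learned is to ask it to fill in a masked token. A minimal sketch, assuming the published checkpoint includes the masked-language-modeling head (if it does not, Transformers initializes one randomly and the predictions are meaningless); the input sentence is an illustrative example, not drawn from the training data:

    from transformers import pipeline

    # fill-mask exercises the masked-language-modeling objective that
    # PrivBERT was pre-trained with; RoBERTa's mask token is <mask>.
    fill = pipeline("fill-mask", model="mukund/privbert")

    for pred in fill("We may share your personal <mask> with third parties."):
        print(f"{pred['token_str'].strip():<15} {pred['score']:.3f}")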
Guide: Running Locally
To use PrivBERT locally, follow these basic steps:
- Install the Hugging Face Transformers library:

  pip install transformers
- Load the model and tokenizer:

  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("mukund/privbert")
  model = AutoModel.from_pretrained("mukund/privbert")
- Run the model on your text data as needed; a minimal usage sketch follows this list.
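A minimal usage sketch for extracting sentence embeddings, assuming PyTorch is installed; the input sentence and the mean-pooling strategy are illustrative choices, not part of the model's published API:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("mukund/privbert")
    model = AutoModel.from_pretrained("mukund/privbert")

    # Tokenize a privacy-policy sentence (illustrative example).
    text = "We collect your email address to provide account services."
    inputs = tokenizer(text, return_tensors="pt", truncation=True)

    # Forward pass without gradient tracking; last_hidden_state has
    # shape (batch, sequence_length, hidden_size).
    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool the token embeddings into a single sentence vector.
    embedding = outputs.last_hidden_state.mean(dim=1)
    print(embedding.shape)  # torch.Size([1, 768]) for a base-size model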
For intensive tasks, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure to handle model computations efficiently.
License
PrivBERT is available under a CC BY-NC-SA license for research, teaching, and scholarship. Users must cite the following paper when using the model or the underlying corpus in research:
Mukund Srinath, Shomir Wilson, and C. Lee Giles. Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies. In Proceedings of ACL 2021.
For commercial use inquiries, please contact the model creators.