legalbert-large-1.7M-2 (pile-of-law)

Introduction
The Pile of Law BERT Large Model 2 (Uncased) is a transformers model based on the BERT large architecture. It is pretrained on the Pile of Law dataset, approximately 256 GB of English legal and administrative text. The model targets masked language modeling out of the box and can be fine-tuned for specific legal applications.
Architecture
The model uses the BERT large architecture with a custom vocabulary of 32,000 tokens: 29,000 word-piece tokens learned from the Pile of Law dataset plus 3,000 legal terms drawn from Black's Law Dictionary. Pretraining uses the RoBERTa-style objective, i.e., masked language modeling alone.
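To make the word-piece vocabulary concrete, here is a minimal sketch of the greedy longest-match algorithm that WordPiece-style tokenizers use to split an out-of-vocabulary legal term into subword pieces. The tiny vocabulary below is hypothetical, not the model's actual 32,000-token vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match WordPiece tokenization of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Find the longest vocabulary entry matching at this position.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry a "##" prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no match at all: the whole word becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary entries for illustration only.
vocab = {"estop", "##pel", "habeas", "corpus"}
print(wordpiece_tokenize("estoppel", vocab))  # ['estop', '##pel']
```

A vocabulary tailored to legal text keeps terms like these intact or in few pieces, instead of fragmenting them into many generic subwords.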
Training
The model is pretrained using the Pile of Law dataset, which includes diverse sources such as court opinions, legal analyses, and statutes. Training was conducted on a SambaNova cluster with 8 RDUs, focusing on masked language modeling without next sentence prediction (NSP) loss. The model was trained for 1.7 million steps using a learning rate of 5e-6 and a batch size of 128. The dataset is under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
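The masked-language-modeling objective described above can be sketched in plain Python. This is a toy illustration of the standard BERT/RoBERTa masking recipe (select roughly 15% of positions; of those, 80% become `[MASK]`, 10% a random token, 10% stay unchanged), not the project's actual training code.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Apply the 80/10/10 MLM corruption to a token sequence.

    Returns the corrupted sequence plus labels: the original token at
    each selected position, None everywhere else (no loss computed).
    """
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                masked[i] = mask_token          # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = rng.choice(vocab)   # 10%: replace with random token
            # else: 10% left unchanged, but still a prediction target
    return masked, labels

tokens = "the court granted the motion to dismiss the appeal".split()
masked, labels = mask_tokens(tokens, vocab=["law", "court"], seed=0)
```

Because there is no next-sentence-prediction head, the loss is computed only at the positions where `labels` is not `None`.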
Guide: Running Locally
- Install Libraries: Ensure `transformers` and either `torch` or `tensorflow` (depending on preference) are installed.
- Load the Model:
```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
```
- Tokenize and Run Inference:
```python
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
- Cloud GPUs: For faster processing, consider using cloud GPU services like AWS EC2, Google Cloud Platform, or Azure.
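Once you have `output`, a common next step is to collapse the per-token hidden states into a single sentence vector. Below is a self-contained sketch of attention-mask-aware mean pooling; in practice the inputs would be `output.last_hidden_state` and `encoded_input['attention_mask']`, but tiny stand-in lists are used here so the logic runs on its own.

```python
def mean_pool(hidden_states, attention_mask):
    """Average per-token vectors, skipping padding positions.

    hidden_states: one vector per token; attention_mask: 1 for real
    tokens, 0 for padding. Only real tokens contribute to the mean.
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(hidden_states, attention_mask):
        if m:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    return [t / count for t in total]

hidden = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row is padding
mask = [1, 1, 0]
print(mean_pool(hidden, mask))  # [2.0, 3.0]
```

Masking out padding matters when sentences in a batch have different lengths; otherwise padding vectors drag the average toward zero.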
License
The Pile of Law dataset, used for training, is covered under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license. This imposes certain restrictions on commercial use. Please refer to the Pile of Law documentation for more details regarding copyright and bias considerations.