Legal-RoBERTa-Base
Introduction
Legal-RoBERTa-Base is a language model built on the RoBERTa architecture and adapted for legal language processing. It is part of the LexLM series and was pre-trained on LeXFiles, a specialized corpus of legal text.
Architecture
The model is derived from RoBERTa-base and uses a tokenizer with a vocabulary of 50,000 Byte-Pair Encoding (BPE) tokens. It reuses the original RoBERTa embeddings for lexically overlapping tokens and employs a sentence sampler to balance training across the LeXFiles sub-corpora. The model is cased (it preserves capitalization), in line with recent Pretrained Language Models (PLMs).
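As a quick sanity check, the tokenizer can be inspected with the Transformers library. This is a minimal sketch; it assumes the checkpoint is published on the Hugging Face Hub as lexlms/legal-roberta-base.

```python
from transformers import AutoTokenizer

# Assumed Hub id for this checkpoint (the LexLM models live under the "lexlms" org).
tokenizer = AutoTokenizer.from_pretrained("lexlms/legal-roberta-base")

# The vocabulary should be on the order of 50,000 BPE tokens.
print(tokenizer.vocab_size)

# The tokenizer is cased, so capitalization in legal text is preserved.
print(tokenizer.tokenize("The Court GRANTS the motion to dismiss."))
```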
Training
The model was pre-trained on the LeXFiles corpus for 1 million steps. Training ran on a TPU with 8 devices, using a per-device batch size of 32 and 2 gradient accumulation steps, for a total training batch size of 512 samples (32 × 8 × 2). Optimization used Adam with a learning rate of 0.0001 and a cosine learning rate scheduler.
Training Hyperparameters
- Learning Rate: 0.0001
- Per-Device Batch Size: 32 (train), 32 (eval)
- Total Train Batch Size: 512 (gradient accumulation steps: 2)
- Distributed Type: TPU
- Num Devices: 8
- Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
- Scheduler: Cosine
- Training Steps: 1,000,000
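For reference, these hyperparameters map onto Transformers' TrainingArguments roughly as sketched below. This is not the actual pre-training script; output_dir is a placeholder.

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters above.
# Effective batch: 32 per device x 8 devices x 2 accumulation steps = 512.
args = TrainingArguments(
    output_dir="legal-roberta-base",  # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,
    max_steps=1_000_000,
    lr_scheduler_type="cosine",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    tpu_num_cores=8,  # TPU with 8 devices
)
```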
Training Results
Validation loss decreased progressively over the course of training, indicating effective learning and adaptation to the legal domain.
Framework Versions
- Transformers: 4.20.0
- PyTorch: 1.12.0+cu102
- Datasets: 2.6.1
- Tokenizers: 0.12.0
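To reproduce results consistently, it can help to confirm that locally installed versions match those listed above; a minimal check:

```python
import datasets
import tokenizers
import torch
import transformers

# Compare against the versions listed above.
print("transformers", transformers.__version__)  # 4.20.0
print("torch", torch.__version__)                # 1.12.0+cu102
print("datasets", datasets.__version__)          # 2.6.1
print("tokenizers", tokenizers.__version__)      # 0.12.0
```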
Guide: Running Locally
- Environment Setup:
  - Install the PyTorch and Transformers libraries.
  - Ensure a compatible Python environment (e.g., Python 3.8 or higher).
- Model Download:
  - Download the legal-roberta-base model from Hugging Face's model hub.
- Inference:
  - Use the Fill-Mask pipeline to perform masked-token infilling on legal text inputs (see the sketch after this list).
- Suggested Cloud GPUs:
  - Use cloud platforms such as AWS, GCP, or Azure for GPU instances to accelerate training and inference tasks.
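A minimal fill-mask sketch, assuming the checkpoint id lexlms/legal-roberta-base and an illustrative input sentence; RoBERTa-style models use the <mask> token:

```python
from transformers import pipeline

# Assumed Hub id; replace with the actual checkpoint path if it differs.
fill_mask = pipeline("fill-mask", model="lexlms/legal-roberta-base")

# Illustrative legal sentence with one masked token.
results = fill_mask(
    "The applicant submitted that her husband was subjected to treatment "
    "amounting to <mask>."
)

# Each result carries a candidate token and its probability.
for r in results:
    print(f"{r['token_str']!r}: {r['score']:.4f}")
```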
License
The Legal-RoBERTa-Base model is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (cc-by-sa-4.0), which allows sharing and adaptation with attribution, provided that derivatives are distributed under the same license.