Legal-RoBERTa-Base

lexlms

Introduction

Legal-RoBERTa-Base is a language model built on the RoBERTa architecture and adapted for legal language processing. It is part of the LexLM series and was pre-trained on LeXFiles, a specialized corpus of legal texts.

Architecture

The model is derived from RoBERTa-base and uses a tokenizer with a vocabulary of 50,000 Byte-Pair Encoding (BPE) tokens. Embeddings from the original RoBERTa model are reused for tokens that overlap lexically between the two vocabularies, and a sentence sampler balances training across the LeXFiles sub-corpora. The model is cased (it preserves capitalization), in line with recent Pretrained Language Models (PLMs).
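The embedding-reuse idea can be illustrated with a minimal NumPy sketch. The vocabularies, embedding dimension, and initialization scale below are made-up stand-ins, not the actual LexLM values: rows for tokens present in both vocabularies are copied over, while novel legal tokens get fresh random vectors.

```python
import numpy as np

# Hypothetical vocabularies: a small "original" RoBERTa vocab and a new legal vocab.
orig_vocab = {"the": 0, "court": 1, "of": 2, "law": 3}
orig_emb = np.random.randn(len(orig_vocab), 768)

new_vocab = {"the": 0, "court": 1, "plaintiff": 2, "law": 3, "tort": 4}

# Initialize the new embedding matrix with small random values,
# then overwrite rows for lexically overlapping tokens with the pretrained vectors.
new_emb = np.random.randn(len(new_vocab), 768) * 0.02
for token, new_id in new_vocab.items():
    if token in orig_vocab:
        new_emb[new_id] = orig_emb[orig_vocab[token]]

# Overlapping tokens ("the", "court", "law") keep their pretrained embeddings;
# novel tokens ("plaintiff", "tort") start from the fresh initialization.
```

This preserves the lexical knowledge already learned by the original model while letting domain-specific tokens be learned from scratch during continued pre-training.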

Training

The model was pre-trained on the LeXFiles corpus for 1 million steps with an effective batch size of 512 samples, using the Adam optimizer, a learning rate of 1e-4, and a cosine learning rate scheduler. Training was distributed across 8 TPU devices with a per-device batch size of 32 and 2 gradient accumulation steps, yielding the total batch size of 512 (32 × 8 × 2).

Training Hyperparameters

  • Learning Rate: 0.0001
  • Per-Device Batch Size: 32 (train), 32 (eval)
  • Total Train Batch Size: 512 (with gradient accumulation steps = 2)
  • Distributed Type: TPU
  • Num Devices: 8
  • Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
  • Scheduler: Cosine
  • Training Steps: 1,000,000
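The hyperparameters above can be sanity-checked with a short sketch: the effective batch size follows from the per-device batch, device count, and gradient accumulation, and a plain cosine decay shows the shape of the schedule. The warmup configuration is omitted here because the card does not specify it; the exact schedule details are an assumption.

```python
import math

# Effective batch size: per-device batch x devices x gradient accumulation steps.
per_device, devices, grad_accum = 32, 8, 2
effective_batch = per_device * devices * grad_accum  # 32 * 8 * 2 = 512

# Plain cosine decay from the peak LR of 1e-4 over 1M steps
# (no warmup modeled; assumed schedule shape, not the exact training config).
def cosine_lr(step, total_steps=1_000_000, peak_lr=1e-4):
    return 0.5 * peak_lr * (1 + math.cos(math.pi * step / total_steps))

print(effective_batch)       # 512
print(cosine_lr(0))          # starts at the peak: 0.0001
print(cosine_lr(1_000_000))  # decays to 0.0
```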

Training Results

The training process showed steadily decreasing validation loss, indicating effective learning and adaptation to the legal domain.

Framework Versions

  • Transformers: 4.20.0
  • PyTorch: 1.12.0+cu102
  • Datasets: 2.6.1
  • Tokenizers: 0.12.0

Guide: Running Locally

  1. Environment Setup:

    • Install PyTorch and Transformers libraries.
    • Ensure a compatible Python environment (e.g., Python 3.8 or higher).
  2. Model Download:

    • Download the legal-roberta-base model from Hugging Face's model hub.
  3. Inference:

    • Use the Fill-Mask pipeline to predict masked tokens in legal text inputs.
  4. Suggested Cloud GPUs:

    • Utilize cloud platforms like AWS, GCP, or Azure for GPU instances to accelerate training and inference tasks.
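The steps above can be sketched with the transformers Fill-Mask pipeline. This assumes the model is published on the Hugging Face Hub under the id "lexlms/legal-roberta-base" and, like other RoBERTa-style models, uses "<mask>" as its mask token; the example sentence is illustrative.

```python
from transformers import pipeline

# Load the fill-mask pipeline; the model is downloaded from the Hub on first use.
# The model id "lexlms/legal-roberta-base" is assumed here.
fill = pipeline("fill-mask", model="lexlms/legal-roberta-base")

# RoBERTa-style tokenizers use "<mask>" as the mask token.
results = fill("The <mask> shall pay damages to the plaintiff.")

# Each result carries a candidate token and its score.
for r in results:
    print(f"{r['token_str'].strip()!r}: {r['score']:.3f}")
```

Running this on a CPU works for single sentences; a GPU (local or on a cloud instance) mainly pays off for batch inference or fine-tuning.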

License

The Legal-RoBERTa-Base model is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (cc-by-sa-4.0), allowing sharing and adaptation with proper attribution.
