Legal-BERT

Introduction

Legal-BERT is a BERT model and tokenizer adapted to legal text, intended to improve BERT's performance in the legal domain. It was developed as part of the research paper "When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset" and is evaluated on the accompanying CaseHOLD dataset of over 53,000 legal holdings.

Architecture

The model is based on the BERT architecture, specifically the bert-base-uncased variant, which contains 110 million parameters. It adapts the tokenization and sentence segmentation processes to suit legal texts.
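As a quick sanity check on these numbers, the configuration can be inspected with the Hugging Face transformers library. The sketch below loads the public bert-base-uncased checkpoint, whose architecture Legal-BERT shares; swap in the Legal-BERT hub identifier to inspect the released weights themselves.

    # Inspect the bert-base-uncased configuration that Legal-BERT builds on.
    from transformers import AutoConfig, AutoModel

    config = AutoConfig.from_pretrained("bert-base-uncased")
    print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
    # 12 layers, hidden size 768, 12 attention heads

    model = AutoModel.from_pretrained("bert-base-uncased")
    total_params = sum(p.numel() for p in model.parameters())
    print(f"{total_params / 1e6:.0f}M parameters")  # roughly 110M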

Training

The Legal-BERT model was pretrained using a large corpus derived from the Harvard Law case database, spanning documents from 1965 to the present. This dataset comprises 37GB of data, significantly larger than the original BookCorpus/Wikipedia corpus used for training BERT. The pretraining process involved an additional 1 million steps using the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives.
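The continued-pretraining setup can be sketched with the transformers Trainer API. The example below covers only the MLM objective for brevity (the original run also used NSP, which additionally requires sentence-pair construction); the corpus file name and hyperparameters are illustrative assumptions, not the values used in the paper.

    from datasets import load_dataset
    from transformers import (
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    # Assumed plain-text corpus file, one passage per line
    raw = load_dataset("text", data_files={"train": "harvard_case_law.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=128)

    tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

    # Randomly masks 15% of tokens, the standard MLM setting
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir="legalbert-continued-pretraining",
        per_device_train_batch_size=16,
        max_steps=1_000_000,  # the model card reports ~1M additional steps
    )

    Trainer(
        model=model,
        args=args,
        train_dataset=tokenized,
        data_collator=collator,
    ).train()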

Guide: Running Locally

  1. Clone the casehold repository from GitHub to access scripts for pretraining loss computation and finetuning.
  2. Load the Legal-BERT model and tokenizer files from the Hugging Face model hub (see the sketch after this list).
  3. Use the scripts to finetune the model on classification and multiple choice tasks, as described in the original research.
  4. For optimal performance, consider utilizing cloud GPUs, such as those offered by Google Cloud, AWS, or Azure.
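A minimal loading and sanity-check sketch for step 2 is shown below. The hub identifier "zlucia/legalbert" is an assumption; substitute the model ID you actually pulled from the hub.

    from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

    model_id = "zlucia/legalbert"  # assumed hub identifier

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)

    # Quick fill-mask check before finetuning on downstream tasks
    fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
    text = f"The court granted the motion to {tokenizer.mask_token} the complaint."
    for prediction in fill_mask(text):
        print(prediction["token_str"], round(prediction["score"], 3))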

License

The Legal-BERT model and its associated resources are available under the terms specified in the casehold repository. Users should refer to the repository for specific licensing details.
