InLegalBERT

law-ai

Introduction

InLegalBERT is a pre-trained language model tailored for the Indian legal domain. It is initialized from the LEGAL-BERT-SC model and further pre-trained on a large corpus of Indian legal documents, improving natural language processing performance in this specialized field.

Architecture

InLegalBERT shares the architecture of the bert-base-uncased model, featuring:

  • 12 hidden layers
  • 768 hidden dimensions
  • 12 attention heads
  • Approximately 110 million parameters
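The ~110 million figure can be sanity-checked with a back-of-the-envelope count. The sketch below assumes the standard bert-base-uncased hyperparameters not stated above (30,522-token vocabulary, 512 max positions, 3,072-dimensional feed-forward layer):

```python
# Back-of-the-envelope parameter count for a bert-base-style encoder.
# Vocabulary/position/feed-forward sizes are assumed from bert-base-uncased,
# not stated in the model card.
vocab, hidden, layers, intermediate = 30522, 768, 12, 3072
max_positions, type_vocab = 512, 2

# Embeddings: word + position + token-type tables, plus LayerNorm (gamma, beta)
embeddings = (vocab + max_positions + type_vocab) * hidden + 2 * hidden

# Per transformer layer: Q/K/V/output projections, feed-forward, two LayerNorms
attention = 4 * (hidden * hidden + hidden)
feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
layer_norms = 2 * 2 * hidden
per_layer = attention + feed_forward + layer_norms

# Pooler head used by BERT's NSP objective
pooler = hidden * hidden + hidden

total = embeddings + layers * per_layer + pooler
print(total)  # 109482240, i.e. roughly 110 million
```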

The model uses the same tokenizer as LegalBERT, keeping the tokenization of legal text consistent with its parent model.

Training

Training Data

The pre-training corpus comprises around 5.4 million legal documents from the Indian Supreme Court and various High Courts, spanning from 1950 to 2019. The dataset covers diverse legal domains, such as Civil, Criminal, and Constitutional law, amounting to approximately 27 GB of text data.

Training Setup

The model is initialized with the LEGAL-BERT-SC model and further trained for 300,000 steps using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks.
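The MLM objective replaces a random subset of input tokens with a [MASK] symbol and trains the model to predict the originals. A minimal, dependency-free sketch of the 15% masking scheme (BERT's 80/10/10 mask/random/keep split for selected positions is simplified here to always masking):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Replace each token with mask_token with probability mask_prob.

    Simplified MLM masking: real BERT masks 80% of selected positions,
    swaps 10% for random tokens, and keeps 10% unchanged. Returns
    (masked tokens, labels) where labels hold the original token at
    masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)  # not part of the MLM loss
    return masked, labels

tokens = "the court dismissed the appeal with costs".split()
masked, labels = mask_tokens(tokens, seed=1)
# with seed=1, only the first token is masked:
# ['[MASK]', 'court', 'dismissed', 'the', 'appeal', 'with', 'costs']
```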

Fine-Tuning Results

InLegalBERT demonstrates superior performance compared to LegalBERT across three legal tasks with Indian datasets:

  • Legal Statute Identification (ILSI Dataset)
  • Semantic Segmentation (ISS Dataset)
  • Court Judgment Prediction (ILDC Dataset)

Guide: Running Locally

To use InLegalBERT for text embedding, follow these steps:

  1. Install Transformers: Ensure you have the Hugging Face Transformers library installed:
    pip install transformers
    
  2. Import and Load Model:
    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")
    model = AutoModel.from_pretrained("law-ai/InLegalBERT")
    
  3. Encode and Process Text:
    text = "Replace this string with yours"
    encoded_input = tokenizer(text, return_tensors="pt")  # tokenize to PyTorch tensors
    output = model(**encoded_input)
    last_hidden_state = output.last_hidden_state  # one 768-dim vector per token
    
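The last_hidden_state above holds one 768-dimensional vector per token; to obtain a single sentence embedding, one common (though not the only) choice is mean pooling over non-padding positions. A dependency-free sketch using plain Python lists rather than tensors:

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors where attention_mask == 1 (ignore padding).

    hidden_states: list of token vectors (lists of floats), one per position.
    attention_mask: list of 0/1 ints of the same length.
    """
    dim = len(hidden_states[0])
    sums = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, v in enumerate(vec):
                sums[i] += v
    return [s / count for s in sums]

# Two real tokens plus one padding position (toy 2-dim vectors for brevity)
states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(states, mask))  # [2.0, 3.0]
```

With real model output, the same averaging is usually done directly on the tensors, weighting last_hidden_state by the attention mask from the tokenizer.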

For optimal performance, consider using cloud-based GPUs from providers like AWS, Google Cloud, or Azure.

License

InLegalBERT is distributed under the MIT License, allowing for flexible reuse and modification.
