InLegalBERT
Introduction
InLegalBERT is a pre-trained language model tailored to the Indian legal domain. It is initialized from the LEGAL-BERT-SC checkpoint and further pre-trained on a large corpus of Indian legal documents, improving performance on natural language processing tasks in this specialized field.
Architecture
InLegalBERT shares the architecture of the bert-base-uncased model, featuring:
- 12 hidden layers
- A hidden size of 768
- 12 attention heads
- Approximately 110 million parameters
The model uses the same tokenizer as LEGAL-BERT, ensuring consistent handling of legal text.
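As a quick sanity check, these settings can be read directly from the published configuration. This is a minimal sketch assuming only the Transformers library and the law-ai/InLegalBERT checkpoint named above:

from transformers import AutoConfig, AutoModel

# Read the architecture settings from the published config
config = AutoConfig.from_pretrained("law-ai/InLegalBERT")
print(config.num_hidden_layers)    # 12
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12

# Count parameters (approximately 110 million)
model = AutoModel.from_pretrained("law-ai/InLegalBERT")
print(sum(p.numel() for p in model.parameters()))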
Training
Training Data
The pre-training corpus comprises around 5.4 million legal documents from the Indian Supreme Court and various High Courts, spanning from 1950 to 2019. The dataset covers diverse legal domains, such as Civil, Criminal, and Constitutional law, amounting to approximately 27 GB of text data.
Training Setup
The model is initialized from the LEGAL-BERT-SC checkpoint and trained for a further 300,000 steps on the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives.
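The original pre-training scripts are not part of this card. As an illustration only, the sketch below shows continued domain pre-training with the MLM objective via the Hugging Face Trainer; NSP is omitted for brevity, corpus.txt is a hypothetical placeholder for the Indian legal corpus, the batch size is an assumption, and nlpaueb/legal-bert-base-uncased is assumed to be the LEGAL-BERT-SC checkpoint (only the 300,000-step count comes from the card):

from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

# Start from the LEGAL-BERT-SC checkpoint, as described above
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("nlpaueb/legal-bert-base-uncased")

# corpus.txt is a hypothetical stand-in for the Indian legal corpus
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="corpus.txt", block_size=512)

# Randomly mask 15% of tokens for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="inlegalbert-mlm",
                         max_steps=300_000,  # step count reported above
                         per_device_train_batch_size=8)  # assumed batch size

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=dataset).train()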
Fine-Tuning Results
InLegalBERT outperforms LEGAL-BERT on three legal tasks with Indian datasets (a minimal fine-tuning sketch follows the list):
- Legal Statute Identification (ILSI Dataset)
- Semantic Segmentation (ISS Dataset)
- Court Judgment Prediction (ILDC Dataset)
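None of the fine-tuned task heads ship with the base checkpoint. As a rough, hypothetical illustration, court judgment prediction can be framed as binary sequence classification on top of InLegalBERT (the example text and label below are invented):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")
# Attach a randomly initialised two-way classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "law-ai/InLegalBERT", num_labels=2)

# Hypothetical example: predict whether the claim is accepted (1) or rejected (0)
batch = tokenizer(["The appeal is allowed and the conviction is set aside."],
                  return_tensors="pt", truncation=True, max_length=512)
labels = torch.tensor([1])

# Fine-tuning minimises this loss over a labelled dataset
loss = model(**batch, labels=labels).loss
loss.backward()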
Guide: Running Locally
To use InLegalBERT for text embedding, follow these steps:
- Install Transformers: Ensure you have the Hugging Face Transformers library installed:
pip install transformers
- Import and Load Model:
# Load the tokenizer and model from the Hugging Face Hub
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")
model = AutoModel.from_pretrained("law-ai/InLegalBERT")
- Encode and Process Text:
text = "Replace this string with yours"
# Tokenize the text and run a forward pass through the model
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
last_hidden_state = output.last_hidden_state  # shape: (1, seq_len, 768)
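The last hidden state holds one 768-dimensional vector per token. Continuing from the snippet above, a common way to reduce it to a single document embedding is mean pooling over non-padding tokens; this is a usual convention rather than something the model card prescribes:

# Average token vectors, ignoring padding positions
mask = encoded_input["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
embedding = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 768])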
For optimal performance, consider using cloud-based GPUs from providers like AWS, Google Cloud, or Azure.
License
InLegalBERT is distributed under the MIT License, allowing for flexible reuse and modification.