NepaliBERT (Shushant/nepaliBERT)
Introduction
NepaliBERT is a masked language model specifically designed for the Nepali language. It is trained on a large corpus of Nepali news articles, making it well suited to NLP tasks on Devanagari-script Nepali text.
Architecture
NepaliBERT is a fine-tuned version of the BERT Base Uncased model: the pretrained checkpoint is further trained with the masked-language-modeling objective on a corpus of Nepali news articles comprising approximately 4.6 GB of text. The model is optimized for Nepali-language tasks and reports a perplexity of 8.56, which its authors describe as state-of-the-art performance.
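For a masked language model, perplexity is conventionally computed as the exponential of the mean cross-entropy loss on held-out text. The snippet below is a minimal sketch of that relationship; the loss value is back-calculated from the reported perplexity for illustration and is not taken from the actual training logs.
import math

# Perplexity is exp(mean masked-token cross-entropy loss).
# eval_loss here is back-calculated from the reported perplexity (illustrative only).
eval_loss = 2.147
perplexity = math.exp(eval_loss)
print(f"{perplexity:.2f}")  # prints 8.56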
Training
The training dataset consists of approximately 85,467 news articles, totaling around 4.3 GB of text. Evaluation was conducted on a smaller dataset of approximately 12 MB. Training was performed using the Hugging Face Trainer API on a Tesla V100 GPU, facilitated by Kathmandu University's supercomputer. The training process lasted approximately 3 days, 8 hours, and 57 minutes.
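As a rough illustration of how such a run can be set up with the Trainer API, the following sketch shows standard masked-language-model training. The file names, starting checkpoint, and hyperparameters are assumptions for illustration, not the exact configuration used for NepaliBERT.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Starting checkpoint is an assumption; the model card states NepaliBERT builds on BERT Base Uncased.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical plain-text files holding the Nepali news corpus.
raw = load_dataset("text", data_files={"train": "nepali_train.txt", "validation": "nepali_eval.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens on the fly, as in standard BERT pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="nepaliBERT-mlm",     # illustrative output directory
    per_device_train_batch_size=16,  # assumed batch size
    num_train_epochs=3,              # assumed number of epochs
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
)
trainer.train()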
Guide: Running Locally
To use NepaliBERT locally, follow these steps:
- Install the Transformers Library: Ensure you have the Hugging Face Transformers library installed.
pip install transformers
- Load the Model and Tokenizer:
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Shushant/nepaliBERT")
model = AutoModelForMaskedLM.from_pretrained("Shushant/nepaliBERT")
- Set Up the Fill-Mask Pipeline:
from transformers import pipeline

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
- Inference Example:
from pprint import pprint

pprint(fill_mask(f"तिमीलाई कस्तो {tokenizer.mask_token}."))
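The fill-mask pipeline returns a ranked list of candidate completions, each with a confidence score, the predicted token, and the completed sentence. The structure looks roughly as follows; the scores and predicted words are purely illustrative, not actual NepaliBERT output.
# Illustrative output shape only (values are made up):
# [{'score': 0.41, 'token': 1234, 'token_str': 'छ', 'sequence': 'तिमीलाई कस्तो छ .'},
#  {'score': 0.17, 'token': 5678, 'token_str': 'लाग्छ', 'sequence': 'तिमीलाई कस्तो लाग्छ .'},
#  ...]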
For optimal performance, consider using cloud GPUs such as those offered by Google Cloud, AWS, or Azure.
License
NepaliBERT is distributed under the MIT License, allowing for broad use and modification by the community.