NepaliBERT

Shushant

Introduction

NepaliBERT is a masked language model built specifically for the Nepali language. It is trained on a large corpus of Nepali news articles, making it well suited to NLP tasks on Devanagari-script Nepali text.

Architecture

NepaliBERT is based on the BERT Base Uncased architecture and is pre-trained on a corpus of Nepali news articles comprising approximately 4.6 GB of text. The model is optimized for Nepali-language tasks and reports state-of-the-art performance with a perplexity of 8.56 on its evaluation set.
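
Because the checkpoint follows the standard BERT Base layout, its dimensions can be checked directly from the published configuration. The snippet below is a minimal sketch; the expected values in the comments assume the standard BERT Base settings (12 layers, hidden size 768, 12 attention heads).

    from transformers import AutoConfig

    # Load the configuration of the published checkpoint from the Hugging Face Hub.
    config = AutoConfig.from_pretrained("Shushant/nepaliBERT")

    print(config.model_type)           # expected: "bert"
    print(config.num_hidden_layers)    # BERT Base: 12 transformer layers
    print(config.hidden_size)          # BERT Base: hidden size of 768
    print(config.num_attention_heads)  # BERT Base: 12 attention heads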

Training

The training dataset consists of approximately 85,467 news articles, totaling around 4.3 GB of text; evaluation was conducted on a smaller held-out dataset of approximately 12 MB. Training was performed with the Hugging Face Trainer API on a Tesla V100 GPU provided by Kathmandu University's supercomputer and took approximately 3 days, 8 hours, and 57 minutes.
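
The training script itself is not included in the model card. The following is a minimal sketch of how masked-language-model training with the Trainer API typically looks; the file names, starting checkpoint, and hyperparameters are illustrative assumptions rather than the authors' exact setup, and the final lines show that perplexity is simply the exponential of the evaluation loss.

    import math
    from datasets import load_dataset
    from transformers import (
        BertForMaskedLM,
        BertTokenizerFast,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    # Hypothetical file names; the actual news corpus is not distributed with the model.
    raw = load_dataset(
        "text",
        data_files={"train": "nepali_news_train.txt", "eval": "nepali_news_eval.txt"},
    )

    # Start from BERT Base Uncased, as described in the Architecture section.
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

    # Standard BERT masking: 15% of tokens are masked for the MLM objective.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir="nepaliBERT",
        per_device_train_batch_size=16,  # illustrative; actual batch size is not reported
        num_train_epochs=3,              # illustrative; actual epoch count is not reported
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["eval"],
        data_collator=collator,
    )
    trainer.train()

    # Perplexity is the exponential of the evaluation cross-entropy loss.
    eval_loss = trainer.evaluate()["eval_loss"]
    print(f"Perplexity: {math.exp(eval_loss):.2f}")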

Guide: Running Locally

To use NepaliBERT locally, follow these steps:

  1. Install Transformers Library: Ensure you have the Hugging Face Transformers library installed.

    pip install transformers
    
  2. Load the Model and Tokenizer:

    from transformers import AutoTokenizer, AutoModelForMaskedLM
    
    tokenizer = AutoTokenizer.from_pretrained("Shushant/nepaliBERT")
    model = AutoModelForMaskedLM.from_pretrained("Shushant/nepaliBERT")
    
  3. Setup the Fill-Mask Pipeline:

    from transformers import pipeline
    
    fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
    
  4. Inference Example:

    from pprint import pprint

    # The Nepali prompt roughly means "How do you ...?", with the mask token
    # standing in for the missing word.
    pprint(fill_mask(f"तिमीलाई कस्तो {tokenizer.mask_token}."))
    

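The fill-mask pipeline returns a ranked list of candidate completions, each a dictionary containing the predicted token, its probability, and the completed sentence. Below is a small sketch of iterating over these predictions; the variable name predictions is illustrative.

    # Each prediction is a dict with 'score', 'token', 'token_str', and 'sequence' keys.
    predictions = fill_mask(f"तिमीलाई कस्तो {tokenizer.mask_token}.")
    for p in predictions:
        # Print each candidate token together with the model's probability for it.
        print(f"{p['token_str']}\t{p['score']:.4f}")
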
For optimal performance, consider using cloud GPUs such as those offered by Google Cloud, AWS, or Azure.

License

NepaliBERT is distributed under the MIT License, allowing for broad use and modification by the community.
