BanglaBERT
Introduction
BanglaBERT is a pretrained ELECTRA discriminator model designed for natural language understanding tasks in Bengali. It was trained with the Replaced Token Detection (RTD) objective and achieves state-of-the-art results on Bengali NLP tasks such as sentiment classification, named entity recognition, and natural language inference.
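As an illustration of downstream use (not part of the official release), the pretrained checkpoint can be loaded with a task-specific head for fine-tuning. In the hypothetical sketch below, the classification head is newly initialized and `num_labels=2` is a placeholder for your task:

```python
# Hypothetical fine-tuning setup: reuse the pretrained encoder with a fresh
# sequence-classification head (e.g., for binary sentiment classification).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
model = AutoModelForSequenceClassification.from_pretrained(
    "csebuetnlp/banglabert",
    num_labels=2,  # placeholder: set to your task's number of classes
)
# The new head is randomly initialized; fine-tune on labeled data before use.
```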
Architecture
BanglaBERT is built on the ELECTRA architecture, which trains transformer encoders sample-efficiently: rather than predicting masked tokens, a discriminator learns to identify which tokens in the input sequence were replaced by a small generator, so every token position contributes to the training signal.
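As a rough conceptual sketch (the tokens and labels below are illustrative, not drawn from the actual training data), RTD turns pretraining into per-token binary classification:

```python
# Conceptual sketch of Replaced Token Detection (RTD).
# A small generator corrupts some input tokens; the discriminator
# (BanglaBERT) then predicts, for every position, whether the token
# is original (label 0) or replaced (label 1).
original_tokens  = ["আমি", "ভাত", "খাই"]  # "I eat rice"
corrupted_tokens = ["আমি", "জল", "খাই"]   # generator swapped "ভাত" (rice) for "জল" (water)
rtd_labels       = [0, 1, 0]              # per-token targets for the discriminator
```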
Training
The model was pretrained on a large corpus of Bengali text, with a normalization pipeline applied to standardize the input data; this preprocessing step is crucial for matching the pretraining distribution and achieving optimal performance on downstream tasks. Pretraining involved large-scale data collection and processing, resulting in significant improvements over previous multilingual and Bengali monolingual models.
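One class of issue such a pipeline addresses is Unicode inconsistency: visually identical Bengali text can be encoded with different codepoint sequences. The sketch below uses Python's standard `unicodedata` module to show the effect (the csebuetnlp normalizer's exact transformations may differ):

```python
import unicodedata

# The vowel sign "ো" has a composed form (U+09CB) and a decomposed form
# (U+09C7 + U+09BE). The two render identically but would map to different
# token IDs without consistent normalization.
composed = unicodedata.normalize("NFC", "ভালো")    # "good"
decomposed = unicodedata.normalize("NFD", "ভালো")  # same text, decomposed
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5
```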
Guide: Running Locally
To run BanglaBERT locally, follow these steps:
- Install Required Libraries:
  Ensure you have Python and PyTorch installed. Use pip to install the Hugging Face Transformers library and the custom normalizer:

  ```bash
  pip install torch transformers
  pip install git+https://github.com/csebuetnlp/normalizer
  ```
- Load the Model and Tokenizer:
  Use the following code to load BanglaBERT:

  ```python
  from transformers import AutoModelForPreTraining, AutoTokenizer
  from normalizer import normalize
  import torch

  model = AutoModelForPreTraining.from_pretrained("csebuetnlp/banglabert")
  tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
  ```
- Prepare Input Data:
  Normalize and tokenize your input sentence:

  ```python
  sentence = "আমি কৃতজ্ঞ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"
  normalized_sentence = normalize(sentence)
  inputs = tokenizer.encode(normalized_sentence, return_tensors="pt")
  ```
- Run Model Predictions:
  Pass the tokenized inputs through the model to get per-token logits, then threshold them into binary predictions (1 = replaced token, 0 = original token); a complete end-to-end sketch follows this list:

  ```python
  outputs = model(inputs).logits
  predictions = torch.round((torch.sign(outputs) + 1) / 2)
  ```
- Utilize Cloud GPUs:
  For improved performance, consider using cloud-based GPU services like AWS, Google Cloud, or Azure to handle the computational needs of running BanglaBERT.
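Putting the steps together, here is a self-contained sketch that maps each token to its RTD prediction; the token-by-token printout is illustrative, not an official API:

```python
from transformers import AutoModelForPreTraining, AutoTokenizer
from normalizer import normalize
import torch

model = AutoModelForPreTraining.from_pretrained("csebuetnlp/banglabert")
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")

sentence = "আমি কৃতজ্ঞ কারণ আপনি আমার জন্য অনেক কিছু করেছেন।"
inputs = tokenizer.encode(normalize(sentence), return_tensors="pt")

with torch.no_grad():  # inference only, no gradients needed
    logits = model(inputs).logits

# A positive logit means the discriminator considers the token replaced.
predictions = torch.round((torch.sign(logits) + 1) / 2)
for token, pred in zip(tokenizer.convert_ids_to_tokens(inputs[0].tolist()),
                       predictions[0].tolist()):
    print(f"{token}\t{'replaced' if pred else 'original'}")
```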
License
BanglaBERT is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). This permits sharing and adaptation with attribution, provided derivatives are distributed under the same license, but prohibits commercial use.