banglat5_nmt_bn_en (csebuetnlp)
Introduction
The BanglaT5_NMT_BN_EN model is a fine-tuned version of BanglaT5, designed specifically for Bengali-to-English translation. It was fine-tuned on the BanglaNMT dataset and is part of a suite of tools aimed at improving machine translation for low-resource languages such as Bengali.
Architecture
The model is based on the T5 architecture and uses a sequence-to-sequence learning approach. It relies on a dedicated text normalization pipeline to preprocess input data, which improves translation accuracy.
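As a quick illustration of that preprocessing step, the sketch below applies the normalize function from the csebuetnlp/normalizer package (installed in the guide further down) to a sample Bengali sentence; the sentence itself is just an example:

    from normalizer import normalize  # pip install git+https://github.com/csebuetnlp/normalizer

    raw_text = "এটি একটি উদাহরণ বাক্য।"  # example Bengali input ("This is an example sentence.")
    clean_text = normalize(raw_text)   # applies the package's unicode normalization (NFKC by default)
    print(clean_text)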
Training
On the BanglaNMT test set, the model achieves a SacreBLEU score of 38.8, surpassing mT5 (base), XLM-ProphetNet, mBART-50, and IndicBART.
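For context, SacreBLEU scores of this kind are typically computed with the sacrebleu package. The sketch below shows a plain corpus-level computation; the hypothesis and reference strings are placeholders, and the authors' exact evaluation settings are not documented here:

    import sacrebleu  # pip install sacrebleu

    hypotheses = ["the weather is nice today"]    # model outputs (placeholder)
    references = [["today the weather is nice"]]  # one reference stream, aligned with hypotheses
    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"SacreBLEU: {bleu.score:.1f}")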
Guide: Running Locally
To run the model locally, follow these steps:

- Install dependencies: use the transformers library (version 4.11.0.dev0 tested) and install the normalizer package for text preprocessing:

    pip install git+https://github.com/csebuetnlp/normalizer

- Load the model and tokenizer:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    from normalizer import normalize

    model = AutoModelForSeq2SeqLM.from_pretrained("csebuetnlp/banglat5_nmt_bn_en")
    tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglat5_nmt_bn_en", use_fast=False)

- Normalize the input and generate the translation:

    input_sentence = ""  # fill in a Bengali sentence
    input_ids = tokenizer(normalize(input_sentence), return_tensors="pt").input_ids
    generated_tokens = model.generate(input_ids)
    decoded_tokens = tokenizer.batch_decode(generated_tokens)[0]
    print(decoded_tokens)
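The generate() call above uses the library's default decoding settings. As an illustrative variation, beam search and a length cap can be set via standard transformers generation arguments; the values below are assumptions, not the authors' settings:

    generated_tokens = model.generate(
        input_ids,
        num_beams=5,         # beam search instead of the default decoding (illustrative value)
        max_new_tokens=128,  # cap the translation length (illustrative value)
    )
    decoded_tokens = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]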
Cloud GPUs: For enhanced performance, consider running the model on cloud-based GPUs such as those offered by AWS, Google Cloud, or Azure.
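On such a GPU instance, the snippet from the guide runs unchanged once the model and inputs are moved to the device; a minimal sketch using standard PyTorch device placement:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    input_ids = input_ids.to(device)  # input_ids as built in the guide above
    generated_tokens = model.generate(input_ids)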
License
The model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). This license allows sharing and adaptation of the material for non-commercial purposes, provided appropriate credit is given and any derivatives are licensed under identical terms.