opus-mt-tc-big-en-es
Introduction
The OPUS-MT-TC-BIG-EN-ES model is a neural machine translation model developed by Helsinki-NLP for translating text from English to Spanish. It is part of the OPUS-MT project, which aims to make translation models openly available for a wide range of languages. The model was originally trained with the Marian NMT framework and later converted to PyTorch using Hugging Face's Transformers library.
Architecture
The model is based on the "transformer-big" architecture and uses SentencePiece tokenization. The original model was trained on the opusTCv20210807+bt data release, with data sourced from the Tatoeba Challenge; this is reflected in the model label opusTCv20210807+bt_transformer-big.
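As a quick illustration of the SentencePiece tokenization, the sketch below loads the tokenizer that ships with the checkpoint and prints the subword pieces for a sample sentence; the exact pieces depend on the trained vocabulary, so the output shown in the comment is only indicative.

```python
from transformers import MarianTokenizer

# Load the SentencePiece-based tokenizer bundled with the checkpoint
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-tc-big-en-es")

# Inspect how a sentence is split into subword pieces
print(tokenizer.tokenize("Machine translation is surprisingly effective."))
# e.g. ['▁Machine', '▁translation', ...] — actual pieces depend on the trained vocabulary
```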
Training
The training data for this model was sourced from the OPUS project, and the training pipeline followed the procedures of OPUS-MT-train. The BLEU scores on various datasets are as follows:
- Tatoeba-test-v2021-08-07: 57.2
- Newstest2010: 37.6
- Newstest2011: 38.9
- Newstest2012: 39.5
These scores summarize the model's translation quality on the respective benchmark test sets.
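For reference, BLEU scores such as those above are typically computed with the sacrebleu package. The following is a minimal sketch, not the official evaluation pipeline, and uses placeholder hypotheses and references rather than the actual test sets:

```python
import sacrebleu

# Placeholder system outputs and reference translations (not the official test data)
hypotheses = ["Hola, ¿cómo estás?", "Me gusta el café."]
references = [["Hola, ¿cómo estás?", "Me encanta el café."]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)
```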
Guide: Running Locally
To run the model locally:
- Install Transformers: Ensure that the Hugging Face Transformers library is installed in your Python environment, along with the sentencepiece package required by the tokenizer (e.g. `pip install transformers sentencepiece`).
- Load the Model:
```python
from transformers import MarianMTModel, MarianTokenizer

# Download the tokenizer and model weights from the Hugging Face Hub
model_name = "Helsinki-NLP/opus-mt-tc-big-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
```
- Translate Text:
```python
# Tokenize the source text, generate translations, and decode them back to strings
src_text = ["Your text here"]
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))
```
- Use Cloud GPUs: For faster translation and larger batches, consider using cloud-based GPUs such as those provided by AWS, GCP, or Azure; a minimal GPU sketch follows this list.
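The same code can run on a GPU with only minor changes. The sketch below is a minimal example under the assumption that a CUDA-capable device is available; "Your text here" is a placeholder input.

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-tc-big-en-es"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).to(device)

# Move the tokenized batch to the same device as the model before generating
src_text = ["Your text here"]
batch = tokenizer(src_text, return_tensors="pt", padding=True).to(device)

translated = model.generate(**batch)
print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])
```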
License
The OPUS-MT-TC-BIG-EN-ES model is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, allowing for sharing and adaptation with proper attribution.