opus-mt-tc-big-en-ko

Helsinki-NLP

Introduction

The OPUS-MT-TC-BIG-EN-KO model is a neural machine translation model developed by the Language Technology Research Group at the University of Helsinki. It is designed to translate text from English (en) to Korean (ko) and is part of the OPUS-MT project, which aims to provide accessible translation models for various languages. The model is based on Marian NMT and has been converted to PyTorch using the Hugging Face Transformers library.

Architecture

  • Model Type: Translation (transformer-big)
  • Original Model: Trained with Marian NMT and released as a zip archive of the original Marian weights.
  • Conversion: The model has been converted to PyTorch using the Transformers library from Hugging Face.
  • Language: Supports translation from English to Korean.
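To make the "transformer-big" label concrete, the sketch below contrasts the hyperparameters typically associated with the transformer-big preset against transformer-base. These values are assumptions taken from the standard Transformer recipe, not read from this checkpoint; the authoritative values live in the model's config.json on the Hub.

```python
# Typical "transformer-big" vs. "transformer-base" hyperparameters
# (assumed from the standard Transformer recipe, not read from this
# checkpoint -- consult the model's config.json for the exact values).
transformer_base = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "d_model": 512,        # embedding / hidden size
    "attention_heads": 8,
    "ffn_dim": 2048,       # feed-forward inner size
}
transformer_big = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "d_model": 1024,
    "attention_heads": 16,
    "ffn_dim": 4096,
}

# "big" doubles the width (hidden size, heads, FFN) while keeping the depth.
for key in transformer_base:
    print(key, transformer_base[key], "->", transformer_big[key])
```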

Training

  • Data: The training data originates from the OPUS dataset.
  • Pre-processing: Utilizes SentencePiece for data tokenization.
  • Model Type: Transformer-big architecture.
  • Training Scripts: Available on the OPUS-MT-train GitHub repository.
  • Evaluation: Achieved a BLEU score of 13.7 and a chrF score of 0.36399 on the flores101-devtest benchmark.
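To interpret the chrF number above: chrF is a character n-gram F-score between the hypothesis and the reference. The reported score was computed with standard evaluation tooling, not this snippet, but a simplified, self-contained sketch (n-grams up to 6 and beta = 2, the metric's usual settings) looks like this:

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: character n-gram F-score averaged over n = 1..max_n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall twice as heavily as precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(chrf("even numbers", "even numbers"))  # identical strings score 1.0
```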

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Dependencies (MarianTokenizer requires the SentencePiece package):

    pip install transformers sentencepiece
    
  2. Load the Model:

    from transformers import MarianMTModel, MarianTokenizer
    
    src_text = ["2, 4, 6 etc. are even numbers.", "Yes."]
    model_name = "Helsinki-NLP/opus-mt-tc-big-en-ko"
    
    # Download the tokenizer and model weights from the Hugging Face Hub.
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    
    # Tokenize the batch, generate translations, and decode them.
    translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
    for t in translated:
        print(tokenizer.decode(t, skip_special_tokens=True))
    
  3. Using Pipelines:

    from transformers import pipeline
    pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-ko")
    print(pipe("2, 4, 6 etc. are even numbers."))
    
  4. Cloud GPU Recommendation: For faster inference, run the model on a GPU from a cloud provider such as AWS, GCP, or Azure.
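The recommendation above can be sketched as follows (assuming PyTorch is installed). The commented lines show how the model and tokenized batch from step 2 would be moved to the selected device; both must live on the same device before calling generate:

```python
import torch

# Pick a GPU when one is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical usage with the objects from step 2:
#   model = MarianMTModel.from_pretrained(model_name).to(device)
#   batch = tokenizer(src_text, return_tensors="pt", padding=True).to(device)
#   translated = model.generate(**batch)
print(device)
```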

License

The OPUS-MT-TC-BIG-EN-KO model is distributed under the Creative Commons Attribution 4.0 International License (CC-BY-4.0).
