t5-small-spanish-nahuatl

somosnlp-hackathon-2022

Introduction

The T5-SMALL-SPANISH-NAHUATL model is a T5 Transformer fine-tuned to translate from Spanish to Nahuatl. Nahuatl, the most widely spoken indigenous language in Mexico, is difficult to machine-translate because parallel data is scarce and split across many orthographic and dialectal variants. This model addresses these challenges with the T5 text-to-text prefix training strategy: the base model is first taught Spanish and then fine-tuned on Nahuatl.

Architecture

The model is based on the T5 Transformer architecture, specifically the T5-small variant. T5 casts every task in a text-to-text format: an input sequence is mapped to an output sequence, and a plain-text task prefix tells the model which task to perform, so the same architecture handles translation without any task-specific heads.
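
As a small illustration (the example sentence mirrors the one used later in this card), the task prefix is just ordinary text prepended to the source sentence:

    # sketch of T5's text-to-text convention: the translation direction
    # is expressed as a plain-text prefix on the input string
    task_prefix = 'translate Spanish to Nahuatl: '
    source = 'muchas flores son blancas'
    model_input = task_prefix + source  # fed to the encoder like any other text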

Training

Dataset

The training data combines the Axolotl corpus and the bible-corpus, augmented with additional samples collected online. Because the Axolotl corpus contains misaligned pairs, rows had to be selected carefully, yielding 12,207 high-quality samples; the bible-corpus contributed a further 7,821 samples.
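
The sketch below illustrates the kind of cleanup such misalignments require. The dataset id 'hackathon-pln-es/Axolotl-Spanish-Nahuatl', the column names, and the length-ratio filter are all assumptions for illustration; the card does not specify the actual selection criteria:

    # hypothetical cleanup pass over the Axolotl corpus; dataset id and
    # column names ('sp', 'nah') are assumptions, not from the model card
    from datasets import load_dataset

    axolotl = load_dataset('hackathon-pln-es/Axolotl-Spanish-Nahuatl', split='train')

    def roughly_aligned(row):
        # drop empty rows and pairs whose lengths differ wildly,
        # a cheap proxy for the misalignments mentioned above
        es, nah = row['sp'], row['nah']
        return bool(es) and bool(nah) and 0.3 < len(es) / len(nah) < 3.0

    clean = axolotl.filter(roughly_aligned)
    print(len(clean))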

Model and Training Process

Training involved two stages:

  1. Learning Spanish: the model was first fine-tuned on a Spanish-English Anki dataset of 118,964 text pairs, so it acquired Spanish while retaining the knowledge from its original pretraining.
  2. Learning Nahuatl: starting from the Spanish-English checkpoint, Spanish-Nahuatl pairs were introduced together with 20,000 English-Spanish samples to avoid overfitting.

The training was conducted over 660,000 steps with a batch size of 16 and a learning rate of 2e-5.
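
The sketch below is not the authors' training script; it only shows how the reported hyperparameters (batch size 16, learning rate 2e-5, 660,000 steps) map onto a standard transformers Seq2SeqTrainer setup. The 't5-small' base, the one-pair toy dataset, and the '(reference translation)' placeholder are assumptions standing in for the real corpus:

    # hedged sketch of the fine-tuning setup, assuming a 't5-small'-style base
    from datasets import Dataset
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              DataCollatorForSeq2Seq, Seq2SeqTrainer,
                              Seq2SeqTrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained('t5-small')
    model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

    def tokenize(batch):
        # the task prefix here matches the one used at inference time
        inputs = tokenizer(['translate Spanish to Nahuatl: ' + s for s in batch['es']],
                           truncation=True)
        inputs['labels'] = tokenizer(text_target=batch['nah'], truncation=True)['input_ids']
        return inputs

    toy = Dataset.from_dict({'es': ['muchas flores son blancas'],
                             'nah': ['(reference translation)']})  # placeholder target
    train_dataset = toy.map(tokenize, batched=True, remove_columns=['es', 'nah'])

    args = Seq2SeqTrainingArguments(output_dir='t5-small-spanish-nahuatl',
                                    per_device_train_batch_size=16,
                                    learning_rate=2e-5,
                                    max_steps=660_000)

    Seq2SeqTrainer(model=model, args=args, train_dataset=train_dataset,
                   data_collator=DataCollatorForSeq2Seq(tokenizer, model=model)).train()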

Evaluation

The model was evaluated with the chrF and SacreBLEU metrics on a validation set of 505 Nahuatl sentences. The English-Spanish pretraining stage improved both BLEU and chrF scores and accelerated convergence.
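
A hedged sketch of scoring with these metrics via the evaluate library (pip install evaluate sacrebleu) follows; the placeholder strings stand in for the 505 validation sentences:

    # compute SacreBLEU and chrF the way the card reports them;
    # predictions/references below are placeholders, not the real data
    import evaluate

    bleu = evaluate.load('sacrebleu')
    chrf = evaluate.load('chrf')

    predictions = ['<decoded model output>']        # one entry per validation sentence
    references = [['<reference translation>']]      # one list of references per prediction

    print('BLEU:', bleu.compute(predictions=predictions, references=references)['score'])
    print('chrF:', chrf.compute(predictions=predictions, references=references)['score'])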

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Dependencies (the T5 tokenizer requires SentencePiece):

    pip install transformers sentencepiece
    
  2. Load the Model and Tokenizer:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # download the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
    model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
    tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
    
  3. Translate a Sentence:

    # the task prefix 'translate Spanish to Nahuatl: ' selects the translation direction
    sentence = 'translate Spanish to Nahuatl: muchas flores son blancas'
    input_ids = tokenizer(sentence, return_tensors='pt').input_ids
    outputs = model.generate(input_ids)
    # decode the generated ids back to text, dropping pad and end-of-sequence tokens
    translated_sentence = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    print(translated_sentence)
    

For faster inference, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
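
As a hedged continuation of the snippet in step 3 (assuming PyTorch is installed), the model and inputs can be moved to a GPU when one is available:

    import torch

    # pick a GPU if present, otherwise fall back to the CPU
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = model.to(device)

    input_ids = tokenizer(sentence, return_tensors='pt').input_ids.to(device)
    outputs = model.generate(input_ids)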

License

The T5-SMALL-SPANISH-NAHUATL model is released under the Apache-2.0 license, allowing for free use and distribution with appropriate credit to the original authors.
