t5 russian spell

UrukHan

Introduction

T5-RUSSIAN-SPELL is a T5-based model for correcting spelling errors in Russian text transcribed from audio. It is designed to work with the output of the UrukHan/wav2vec2-russian model, which transcribes Russian audio, and improves the accuracy of the resulting text.

Architecture

The model is based on T5, an encoder-decoder transformer that frames every task as text-to-text generation. It has been fine-tuned specifically for spelling correction in Russian.

Training

The model was trained on datasets tailored for spelling correction. Training used the Seq2SeqTrainer from Hugging Face's Transformers library with the Adafactor optimizer.
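
As a rough illustration of that setup, the sketch below prepares source/target pairs and builds trainer arguments that select Adafactor. It is not the author's training script: the helper names `build_pairs` and `make_training_args`, the output directory, and every hyperparameter value are placeholders, and it assumes the training inputs carried the same "Spell correct: " prefix used at inference, which the source does not confirm.

```python
from typing import Dict, List

TASK_PREFIX = "Spell correct: "  # assumption: same prefix as in the inference guide


def build_pairs(noisy: List[str], clean: List[str]) -> Dict[str, List[str]]:
    """Pair each prefixed noisy sentence with its corrected target."""
    if len(noisy) != len(clean):
        raise ValueError("noisy and clean must have the same length")
    return {"source": [TASK_PREFIX + s for s in noisy], "target": list(clean)}


def make_training_args(output_dir: str = "t5-russian-spell-finetuned"):
    """Build Seq2SeqTrainer arguments with Adafactor (all values are placeholders)."""
    # Imported lazily because transformers training utilities require torch.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(
        output_dir=output_dir,
        optim="adafactor",            # selects the Adafactor optimizer
        learning_rate=1e-3,           # placeholder value
        per_device_train_batch_size=8,
        num_train_epochs=1,
        predict_with_generate=True,   # decode with generate() during evaluation
    )
```

The object returned by `make_training_args()` would be passed as `args` to `Seq2SeqTrainer`, together with the model, a tokenized dataset, and a data collator.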

Guide: Running Locally

Installation Steps

  1. Install the necessary libraries:

    # Notebook syntax (e.g. Google Colab); drop the leading "!" in a regular shell
    !pip install transformers datasets sentencepiece rouge_score
    !apt install git-lfs
    
  2. Import libraries and log in to Hugging Face:

    from transformers import AutoModelForSeq2SeqLM, T5TokenizerFast
    from huggingface_hub import notebook_login
    
    # Log in to your Hugging Face account; downloading a public model works without this step
    notebook_login()
    
  3. Load the model and tokenizer:

    MODEL_NAME = 'UrukHan/t5-russian-spell'
    tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
    
  4. Prepare input data:

    input_sequences = ['сеглдыя хорош ден', 'когд а вы прдет к нам в госи']
    task_prefix = "Spell correct: "
    encoded = tokenizer(
      [task_prefix + seq for seq in input_sequences],
      padding="longest",
      max_length=256,
      truncation=True,
      return_tensors="pt",
    )
    
  5. Generate predictions:

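    # Unpack the encoding so the attention mask is passed along with the input IDs;
    # otherwise padding tokens in the shorter sequence are treated as real input
    predicts = model.generate(**encoded)
    tokenizer.batch_decode(predicts, skip_special_tokens=True)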
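
Steps 4 and 5 can be wrapped into one reusable helper that processes long lists in batches. The following is a minimal sketch, not part of the original model card: the names `batched` and `correct_spelling`, the batch size, and the `max_new_tokens` value are assumptions chosen for illustration. It expects the `tokenizer` and `model` objects loaded in step 3.

```python
from typing import Iterator, List

TASK_PREFIX = "Spell correct: "  # prefix from step 4


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def correct_spelling(texts, tokenizer, model, batch_size=8, max_new_tokens=256):
    """Return corrected versions of `texts` (hypothetical helper)."""
    import torch  # imported lazily so `batched` stays usable without torch

    # Run on whatever device the model already lives on (CPU or GPU).
    device = next(model.parameters()).device
    corrected = []
    for batch in batched(list(texts), batch_size):
        encoded = tokenizer(
            [TASK_PREFIX + text for text in batch],
            padding="longest",
            max_length=256,
            truncation=True,
            return_tensors="pt",
        ).to(device)
        with torch.no_grad():  # inference only, no gradients needed
            output_ids = model.generate(**encoded, max_new_tokens=max_new_tokens)
        corrected.extend(
            tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        )
    return corrected
```

With the objects from step 3, a call would look like `correct_spelling(input_sequences, tokenizer, model)`. Moving the model to a GPU beforehand with `model.to("cuda")` is enough for the helper to run there, since it reads the device from the model's parameters.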
    

Suggestion for Cloud GPUs

For efficient training and inference, consider using cloud-based GPU services such as Google Colab, AWS EC2 with GPU instances, or Azure Machine Learning.

License

The model is available under the terms specified by UrukHan on the Hugging Face platform. Please refer to the model's page for detailed licensing information.
