UrukHan/t5-russian-spell
Introduction
T5-RUSSIAN-SPELL is a T5-based model for correcting text transcribed from audio. It is designed to post-process the output of the UrukHan/wav2vec2-russian
model, which transcribes Russian speech, and improves the accuracy of the transcribed text.
Architecture
The model is based on the T5 architecture, which is a transformer model known for its flexibility in text-to-text tasks. It has been fine-tuned specifically for the task of spell correction in the Russian language.
Training
The model was trained on datasets tailored for spell correction.
The training process used the Seq2SeqTrainer from Hugging Face's Transformers library together with the Adafactor optimizer.
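The training setup described above can be sketched roughly as follows. This is a hedged configuration sketch, not the author's actual training script: the output directory, batch size, epoch count, and optimizer hyperparameters are placeholders, and the dataset wiring is elided.

```python
# Hypothetical sketch of a Seq2SeqTrainer + Adafactor setup (placeholder values).
from transformers import (AutoModelForSeq2SeqLM, T5TokenizerFast,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)
from transformers.optimization import Adafactor

MODEL_NAME = 'UrukHan/t5-russian-spell'
tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Adafactor with a fixed learning rate, a common choice for T5 fine-tuning
optimizer = Adafactor(model.parameters(), scale_parameter=False,
                      relative_step=False, warmup_init=False, lr=1e-3)

args = Seq2SeqTrainingArguments(
    output_dir='t5-russian-spell-finetune',  # placeholder path
    per_device_train_batch_size=8,           # placeholder value
    num_train_epochs=1,                      # placeholder value
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    optimizers=(optimizer, None),  # Adafactor, no external LR scheduler
    # train_dataset=...,  # tokenized (noisy text, corrected text) pairs
)
# trainer.train()
```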
Guide: Running Locally
Installation Steps
- Install the necessary libraries:

  !pip install transformers datasets sentencepiece rouge_score
  !apt install git-lfs
- Import libraries and log in to Hugging Face:

  from transformers import AutoModelForSeq2SeqLM, T5TokenizerFast
  from huggingface_hub import notebook_login

  notebook_login()  # Log in to your Hugging Face account
- Load the model and tokenizer:

  MODEL_NAME = 'UrukHan/t5-russian-spell'
  tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
  model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
- Prepare input data (the sample sequences are deliberately misspelled Russian):

  input_sequences = ['сеглдыя хорош ден', 'когд а вы прдет к нам в госи']
  task_prefix = "Spell correct: "
  encoded = tokenizer(
      [task_prefix + seq for seq in input_sequences],
      padding="longest",
      max_length=256,
      truncation=True,
      return_tensors="pt",
  )
- Generate predictions:

  predicts = model.generate(encoded.input_ids)
  tokenizer.batch_decode(predicts, skip_special_tokens=True)
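The steps above can be wrapped into a single helper. A minimal sketch, assuming a model and tokenizer loaded as shown earlier; the function name `correct_spelling` is illustrative, not part of the model's API:

```python
def correct_spelling(texts, model, tokenizer,
                     prefix="Spell correct: ", max_length=256):
    """Prefix, tokenize, generate, and decode a batch of noisy strings."""
    encoded = tokenizer(
        [prefix + t for t in texts],
        padding="longest",
        max_length=max_length,
        truncation=True,
        return_tensors="pt",
    )
    predicts = model.generate(encoded.input_ids)
    return tokenizer.batch_decode(predicts, skip_special_tokens=True)
```

Usage, with the model and tokenizer from the loading step: `corrected = correct_spelling(['сеглдыя хорош ден'], model, tokenizer)`.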
Suggestion for Cloud GPUs
For efficient training and inference, consider using cloud-based GPU services such as Google Colab, AWS EC2 with GPU instances, or Azure Machine Learning.
License
The model is available under the terms specified by UrukHan on the Hugging Face platform. Please refer to the model's page for detailed licensing information.