Introduction

The RUT5-SMALL model is a Russian paraphrasing model based on Google's mt5-small architecture. It offers a compact solution for paraphrasing tasks; out of the box its quality is limited, but it can be improved by fine-tuning for specific applications.

Architecture

The model is a streamlined version of the alenusch/mt5small-ruparaphraser. It reduces the original model's size by focusing on Russian-related vocabulary. The vocabulary was shrunk from 250K to 20K tokens, reducing the model's parameters from 300M to 65M and the size from 1.1GB to 246MB. The first 5K tokens come from the original mt5-small, while the next 15K are the most frequent tokens from the Leipzig Russian web corpus.
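The arithmetic behind the reported size reduction can be checked with a back-of-the-envelope calculation. This sketch assumes mt5-small's hidden size of 512 and untied input/output embeddings (so each vocabulary entry costs two rows of 512 weights); the round vocabulary figures are approximations, and the backbone size is inferred from the totals rather than taken from the model config.

```python
d_model = 512  # mt5-small hidden size

def embedding_params(vocab_size: int) -> int:
    # Input embedding plus a separate LM head, each vocab_size x d_model.
    return 2 * vocab_size * d_model

# Non-embedding (transformer backbone) parameters, inferred from the 300M total:
backbone = 300_000_000 - embedding_params(250_000)   # ~44M
pruned_total = backbone + embedding_params(20_000)
print(f"~{pruned_total / 1e6:.0f}M parameters after pruning")
```

The result lands at roughly 64-65M, consistent with the 65M figure above: almost all of the savings come from the embedding tables, while the transformer layers are untouched.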

Training

The RUT5-SMALL model keeps the architecture of mt5-small; the only change is the vocabulary reduction described above, which trims the embedding tables down to Russian-focused tokens. This adaptation retains essential Russian vocabulary while improving efficiency.
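No pruning script is referenced in this card, but the core operation is straightforward: keep only the embedding rows for the retained token ids and remap ids densely. A minimal plain-Python sketch (a hypothetical `prune_vocab` helper on a toy list-based table; the real model stores embeddings as tensors):

```python
def prune_vocab(embedding, keep_ids):
    """Keep only the rows for retained token ids and remap ids densely."""
    keep = sorted(set(keep_ids))
    new_embedding = [embedding[i] for i in keep]
    old_to_new = {old: new for new, old in enumerate(keep)}
    return new_embedding, old_to_new

# Toy 10-token, 4-dimensional embedding table
emb = [[float(i)] * 4 for i in range(10)]
new_emb, remap = prune_vocab(emb, [7, 0, 3])
print(len(new_emb), remap[7])  # 3 rows kept; old id 7 maps to new id 2
```

The same remapping must be applied consistently to the tokenizer's vocabulary so that token ids and embedding rows stay aligned.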

Guide: Running Locally

  1. Install Required Packages:

    pip install transformers sentencepiece
    
  2. Load the Model and Tokenizer:

    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer
    
    tokenizer = T5Tokenizer.from_pretrained("cointegrated/rut5-small")
    model = T5ForConditionalGeneration.from_pretrained("cointegrated/rut5-small")
    
  3. Generate Paraphrases:

    text = 'Ехал Грека через реку, видит Грека в реке рак.'
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        hypotheses = model.generate(
            **inputs,
            do_sample=True, top_p=0.95,   # nucleus sampling for varied paraphrases
            num_return_sequences=10,      # return 10 candidate paraphrases
            repetition_penalty=2.5,       # discourage copying the input verbatim
            max_length=32,
        )
    for h in hypotheses:
        print(tokenizer.decode(h, skip_special_tokens=True))
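The `top_p=0.95` setting above enables nucleus sampling: the model samples only from the smallest set of tokens whose cumulative probability reaches `p`, cutting off the low-probability tail. A toy illustration of the idea (a hypothetical `top_p_sample` helper over a hand-written distribution, not the transformers implementation):

```python
import random

def top_p_sample(probs, p=0.95, rng=random.Random(0)):
    """Sample a token id from the smallest set whose cumulative prob >= p."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for token_id, prob in ranked:
        nucleus.append((token_id, prob))
        total += prob
        if total >= p:
            break
    mass = sum(pr for _, pr in nucleus)  # renormalize over the nucleus
    return rng.choices(
        [t for t, _ in nucleus],
        weights=[pr / mass for _, pr in nucleus],
    )[0]

# Only tokens 0 and 1 (cumulative 0.75 >= 0.7) can ever be drawn here:
print(top_p_sample([0.5, 0.25, 0.15, 0.1], p=0.7))
```

Combined with `num_return_sequences=10`, this sampling is what makes each run produce a different set of candidate paraphrases.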
    

Suggested Cloud GPUs: at 65M parameters the model runs comfortably on CPU, but for batch inference or fine-tuning, GPU instances on cloud platforms such as AWS EC2, Google Cloud, or Azure provide a significant speedup.

License

The RUT5-SMALL model is distributed under the MIT license, allowing for broad use, modification, and distribution with appropriate credit.
