t5-small-paraphrase-ro

BlackKakapo

Introduction

The t5-small-paraphrase-ro model by BlackKakapo is a fine-tuned version of the T5 model for generating paraphrases in Romanian. It was developed due to the absence of a dedicated Romanian paraphrasing dataset, which led to the creation of a custom dataset consisting of approximately 60,000 examples.

Architecture

This model uses the T5 architecture, fine-tuned for text-to-text generation in Romanian. It is built on the Transformers library with PyTorch and generates alternative phrasings of Romanian input text.

Training

The model was trained on a custom dataset created by BlackKakapo specifically for Romanian paraphrasing. The dataset contains around 60,000 examples and is available on Hugging Face Datasets. Training consisted of fine-tuning the T5-small model to adapt it to the paraphrasing task.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install the Transformers Library:
    Ensure you have the Transformers library installed:

    pip install transformers
    
  2. Load the Model and Tokenizer:

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    
    tokenizer = AutoTokenizer.from_pretrained("BlackKakapo/t5-small-paraphrase-ro")
    model = AutoModelForSeq2SeqLM.from_pretrained("BlackKakapo/t5-small-paraphrase-ro")
    
  3. Generate Paraphrases:

    text = "Am impresia că fac multe greșeli."
    
    # Use the model's device so input tensors and weights stay on the same device
    device = next(model.parameters()).device
    
    encoding = tokenizer(text, padding=True, return_tensors="pt")
    input_ids = encoding["input_ids"].to(device)
    attention_masks = encoding["attention_mask"].to(device)
    
    beam_outputs = model.generate(
        input_ids=input_ids, 
        attention_mask=attention_masks,
        do_sample=True,
        max_length=256,
        top_k=10,
        top_p=0.9,
        early_stopping=False,
        num_return_sequences=5
    )
    
    final_outputs = []
    for beam_output in beam_outputs:
        text_para = tokenizer.decode(beam_output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    
        # Keep the first candidate that differs from the input
        if text.lower() != text_para.lower():
            final_outputs.append(text_para)
            break
    
    print(final_outputs)  # Example Output: ['Cred că fac multe greșeli.']
    
  4. Consider Using Cloud GPUs:
    For intensive tasks, it's advisable to utilize cloud-based GPUs such as those provided by AWS, Google Cloud, or Azure to enhance performance.
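Whether a GPU is available can be checked directly in PyTorch. The snippet below is a minimal sketch of the usual device-selection idiom: pick CUDA when present, otherwise fall back to CPU, and move the model to that device before generating.

```python
import torch

# Select a CUDA GPU when one is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The model and its input tensors must live on the same device,
# e.g. call model.to(device) before model.generate(...)
print(device)
```

On a cloud GPU instance this prints `cuda`; on a CPU-only machine it prints `cpu`, and the same code runs unchanged in both cases.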

License

The t5-small-paraphrase-ro model is licensed under the Apache 2.0 License, permitting usage, modification, and distribution.
