t5-small-paraphrase-ro
Introduction
The t5-small-paraphrase-ro model by BlackKakapo is a fine-tuned version of T5 for generating paraphrases in Romanian. It was developed because no dedicated Romanian paraphrasing dataset existed, which led to the creation of a custom dataset of approximately 60,000 examples.
Architecture
The model uses the T5 architecture in its small configuration, fine-tuned for text-to-text generation in Romanian. It runs with the Transformers library and PyTorch: given a Romanian sentence as input, it generates an alternative phrasing of that sentence as output.
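Because the model follows the standard text-to-text interface, it can also be exercised through the generic Transformers text2text-generation pipeline. The snippet below is a minimal sketch of that usage (the pipeline call and generation settings are illustrative and not taken from the model card; the example sentence is the one used later in the guide):

```python
from transformers import pipeline

# Load the paraphrasing model through the generic text-to-text pipeline.
paraphraser = pipeline(
    "text2text-generation",
    model="BlackKakapo/t5-small-paraphrase-ro",
)

# Generate a paraphrase for a Romanian input sentence.
result = paraphraser("Am impresia că fac multe greșeli.", max_length=256)
print(result[0]["generated_text"])
```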
Training
The model was trained on a custom dataset created by BlackKakapo specifically for Romanian paraphrasing. The dataset contains around 60,000 examples and is published on Hugging Face Datasets. Training consisted of fine-tuning the T5-small checkpoint to adapt it to the paraphrasing task.
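The model card does not include the training script, but a fine-tune of this kind typically follows the standard Transformers Seq2SeqTrainer recipe. The sketch below illustrates that recipe under stated assumptions: the dataset identifier, column names, and hyperparameters are placeholders, not values confirmed by the source.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

# Hypothetical dataset identifier and column names -- replace with the
# actual Romanian paraphrase dataset published by the author.
dataset = load_dataset("BlackKakapo/paraphrase-ro")  # assumption
SOURCE_COL, TARGET_COL = "source", "target"          # assumption

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def preprocess(batch):
    # Tokenize the input sentences and their paraphrases (used as labels).
    model_inputs = tokenizer(batch[SOURCE_COL], max_length=256, truncation=True)
    labels = tokenizer(text_target=batch[TARGET_COL], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-paraphrase-ro",
    learning_rate=3e-4,              # illustrative hyperparameters
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```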
Guide: Running Locally
To run the model locally, follow these steps:
- Install the Transformers library. Make sure the package is installed:

  ```bash
  pip install transformers
  ```
- Load the model and tokenizer:

  ```python
  from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

  tokenizer = AutoTokenizer.from_pretrained("BlackKakapo/t5-small-paraphrase-ro")
  model = AutoModelForSeq2SeqLM.from_pretrained("BlackKakapo/t5-small-paraphrase-ro")
  ```
- Generate paraphrases (a reusable wrapper around this snippet is sketched after these steps):

  ```python
  import torch

  # Run on GPU if one is available, otherwise on CPU.
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model = model.to(device)

  text = "Am impresia că fac multe greșeli."

  # Tokenize the input and move the tensors to the same device as the model.
  encoding = tokenizer(text, padding=True, return_tensors="pt")
  input_ids = encoding["input_ids"].to(device)
  attention_mask = encoding["attention_mask"].to(device)

  # Sample several candidate paraphrases.
  outputs = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      do_sample=True,
      max_length=256,
      top_k=10,
      top_p=0.9,
      early_stopping=False,
      num_return_sequences=5,
  )

  # Keep the first candidate that differs from the input sentence.
  final_outputs = []
  for output in outputs:
      text_para = tokenizer.decode(
          output, skip_special_tokens=True, clean_up_tokenization_spaces=True
      )
      if text.lower() != text_para.lower():
          final_outputs.append(text_para)
          break

  print(final_outputs)  # Example output: ['Cred că fac multe greșeli.']
  ```
- Consider using cloud GPUs. For heavier workloads, cloud GPUs such as those offered by AWS, Google Cloud, or Azure can speed up generation.
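As a follow-up to the steps above, the generation logic can be wrapped in a small helper so it is easy to reuse. This wrapper is a convenience sketch, not part of the original model card; the function name and defaults are arbitrary, and it assumes `tokenizer`, `model`, and `device` were set up as in the previous steps.

```python
def paraphrase(text, num_candidates=5, max_length=256):
    """Return sampled Romanian paraphrases of `text` that differ from the input."""
    encoding = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        input_ids=encoding["input_ids"].to(device),
        attention_mask=encoding["attention_mask"].to(device),
        do_sample=True,
        max_length=max_length,
        top_k=10,
        top_p=0.9,
        num_return_sequences=num_candidates,
    )
    decoded = [
        tokenizer.decode(o, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        for o in outputs
    ]
    # Drop candidates that merely repeat the input sentence.
    return [p for p in decoded if p.lower() != text.lower()]

print(paraphrase("Am impresia că fac multe greșeli."))
```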
License
The t5-small-paraphrase-ro model is licensed under the Apache 2.0 License, which permits use, modification, and distribution.