BERT2BERT Turkish Paraphrase Generation
Author: ahmetbagci

Introduction
The BERT2BERT-TURKISH-PARAPHRASE-GENERATION model is designed for generating paraphrases in Turkish. It uses a BERT-based encoder-decoder architecture and is implemented with the PyTorch library.
Architecture
This model employs an encoder-decoder architecture, a common structure for text-to-text generation tasks. A BERT model serves as the encoder and another BERT model as the decoder, so paraphrasing is handled as a sequence-to-sequence (seq2seq) problem.
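For context on how such "bert2bert" models are typically assembled, the sketch below shows the standard warm-starting pattern from the transformers library. Using dbmdz/bert-base-turkish-cased for both encoder and decoder matches the tokenizer used later in this guide, but it is an assumption about the author's setup, not a confirmed training detail.

```python
from transformers import EncoderDecoderModel

# Warm-start a bert2bert model: one BERT checkpoint initializes the encoder,
# another (here the same one) initializes the decoder. The decoder gains a
# causal attention mask and freshly initialized cross-attention layers.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "dbmdz/bert-base-turkish-cased",  # encoder (assumed checkpoint)
    "dbmdz/bert-base-turkish-cased",  # decoder (assumed checkpoint)
)
```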
Training
The model was trained on a dataset that combines a Turkish translation of the Quora Question Pairs (QQP) dataset with a manually generated dataset. This combination is intended to improve the quality and accuracy of Turkish paraphrase generation. The training dataset is linked from the original model card.
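The training script itself is not part of this documentation. As a hedged illustration, here is what a single supervised fine-tuning step for a warm-started bert2bert model looks like with the transformers library; the paraphrase pair is hypothetical, and the checkpoint choice is the same assumption as above.

```python
from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("dbmdz/bert-base-turkish-cased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "dbmdz/bert-base-turkish-cased", "dbmdz/bert-base-turkish-cased"
)
# EncoderDecoderModel needs these set explicitly before seq2seq training.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Hypothetical source/target pair; the real data combines translated QQP
# with the manually generated set described above.
source = "son model arabalar çevreye daha mı az zarar veriyor?"
target = "yeni arabalar çevre için daha mı az zararlı?"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# With `labels` given, the model builds shifted decoder inputs internally
# and returns a cross-entropy loss suitable for backpropagation.
loss = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    labels=labels,
).loss
loss.backward()
```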
Guide: Running Locally
To use the BERT2BERT-TURKISH-PARAPHRASE-GENERATION model locally, follow these steps:
- Install the Transformers library:

  Ensure that the `transformers` library is installed using pip:

  ```bash
  pip install transformers
  ```
- Load the tokenizer and model:

  Use the following Python script to load the tokenizer and model:

  ```python
  from transformers import BertTokenizerFast, EncoderDecoderModel

  tokenizer = BertTokenizerFast.from_pretrained("dbmdz/bert-base-turkish-cased")
  model = EncoderDecoderModel.from_pretrained("ahmetbagci/bert2bert-turkish-paraphrase-generation")
  ```
- Generate paraphrases:

  Prepare your input text and generate a paraphrase as shown; a tuned variant of this call appears in the sketch after this list:

  ```python
  # "Do the latest model cars do less harm to the environment?"
  text = "son model arabalar çevreye daha mı az zarar veriyor?"
  input_ids = tokenizer(text, return_tensors="pt").input_ids
  output_ids = model.generate(input_ids)
  print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
  ```
- Cloud GPUs:

  For faster inference, consider cloud-based GPU services such as AWS EC2, Google Cloud Platform, or Azure.
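Building on the steps above, the sketch below adds two optional refinements: moving the model to a GPU (useful on the cloud instances just mentioned) and enabling beam search during generation. The specific settings (max_length=64, num_beams=5, and so on) are illustrative assumptions, not values recommended by the model author.

```python
import torch

# Use a GPU when available (e.g., on a cloud instance); fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

text = "son model arabalar çevreye daha mı az zarar veriyor?"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

# Beam search with illustrative settings; tune these on your own data.
output_ids = model.generate(
    input_ids,
    max_length=64,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```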
License
No license is specified in this documentation. Check the model repository or contact the author for the applicable licensing terms before using the model or its associated data.