GPT-2 Turkish Writer

gorkemgoknar

Introduction

The GPT-2 Turkish Writer is a fine-tuned model based on GPT-2 Small. It has been trained with a dataset that includes Turkish Wikipedia articles and over 400 classic novels and plays in Turkish. The model is designed for text generation tasks in the Turkish language.
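
As a quick illustration of the model's intended use, the transformers pipeline API can wrap it in a single call. The snippet below is a minimal sketch; the prompt and sampling settings are arbitrary examples rather than part of the original model card.

    from transformers import pipeline

    # Wrap the fine-tuned Turkish model in a text-generation pipeline.
    generator = pipeline("text-generation", model="gorkemgoknar/gpt2-turkish-writer")

    # "Bir varmış bir yokmuş" is the classic Turkish fairy-tale opening ("once upon a time").
    result = generator("Bir varmış bir yokmuş", max_length=50, do_sample=True, top_k=40)
    print(result[0]["generated_text"])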

Architecture

The model builds on the GPT-2 Small architecture and is fine-tuned specifically for the Turkish language. Because Turkish differs substantially from English, more layers are retrained than would typically be needed when adapting GPT-2 to languages closer to English.
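
The card does not enumerate exactly which layers were unfrozen, but in PyTorch/transformers a partial fine-tune of this kind is usually set up by freezing most of the network. The sketch below assumes the last three transformer blocks (see the Training section) plus the final layer norm are the trainable parts; treat it as illustrative rather than the exact training setup.

    from transformers import AutoModelForCausalLM

    # Start from the pretrained GPT-2 Small checkpoint.
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Freeze every parameter, then unfreeze only the last three transformer
    # blocks and the final layer norm so just those weights are updated.
    for param in model.parameters():
        param.requires_grad = False
    for module in list(model.transformer.h[-3:]) + [model.transformer.ln_f]:
        for param in module.parameters():
            param.requires_grad = True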

Training

The training dataset combines a Turkish Wikipedia article dump (as of October 28, 2020) with a collection of Turkish literature, including works by Dostoevsky, Shakespeare, and Dumas. Training fine-tuned the last three layers of the GPT-2 model using Fastai 2.X on Google Colab. Reported evaluation metrics are an accuracy of 36.3% and a perplexity of 44.75.
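
For reference, perplexity is the exponential of the mean per-token cross-entropy loss, so the reported perplexity of 44.75 corresponds to a validation loss of roughly 3.80 nats per token:

    import math

    # Perplexity is exp(mean per-token cross-entropy loss), so the reported
    # perplexity of 44.75 corresponds to a loss of about 3.80 nats per token.
    loss = math.log(44.75)
    print(round(loss, 2), round(math.exp(loss), 2))  # 3.8 44.75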

Guide: Running Locally

To run the GPT-2 Turkish Writer model locally, follow these steps:

  1. Install Dependencies:
    Ensure you have Python and PyTorch installed. Install the transformers library from Hugging Face.

    pip install transformers torch
    
  2. Load the Model:
    Use the following code snippet to load the model and tokenizer.

    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch
    
    # Load the tokenizer and causal language model from the Hugging Face Hub.
    # AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead class.
    tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-turkish-writer")
    model = AutoModelForCausalLM.from_pretrained("gorkemgoknar/gpt2-turkish-writer")
    
    # GPT-2 Small has a 1024-token context window; switch to inference mode.
    tokenizer.model_max_length = 1024
    model.eval()
    
  3. Generate Text:
    Provide an input prompt and generate a continuation using top-k sampling, as in the snippet below.

    # Encode the prompt and sample a continuation with top-k sampling.
    # pad_token_id=50256 reuses GPT-2's end-of-text token, since the model has no pad token.
    text = "Bu yazıyı bilgisayar yazdı."
    inputs = tokenizer(text, return_tensors="pt")
    sample_outputs = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask,
                                    pad_token_id=50256, do_sample=True, max_length=50,
                                    top_k=40, num_return_sequences=1)
    
    for i, sample_output in enumerate(sample_outputs):
        print(">> Generated text {}\n\n{}".format(i + 1, tokenizer.decode(sample_output, skip_special_tokens=True)))
    
  4. Hardware Suggestions:
    For optimal performance, consider using a cloud service with GPU support, such as Google Colab or AWS EC2 with GPU instances.
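    If a GPU is available, moving the model and inputs onto it speeds up generation considerably. A minimal sketch, continuing from the loading step above:

    import torch

    # Use a CUDA GPU when available; otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    # Inputs must be on the same device as the model before calling generate().
    inputs = tokenizer("Bu yazıyı bilgisayar yazdı.", return_tensors="pt").to(device)
    sample_outputs = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask,
                                    pad_token_id=50256, do_sample=True, max_length=50, top_k=40)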

License

The GPT-2 Turkish Writer model is made available under the Apache 2.0 License. This allows for both personal and commercial use, modification, and distribution, provided that appropriate credit is given and any modifications are documented.
