Introduction

bolbolzaban/gpt2-persian is a GPT-2 language model trained for Persian. Its hyperparameters match the standard GPT-2 medium model, with a few modifications to suit Persian text processing.

Architecture

The model differs from the standard GPT-2 in the following ways:

  1. Context Size: Reduced from 1024 to 256 subwords to make training more affordable.
  2. Tokenization: Uses the Google SentencePiece tokenizer instead of byte-pair encoding (BPE).
  3. Dataset: Exclusively trained on Persian text, with non-Persian characters replaced by special tokens like [LAT], [URL], and [NUM].
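The exact preprocessing rules used during training are not published here, so the following is only an illustrative sketch of the replacement scheme the list above describes; the regular expressions and function name are assumptions.

```python
import re

# One combined pattern; alternation order matters so that URLs are
# matched before their Latin-script characters are seen individually.
PATTERN = re.compile(r'https?://\S+|[A-Za-z]+|\d+')

def _replace(match):
    s = match.group()
    if s.startswith('http'):
        return '[URL]'   # web addresses
    if s[0].isdigit():
        return '[NUM]'   # numbers
    return '[LAT]'       # runs of Latin letters

def preprocess_persian(text):
    """Illustrative single-pass replacement of non-Persian spans
    with the special tokens mentioned in the model card."""
    return PATTERN.sub(_replace, text)

print(preprocess_persian('در سال 2021 ببینید https://example.com'))
```

A single-pass substitution avoids a subtle bug of sequential passes: replacing Latin letters after inserting `[URL]` would corrupt the placeholder itself, since `URL` is made of Latin letters.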

Training

The model was trained exclusively on Persian text, partly to explore its applications in Persian poetry. Because English words, numbers, and URLs were replaced with special tokens during preprocessing, the training data contains only the standard Persian alphabet.

Guide: Running Locally

To use the model locally, follow these steps:

  1. Install the transformers library:

    pip install transformers
    
  2. Use the model for text generation:

    from transformers import pipeline, AutoTokenizer, GPT2LMHeadModel
    
    # Load the tokenizer and model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian')
    model = GPT2LMHeadModel.from_pretrained('bolbolzaban/gpt2-persian')
    
    # Build a generation pipeline; max_length matches the model's 256-subword context
    generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
    sample = generator('در یک اتفاق شگفت انگیز، پژوهشگران', max_length=256)
    print(sample[0]['generated_text'])
    
  3. For TensorFlow users, replace GPT2LMHeadModel with TFGPT2LMHeadModel.
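Because generation happens over the special-token vocabulary described above, samples may contain [LAT], [URL], or [NUM] placeholders. A minimal clean-up helper is sketched below; the token list comes from this document, while the whitespace handling is an assumption.

```python
import re

SPECIAL_TOKENS = ['[LAT]', '[URL]', '[NUM]']  # placeholders used in training

def clean_output(text):
    """Remove placeholder tokens from a generated sample and
    collapse leftover whitespace (illustrative helper)."""
    for tok in SPECIAL_TOKENS:
        text = text.replace(tok, '')
    return re.sub(r'\s+', ' ', text).strip()
```

This could be applied to a sample from the pipeline above, e.g. `clean_output(sample[0]['generated_text'])`.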

Cloud GPUs

Consider using cloud GPU providers such as Google Colab, AWS, or Azure for enhanced performance during model inference or fine-tuning.

License

The model is licensed under the Apache 2.0 License.
