bolbolzaban/gpt2-persian
Introduction
bolbolzaban/gpt2-persian is a GPT-2 language model tailored for the Persian language. It is designed with hyperparameters similar to the standard GPT-2 medium model, with modifications to suit Persian text processing.
Architecture
The model differs from the standard GPT-2 in the following ways:
- Context Size: Reduced from 1024 to 256 subwords to make training more affordable.
- Tokenization: Uses the Google SentencePiece tokenizer instead of Byte Pair Encoding (BPE).
- Dataset: Trained exclusively on Persian text, with non-Persian content replaced by special tokens such as [LAT], [URL], and [NUM] (see the sketch below).
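To make that normalization concrete, here is a hypothetical sketch of such a replacement pass. The regexes and the normalize function are illustrative assumptions, not the authors' actual preprocessing code:

```python
import re

# Hypothetical preprocessing in the spirit described above: URLs, Latin-script
# words, and numbers are mapped to the special tokens the model was trained on.
# A single alternation pattern avoids re-matching already-inserted tokens.
PATTERN = re.compile(r'https?://\S+|[A-Za-z]+|\d+')

def normalize(text: str) -> str:
    def repl(match: re.Match) -> str:
        token = match.group(0)
        if token.startswith('http'):
            return '[URL]'
        if token[0].isdigit():
            return '[NUM]'
        return '[LAT]'
    return PATTERN.sub(repl, text)

print(normalize('این مدل در سال 2020 روی GPU آموزش دید'))
# -> 'این مدل در سال [NUM] روی [LAT] آموزش دید'
```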
Training
The model is trained on Persian text with the primary goal of exploring applications in Persian poetry. English words and numbers are replaced with the special tokens above, so only the standard Persian alphabet appears in the training data.
Guide: Running Locally
To use the model locally, follow these steps:
- Install the transformers library:

```bash
pip install transformers
```

- Use the model for text generation:

```python
from transformers import pipeline, AutoTokenizer, GPT2LMHeadModel

# Load the tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian')
model = GPT2LMHeadModel.from_pretrained('bolbolzaban/gpt2-persian')

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
# max_length of 256 matches the model's context size.
sample = generator('در یک اتفاق شگفت انگیز، پژوهشگران', max_length=256)
print(sample)
```

- For TensorFlow users, replace GPT2LMHeadModel with TFGPT2LMHeadModel, as sketched below.
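For reference, a minimal TensorFlow variant of the snippet above (this assumes TensorFlow weights are available for this checkpoint; if only PyTorch weights exist, from_pretrained would need from_pt=True):

```python
from transformers import pipeline, AutoTokenizer, TFGPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian')
# TFGPT2LMHeadModel is the TensorFlow counterpart of GPT2LMHeadModel.
model = TFGPT2LMHeadModel.from_pretrained('bolbolzaban/gpt2-persian')
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
sample = generator('در یک اتفاق شگفت انگیز، پژوهشگران', max_length=256)
```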
Cloud GPUs
Consider using a cloud GPU provider such as Google Colab, AWS, or Azure for faster inference and fine-tuning.
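As one concrete option, on a Colab notebook or any machine with a CUDA GPU, the pipeline can be placed on the GPU via its device argument. This is a minimal sketch; GPU availability depends on the environment:

```python
import torch
from transformers import pipeline

# Use the first CUDA GPU if one is available, otherwise fall back to CPU (-1).
device = 0 if torch.cuda.is_available() else -1
generator = pipeline('text-generation', model='bolbolzaban/gpt2-persian', device=device)
sample = generator('در یک اتفاق شگفت انگیز، پژوهشگران', max_length=256)
```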
License
The model is licensed under the Apache 2.0 License.