gpt2 large dutch
yhavinga
Introduction
The GPT2-Large-Dutch model is a GPT2 large model with 762 million parameters, specifically trained from scratch on cleaned Dutch mC4 data. It is designed for text generation tasks in Dutch, achieving a perplexity of 15.1.
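For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so a value of 15.1 roughly means the model is as uncertain as a uniform choice among about 15 tokens. The sketch below shows only the arithmetic; the per-token probabilities are made up for illustration and are not taken from the model's evaluation.

```python
import math

def perplexity(token_log_probs):
    """Return exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical per-token probabilities for a four-token sequence.
example_probs = [0.25, 0.10, 0.05, 0.02]
example_ppl = perplexity([math.log(p) for p in example_probs])
print(f"perplexity = {example_ppl:.2f}")

# A model that assigns probability 1/15.1 to every token has perplexity 15.1.
uniform_ppl = perplexity([math.log(1 / 15.1)] * 10)
```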
Architecture
The model uses the GPT2 architecture in its large configuration, with 762 million parameters. Its BPE tokenizer was built specifically for Dutch from the mC4 dataset, using Hugging Face's Transformers library.
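To give a feel for how a BPE vocabulary is learned, here is a toy sketch of the merge loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. It works on characters and a made-up three-word corpus, whereas GPT2-style tokenizers operate on bytes and far larger corpora, so treat it as an illustration of the idea and not as this model's tokenizer training code.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs over a {tuple_of_symbols: frequency} vocabulary."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Tiny made-up Dutch word counts, split into characters.
toy_vocab = {tuple("huis"): 5, tuple("huizen"): 3, tuple("muis"): 2}
learned_merges = []
for _ in range(3):
    pairs = count_pairs(toy_vocab)
    # Pick the most frequent pair; ties are broken by the pair itself.
    best = max(pairs, key=lambda p: (pairs[p], p))
    toy_vocab = merge_pair(toy_vocab, best)
    learned_merges.append(best)

print(learned_merges)  # prints [('u', 'i'), ('h', 'ui'), ('hui', 's')]
```

After three merges the frequent word "huis" has become a single symbol, which is the effect a Dutch-specific vocabulary is meant to achieve for common Dutch words.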
Training
GPT2-Large-Dutch was trained on 33 billion tokens of cleaned Dutch mC4 data. Cleaning removed documents containing inappropriate content, very short sentences, excessively long words, or certain phrases. Training used the AdaFactor optimizer for 1,100,000 steps on compute provided by the Google TPU Research Cloud.
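The cleaning rules described above can be pictured as a per-document filter. The sketch below is one possible interpretation: the word list, phrase list, and thresholds are hypothetical placeholders, and the "very short sentences" rule is approximated by the average sentence length. The actual cleaning pipeline may differ in all of these details.

```python
import re

# All lists and thresholds below are hypothetical placeholders.
BLOCKED_WORDS = {"badword"}
BLOCKED_PHRASES = ("cookies accepteren", "lorem ipsum")
MIN_AVG_WORDS_PER_SENTENCE = 3
MAX_WORD_LENGTH = 40

def keep_document(text):
    """Return True if the document passes all four illustrative filters."""
    lowered = text.lower()
    words = re.findall(r"\w+", lowered)
    # 1. Inappropriate content: any blocked word rejects the document.
    if any(word in BLOCKED_WORDS for word in words):
        return False
    # 2. Certain phrases: any blocked phrase rejects the document.
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        return False
    # 3. Excessively long words.
    if any(len(word) > MAX_WORD_LENGTH for word in words):
        return False
    # 4. Very short sentences, approximated by the average sentence length.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return False
    avg_words = sum(len(s.split()) for s in sentences) / len(sentences)
    return avg_words >= MIN_AVG_WORDS_PER_SENTENCE

docs = [
    "Het eiland ligt in het noorden van het land. Er wonen ongeveer duizend mensen.",
    "Ja. Nee. Ok.",
]
kept = [d for d in docs if keep_document(d)]
```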
Guide: Running Locally
To run the GPT2-Large-Dutch model locally, follow these steps:
- Install the Transformers Library:
  Install the Hugging Face Transformers library, together with PyTorch (required by GPT2LMHeadModel), if not already installed.

      pip install transformers torch

- Load the Model and Tokenizer:
  Use the following Python code to load the model and tokenizer and generate text.

      from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel

      # Model identifier on the Hugging Face Hub
      MODEL_DIR = 'yhavinga/gpt2-large-dutch'

      tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR)
      model = GPT2LMHeadModel.from_pretrained(MODEL_DIR)
      generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

      # Sample a continuation of a Dutch prompt
      generated_text = generator('Het eiland West-', max_length=100, do_sample=True,
                                 top_k=40, top_p=0.95, repetition_penalty=2.0)
      print(generated_text[0]['generated_text'])

- Execution:
  Run the script to generate text based on a Dutch prompt.
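The top_k and top_p arguments in the generation call above restrict which tokens may be sampled at each step. The following is a simplified sketch of that idea on a made-up next-token distribution: keep the top_k most likely tokens, then keep the smallest set of those whose renormalized cumulative probability reaches top_p. It is an illustration written for this guide, not the Transformers implementation, and it leaves out repetition_penalty.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Apply top-k, then nucleus (top-p) filtering, and renormalize.

    `probs` maps token -> probability. Returns a new dict over the kept tokens.
    """
    # Top-k: keep only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    # Top-p: walk down the ranking until the cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}

# Hypothetical next-token probabilities (tokens and values are made up).
next_token_probs = {"a": 0.50, "b": 0.30, "c": 0.15, "d": 0.04, "e": 0.01}
filtered = filter_top_k_top_p(next_token_probs, top_k=3, top_p=0.8)
print(sorted(filtered))
```

Lower values of top_k or top_p make the output more conservative; higher values allow rarer tokens and more varied text.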
Cloud GPU Suggestion: For enhanced performance, consider using cloud GPUs from providers like Google Cloud or AWS.
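If a GPU is available, the pipeline can be placed on it through the device argument (0 for the first CUDA device, -1 for CPU). The sketch below picks the device automatically; the helper names pick_device and build_generator are hypothetical and exist only for this example.

```python
def pick_device():
    """Return 0 (first CUDA GPU) if PyTorch sees one, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1

def build_generator(model_dir="yhavinga/gpt2-large-dutch"):
    """Create a text-generation pipeline on the selected device."""
    # Imported lazily; the model weights are downloaded on first use.
    from transformers import pipeline
    return pipeline("text-generation", model=model_dir, device=pick_device())

print("pipeline device index:", pick_device())
```

Calling build_generator() then returns a generator that can be used exactly like the one in the loading step.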
License
The model and its components adhere to the licensing terms provided by Hugging Face. Specific details regarding the model's license can be found directly on the model's Hugging Face page.