GPT-EST-BASE
Introduction
GPT-EST-BASE is a base-size GPT-2 model for generating Estonian text, developed by the tartuNLP group. It was trained from scratch for three epochs on a dataset of roughly 2.2 billion words drawn from sources such as the Estonian National Corpus, News Crawl, and Common Crawl. The model was initially named "gpt-4-est-base" and was later renamed to avoid misleading implications about its capabilities.
Architecture
The model is a 12-layer transformer with 12 attention heads per layer, an embedding size of 768, and a context size of 1024 tokens, for a total of approximately 118.68 million parameters.
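For reference, these figures correspond to a Hugging Face GPT2Config roughly like the sketch below. The vocabulary size is not stated in this documentation, so the value used here is only an assumption; the authoritative values live in the repository's config.json.

from transformers import GPT2Config, GPT2LMHeadModel

# Illustrative configuration matching the architecture figures quoted above.
# vocab_size is an assumption for this sketch; the real value is defined in
# the model repository's config.json and largely determines the ~118.68M total.
config = GPT2Config(
    n_layer=12,        # 12 transformer layers
    n_head=12,         # 12 attention heads per layer
    n_embd=768,        # embedding size
    n_positions=1024,  # context size
    vocab_size=50257,  # assumed; check config.json for the actual Estonian vocabulary size
)

model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e6:.2f}M parameters")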
Training
The training data had a domain tag prepended to each example. The tags indicate the text's domain: >general<, >web<, >news<, >doaj<, and >wiki<. The same tags should be prepended to prompts given to the model so that generation follows the desired domain, for example ">web< Kas tead, et". A small sketch of this prompt format follows.
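The helper below only illustrates the tag-prefix convention described above; the tag strings are taken verbatim from the list, while the function name and example prompt are invented for the sketch.

# Domain tags used during training; prompts should start with one of them.
DOMAIN_TAGS = [">general<", ">web<", ">news<", ">doaj<", ">wiki<"]

def make_prompt(text: str, domain: str = ">general<") -> str:
    # Prepend the chosen domain tag to steer generation toward that domain.
    if domain not in DOMAIN_TAGS:
        raise ValueError(f"unknown domain tag: {domain}")
    return f"{domain} {text}"

print(make_prompt("Kas tead, et", domain=">web<"))  # -> ">web< Kas tead, et"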
Guide: Running Locally
- Setup Environment:
- Install the required frameworks:
pip install transformers==4.13.0.dev0
pip install torch==1.10.0+cu102
pip install datasets==1.15.1
pip install tokenizers==0.10.3
- Download Model:
- Access GPT-EST-BASE from the Hugging Face model hub.
- Run Inference:
- Load the model with the Transformers library and run inference with the appropriate domain-tag prefix, as in the sketch after this list.
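A minimal inference sketch, under two assumptions: the repository id used here is the model's original name mentioned in the introduction and should be replaced with whatever id the Hugging Face hub currently lists, and the sampling parameters are arbitrary illustrative choices.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed repository id (the model's original name); substitute the current hub id.
model_id = "tartuNLP/gpt-4-est-base"

# from_pretrained downloads the files from the Hugging Face hub on first use.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Prompts must start with one of the domain tags described in the Training section.
prompt = ">web< Kas tead, et"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,  # length of the continuation to sample
        do_sample=True,     # sample instead of greedy decoding
        top_p=0.95,
        temperature=0.8,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))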
Cloud GPUs: For faster inference, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure.
License
The documentation does not specify a license for GPT-EST-BASE. Refer to the Hugging Face repository or contact the maintainers for precise licensing information.