GPT-EST-BASE

tartuNLP

Introduction

GPT-EST-BASE is a base-size GPT-2 model for generating Estonian text. It was trained from scratch for three epochs on a 2.2-billion-word dataset drawn from sources such as the Estonian National Corpus, News Crawl, and Common Crawl. The model was initially named "gpt-4-est-base" and was renamed to avoid misleading implications about its capabilities.

Architecture

The model is a 12-layer transformer with 12 attention heads per layer, an embedding size of 768, and a context size of 1024, for a total of approximately 118.68 million parameters.
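
A minimal sketch of how this architecture maps onto a Hugging Face GPT2Config (not the official training code; the vocab_size below is a placeholder, since the real value is set by the model's own tokenizer):

  from transformers import GPT2Config, GPT2LMHeadModel

  config = GPT2Config(
      n_layer=12,        # 12 transformer layers
      n_head=12,         # 12 attention heads per layer
      n_embd=768,        # embedding size
      n_positions=1024,  # context size
      vocab_size=50257,  # placeholder; the real vocabulary is Estonian-specific
  )
  model = GPT2LMHeadModel(config)
  print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")

Because the parameter count depends on the vocabulary size, this placeholder config prints a figure close to, but not exactly, the 118.68 million stated above.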

Training

Each training example was prepended with a tag marking its text domain: >general<, >web<, >news<, >doaj<, or >wiki<. The same tags should be used as prefixes at inference time so that prompts match the training format, e.g. ">web< Kas tead, et" (">web< Did you know that").
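
As a small illustration (the helper name is assumed, not from any released code), prompts can be tagged to match this training format as follows:

  DOMAIN_TAGS = (">general<", ">web<", ">news<", ">doaj<", ">wiki<")

  def tag_prompt(text: str, domain: str = ">web<") -> str:
      # Prefix a prompt with one of the model's domain tags,
      # mirroring the "tag + space + text" format used in training.
      if domain not in DOMAIN_TAGS:
          raise ValueError(f"unknown domain tag: {domain}")
      return f"{domain} {text}"

  print(tag_prompt("Kas tead, et"))  # -> >web< Kas tead, et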

Guide: Running Locally

  1. Setup Environment:
    • Install the required frameworks (the versions the model was published with):
      pip install "transformers>=4.13.0"  # the original pin, 4.13.0.dev0, was a source-only dev build
      pip install torch==1.10.0+cu102 -f https://download.pytorch.org/whl/cu102/torch_stable.html
      pip install datasets==1.15.1
      pip install tokenizers==0.10.3
      
  2. Download Model:
    • Download GPT-EST-BASE from the tartuNLP organization on the Hugging Face model hub.
  3. Run Inference:
    • Load the model with the Transformers library and prompt it with a domain-tagged prefix; a sketch follows this list.
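
A minimal inference sketch follows; the model id is an assumption based on the names above, so verify the exact id on the Hugging Face hub:

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  MODEL_ID = "tartuNLP/gpt-est-base"  # assumed hub id; check the tartuNLP page

  tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
  model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
  model.eval()

  # Prepend a domain tag so the prompt matches the training format.
  inputs = tokenizer(">web< Kas tead, et", return_tensors="pt")

  with torch.no_grad():
      output = model.generate(
          **inputs,
          max_new_tokens=40,
          do_sample=True,
          top_p=0.95,
      )
  print(tokenizer.decode(output[0], skip_special_tokens=True))

The sampling parameters here (do_sample, top_p) are illustrative defaults, not values recommended by the model's authors.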

Cloud GPUs: If no local GPU is available, consider cloud GPU services such as AWS EC2, Google Cloud, or Azure for faster generation.

License

The documentation does not specify a license for GPT-EST-BASE. Refer to the Hugging Face repository or contact the maintainers for precise licensing information.
