GPT-2 Small Portuguese

pierreguillou

Introduction

GPorTuguese-2, or Portuguese GPT-2 small, is a language model developed for Portuguese text generation and other NLP tasks. It is based on the GPT-2 small model and was fine-tuned on Portuguese Wikipedia data using transfer learning and fine-tuning techniques, in about a day on a single NVIDIA V100 32GB GPU.

Architecture

The model is a fine-tuned version of the English GPT-2 small model, built with Hugging Face's Transformers and Tokenizers libraries on top of the fastai v2 deep learning framework. The architecture has 124 million parameters (GPT-2 small's standard configuration of 12 layers, 12 attention heads, and 768-dimensional embeddings) and generates text by predicting the next token in a sequence.
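
As a quick check of the configuration, the model's hyperparameters can be inspected with Transformers' AutoConfig (a minimal sketch; for GPT-2 small the printed values should be 12 layers, 12 heads, and 768-dimensional embeddings):

    from transformers import AutoConfig
    config = AutoConfig.from_pretrained("pierreguillou/gpt2-small-portuguese")
    # GPT-2 small: 12 layers, 12 attention heads, 768-dim embeddings
    print(config.n_layer, config.n_head, config.n_embd)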

Training

GPorTuguese-2 was trained on 1.28 GB of Portuguese Wikipedia text, with a 0.32 GB validation set. After five epochs, training reached a loss of 3.17, an accuracy of 37.99%, and a perplexity of 23.76. Training was run with distributed data parallel (DDP); the total time could be reduced further by training on multiple GPUs.
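
As a sanity check on these numbers, perplexity is simply the exponential of the cross-entropy loss, so the two reported metrics are consistent (exp(3.17) ≈ 23.8, matching the reported 23.76 up to rounding of the loss):

    import math
    # perplexity = exp(cross-entropy loss)
    print(math.exp(3.17))  # ≈ 23.8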

Guide: Running Locally

To run GPorTuguese-2 locally using PyTorch, follow these steps:

  1. Install Prerequisites:
    pip install transformers torch
    
  2. Load the Model and Tokenizer:
from transformers import AutoTokenizer, AutoModelForCausalLM  # AutoModelWithLMHead is deprecated
    tokenizer = AutoTokenizer.from_pretrained("pierreguillou/gpt2-small-portuguese")
    model = AutoModelForCausalLM.from_pretrained("pierreguillou/gpt2-small-portuguese")
    
  3. Generate Text:
    text = "Quem era Jim Henson? Jim Henson era um"
    inputs = tokenizer(text, return_tensors="pt")
    sample_outputs = model.generate(inputs.input_ids, max_length=50, top_k=40)
    print(tokenizer.decode(sample_outputs[0]))
    
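Alternatively, the high-level pipeline API wraps tokenization, model loading, and generation into a single call (a minimal sketch; the sampling parameters mirror those used above):

    from transformers import pipeline
    # downloads the model on first use and handles tokenization internally
    generator = pipeline("text-generation", model="pierreguillou/gpt2-small-portuguese")
    result = generator("Quem era Jim Henson? Jim Henson era um", max_length=50, do_sample=True, top_k=40)
    print(result[0]["generated_text"])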

For TensorFlow, swap in the TF model class (e.g., TFAutoModelForCausalLM) when loading; tokenization and generation otherwise follow the same pattern, as sketched below.
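
A minimal TensorFlow sketch (assuming the PyTorch weights are converted on the fly via from_pt=True; if the repository hosts native TF weights, the flag can be dropped):

    from transformers import AutoTokenizer, TFAutoModelForCausalLM
    tokenizer = AutoTokenizer.from_pretrained("pierreguillou/gpt2-small-portuguese")
    model = TFAutoModelForCausalLM.from_pretrained("pierreguillou/gpt2-small-portuguese", from_pt=True)
    inputs = tokenizer("Quem era Jim Henson? Jim Henson era um", return_tensors="tf")
    sample_outputs = model.generate(inputs.input_ids, do_sample=True, max_length=50, top_k=40)
    print(tokenizer.decode(sample_outputs[0], skip_special_tokens=True))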

Cloud GPUs: For faster inference or fine-tuning, consider GPU instances from cloud providers such as AWS EC2, Google Cloud, or Azure.
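
On a GPU instance, the only change needed is moving the model and inputs onto the device (a minimal sketch that falls back to CPU when no GPU is available):

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained("pierreguillou/gpt2-small-portuguese")
    model = AutoModelForCausalLM.from_pretrained("pierreguillou/gpt2-small-portuguese").to(device)
    inputs = tokenizer("Quem era Jim Henson? Jim Henson era um", return_tensors="pt").to(device)
    outputs = model.generate(inputs.input_ids, do_sample=True, max_length=50, top_k=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))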

License

GPorTuguese-2 is released under the MIT License, allowing for broad reuse and modification with proper attribution.
