roberta-base-ca-v2

projecte-aina

Introduction

roberta-base-ca-v2 is a transformer-based masked language model for the Catalan language. It is a variant of the RoBERTa base model, trained on a medium-sized corpus collected from publicly available sources and web crawls. The model is intended for masked language modeling and for fine-tuning on non-generative downstream tasks such as Question Answering and Text Classification.
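
As a rough illustration of that fine-tuning workflow, the sketch below attaches a sequence-classification head to the checkpoint; the CSV files, label count, and training arguments are hypothetical placeholders and not part of this card.

    # Illustrative fine-tuning sketch: the CSV files, label count, and training
    # arguments are hypothetical placeholders, not values from this model card.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    model_name = "projecte-aina/roberta-base-ca-v2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # A randomly initialized classification head is placed on top of the pretrained encoder.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Hypothetical dataset files with "text" and "label" columns.
    dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    dataset = dataset.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="roberta-ca-finetuned", num_train_epochs=3),
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
        tokenizer=tokenizer,
    )
    trainer.train()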

Architecture

The model is based on the RoBERTa base architecture: a transformer encoder trained with a masked language modeling objective. Tokenization uses Byte-Pair Encoding (BPE) with a vocabulary of 50,262 tokens. Training was performed on 16 NVIDIA V100 GPUs and took 96 hours.
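
To inspect the BPE tokenizer, the short snippet below loads it, checks the vocabulary size, and tokenizes an arbitrary Catalan sentence (the sentence is only an example, not from the card).

    # Quick look at the BPE tokenizer; the example sentence is arbitrary.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("projecte-aina/roberta-base-ca-v2")
    print(tokenizer.vocab_size)  # expected to report the 50,262-token vocabulary
    print(tokenizer.tokenize("El temps a Barcelona és assolellat."))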

Training

Training Data

The training corpus combines several Catalan corpora obtained from web crawling and public datasets, totalling several gigabytes of text. Key sources include the Catalan Crawling corpus, the Catalan Wikipedia, and OpenSubtitles, among others, giving a diverse and comprehensive dataset.

Training Procedure

Training followed the standard masked language modeling objective, using the same hyperparameters as the original RoBERTa base model. The corpus was tokenized with the BPE tokenizer described above, and training ran on the 16 NVIDIA V100 GPUs noted in the Architecture section.
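
The card does not publish the original training script, but a minimal masked language modeling setup with the transformers Trainer looks roughly like the sketch below; the corpus file and hyperparameters are placeholders.

    # Generic masked-language-modeling setup, not the original training script;
    # the corpus file and hyperparameters below are placeholders.
    from datasets import load_dataset
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("projecte-aina/roberta-base-ca-v2")
    model = AutoModelForMaskedLM.from_pretrained("projecte-aina/roberta-base-ca-v2")

    # Placeholder plain-text corpus, one document or sentence per line.
    raw = load_dataset("text", data_files={"train": "catalan_corpus.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

    # Dynamically mask 15% of tokens per batch, as in the original RoBERTa recipe.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="mlm-ca", per_device_train_batch_size=8),
        train_dataset=tokenized["train"],
        data_collator=collator,
    )
    trainer.train()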

Guide: Running Locally

To run the roberta-base-ca-v2 model locally, use the following steps:

  1. Install Dependencies: Ensure you have the transformers library installed.

    pip install transformers
    
  2. Load the Model and Tokenizer:

    from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline
    from pprint import pprint
    
    tokenizer = AutoTokenizer.from_pretrained('projecte-aina/roberta-base-ca-v2')
    model = AutoModelForMaskedLM.from_pretrained('projecte-aina/roberta-base-ca-v2')
    
  3. Create a Fill-Mask Pipeline:

    pipeline = FillMaskPipeline(model, tokenizer)
    text = "Em dic <mask>."
    res = pipeline(text)
    pprint([r['token_str'] for r in res])
    
  4. Use a GPU for Heavier Workloads: For intensive inference or fine-tuning, consider cloud GPUs such as those offered by AWS, Google Cloud, or Azure; a sketch for placing the pipeline on a local GPU follows this list.
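
If a CUDA device is available, the pipeline can be moved onto it; the device index below is an assumption for a single-GPU machine.

    # Run the fill-mask pipeline on a GPU when available; device 0 assumes a single CUDA device.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

    tokenizer = AutoTokenizer.from_pretrained('projecte-aina/roberta-base-ca-v2')
    model = AutoModelForMaskedLM.from_pretrained('projecte-aina/roberta-base-ca-v2')
    device = 0 if torch.cuda.is_available() else -1  # -1 keeps the pipeline on the CPU
    pipeline = FillMaskPipeline(model=model, tokenizer=tokenizer, device=device)
    print(pipeline("Em dic <mask>."))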

License

The model is released under the Apache License 2.0, which allows broad use, modification, and redistribution. The full license text is available at https://www.apache.org/licenses/LICENSE-2.0.
