roberta-base-ca-v2
projecte-aina
Introduction
The roberta-base-ca-v2 is a transformer-based masked language model for the Catalan language. It is a variant of the RoBERTa base model, trained on a medium-sized corpus collected from publicly available sources and web crawlers. The model is primarily used for masked language modeling and is suitable for fine-tuning on non-generative downstream tasks such as Question Answering and Text Classification.
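As a rough illustration of the fine-tuning use case, the checkpoint can be loaded with a task-specific head through the transformers API. The snippet below is a minimal sketch; the num_labels value is a hypothetical placeholder for a downstream text classification task, not part of the released model.

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Load the pretrained Catalan encoder with a randomly initialised classification head.
    tokenizer = AutoTokenizer.from_pretrained('projecte-aina/roberta-base-ca-v2')
    model = AutoModelForSequenceClassification.from_pretrained(
        'projecte-aina/roberta-base-ca-v2',
        num_labels=3,  # hypothetical label count for an illustrative classification task
    )

    # The model is now ready to be fine-tuned on a labelled Catalan dataset,
    # e.g. with the Trainer API or a custom training loop.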
Architecture
The model is based on the RoBERTa architecture, using a transformer encoder trained with the masked language modeling objective. It employs Byte-Pair Encoding (BPE) for tokenization, with a vocabulary of 50,262 tokens. Training was carried out on 16 NVIDIA V100 GPUs over a period of 96 hours.
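As a quick sanity check of the figures above, the tokenizer can be loaded and inspected directly. This is a minimal sketch; the printed vocabulary size is expected to match the model card's figure, and the example sentence is purely illustrative.

    from transformers import AutoTokenizer

    # Load the BPE tokenizer shipped with the checkpoint.
    tokenizer = AutoTokenizer.from_pretrained('projecte-aina/roberta-base-ca-v2')

    print(tokenizer.vocab_size)  # expected to report 50262, per the model card
    print(tokenizer.tokenize("Barcelona és la capital de Catalunya."))  # BPE subword pieces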
Training
Training Data
The training corpus combines several Catalan corpora gathered from web crawling and from public datasets, totalling several gigabytes of text. Key sources include the Catalan Crawling corpus, the Catalan Wikipedia, and Open Subtitles, among others, yielding a diverse and comprehensive dataset.
Training Procedure
The model was trained with the standard masked language modeling objective, using the same hyperparameters as the original RoBERTa base model. Texts were tokenized with the BPE tokenizer described above, and training relied on the significant computational resources listed in the Architecture section.
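For orientation only, the sketch below shows how a RoBERTa-style masked language modeling step is typically wired up with the transformers data collator. The 15% masking probability and the toy input are illustrative assumptions, not the project's published training recipe.

    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling)

    tokenizer = AutoTokenizer.from_pretrained('projecte-aina/roberta-base-ca-v2')
    model = AutoModelForMaskedLM.from_pretrained('projecte-aina/roberta-base-ca-v2')

    # Dynamic masking in the RoBERTa style: tokens are masked when each batch is built.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15  # assumed value, for illustration
    )

    batch = collator([tokenizer("Text d'exemple en català per a l'entrenament.")])
    outputs = model(**batch)
    print(outputs.loss)  # masked-LM loss for this toy batch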
Guide: Running Locally
To run the roberta-base-ca-v2 model locally, use the following steps:
- Install Dependencies: Ensure you have the transformers library installed.

    pip install transformers
- Load the Model and Tokenizer:

    from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline
    from pprint import pprint

    tokenizer = AutoTokenizer.from_pretrained('projecte-aina/roberta-base-ca-v2')
    model = AutoModelForMaskedLM.from_pretrained('projecte-aina/roberta-base-ca-v2')
- Create a Fill-Mask Pipeline:

    pipeline = FillMaskPipeline(model, tokenizer)
    text = "Em dic <mask>."
    res = pipeline(text)
    pprint([r['token_str'] for r in res])
- Cloud GPUs: For intensive tasks or fine-tuning, consider cloud-based GPUs such as those provided by AWS, Google Cloud, or Azure for better performance; see the sketch after this list.
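If a GPU is available, whether local or from one of the cloud providers above, the same fill-mask task can be placed on it. This is a minimal sketch assuming a single CUDA device and using the generic pipeline helper.

    from transformers import pipeline

    # Build a fill-mask pipeline directly from the Hub and place it on the first GPU.
    fill_mask = pipeline(
        "fill-mask",
        model='projecte-aina/roberta-base-ca-v2',
        device=0,  # index of the CUDA device; use device=-1 to stay on CPU
    )

    print(fill_mask("Em dic <mask>."))  # top candidate tokens with scores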
License
The model is released under the Apache License 2.0, which allows broad use, modification, and distribution. The full license text is available at https://www.apache.org/licenses/LICENSE-2.0.