xlm-roberta-base
FacebookAI
Introduction
XLM-RoBERTa is a multilingual model pre-trained on 2.5TB of filtered CommonCrawl data across 100 languages. It was introduced in the paper "Unsupervised Cross-lingual Representation Learning at Scale" by Conneau et al. This model is intended for tasks like sequence classification, token classification, and question answering, rather than text generation.
Architecture
XLM-RoBERTa is a transformer-based model that extends RoBERTa to the multilingual setting. It is pre-trained with a masked language modeling (MLM) objective: 15% of the tokens in the input are masked, and the model learns to predict them. This lets the model build a bidirectional representation of text and extract meaningful features from multilingual data.
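To make the MLM objective concrete, here is a minimal sketch of masking a single token and asking the pre-trained model to fill it in. It assumes the transformers and torch packages are installed; the example sentence is purely illustrative.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Mask one token and let the pre-trained model predict it.
text = "Paris is the <mask> of France."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take the highest-scoring vocabulary entry.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))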
Training
The model was trained in a self-supervised manner on a large corpus, without human-labeled data: inputs and labels are generated automatically from raw text. The MLM objective lets the model learn internal representations of 100 languages that are useful for a variety of downstream tasks once the model is fine-tuned.
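As an illustration of how the pre-trained encoder is typically adapted to a downstream task, here is a minimal fine-tuning sketch for sequence classification. It assumes the transformers and torch packages; the two example sentences, their sentiment labels, and the learning rate are illustrative assumptions, not part of the original model card.

import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

# Toy multilingual batch with made-up sentiment labels (illustrative only).
texts = ["I loved this film.", "Ce film était terrible."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # the classification head computes the loss
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))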
Guide: Running Locally
- Install Transformers:
pip install transformers
- Use the Model:
from transformers import pipeline

unmasker = pipeline('fill-mask', model='xlm-roberta-base')
result = unmasker("Hello I'm a <mask> model.")
print(result)
- Extract Features with PyTorch (a sentence-embedding variant follows this list):
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# prepare input
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')

# forward pass
output = model(**encoded_input)
- Suggested Cloud GPUs: To handle large-scale processing, consider using cloud GPUs from providers like AWS, GCP, or Azure.
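The feature-extraction snippet above returns masked-LM logits. If sentence-level features are what you need, a variant sketch using the plain encoder is shown below; mean pooling over the last hidden state is one common choice, not something the model card prescribes.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    output = model(**encoded_input)

# Mean-pool token embeddings into one vector per sentence (one common choice).
sentence_embedding = output.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # (1, 768) for the base model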
License
XLM-RoBERTa is released under the MIT License, allowing for wide usage and adaptation in both academic and commercial projects.