xlm-roberta-large (FacebookAI)
Introduction
XLM-RoBERTa is a multilingual variant of the RoBERTa model, pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages. Introduced in the paper "Unsupervised Cross-lingual Representation Learning at Scale" by Conneau et al., it is pre-trained with a masked language modeling objective and is intended to be fine-tuned for downstream applications such as sequence classification and token classification.
Architecture
XLM-RoBERTa employs a Transformer-based architecture, utilizing masked language modeling (MLM) as its pre-training objective. This approach masks 15% of the input words and requires the model to predict the masked words, enabling it to learn bidirectional representations. The model is suitable for extracting multilingual features from text data.
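As a quick illustration of this objective, the sketch below uses the Hugging Face fill-mask pipeline to let the model predict a masked token; the example sentence and the printed fields are illustrative assumptions, not part of the original card.
from transformers import pipeline

# Load xlm-roberta-large behind the fill-mask pipeline
unmasker = pipeline("fill-mask", model="xlm-roberta-large")

# XLM-RoBERTa uses <mask> as its mask token
predictions = unmasker("Hello I'm a <mask> model.")

# Each prediction carries the predicted token and its score
for p in predictions:
    print(p["token_str"], round(p["score"], 4))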
Training
The model was trained in a self-supervised manner using a substantial multilingual dataset from CommonCrawl. Its training involves the MLM task, allowing it to learn contextual embeddings that are effective for various languages and tasks without requiring manually labeled data.
Guide: Running Locally
To run XLM-RoBERTa locally, follow these steps:
- Install the Transformers library: ensure you have the Hugging Face Transformers library installed.
  pip install transformers
- Load the model and tokenizer:
  from transformers import AutoTokenizer, AutoModelForMaskedLM
  tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large')
  model = AutoModelForMaskedLM.from_pretrained('xlm-roberta-large')
- Prepare input data:
  text = "Replace me by any text you'd like."
  encoded_input = tokenizer(text, return_tensors='pt')
- Perform inference (see the decoding sketch after this list):
  output = model(**encoded_input)
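The output above contains raw logits over the vocabulary. As a follow-up sketch (it reuses the tokenizer and model loaded earlier; the example sentence and variable names are assumptions), here is one way to decode the most likely token at a masked position:
import torch

# Example input containing the model's mask token (illustrative assumption)
text = "The capital of France is <mask>."
encoded_input = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    output = model(**encoded_input)

# output.logits has shape (batch, sequence_length, vocab_size);
# pick the highest-scoring token at the masked position
mask_index = (encoded_input['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = output.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))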
For optimal performance, consider using cloud-based GPUs such as those offered by AWS, GCP, or Azure.
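If a GPU is available, a minimal sketch of device placement using standard PyTorch calls (assuming the model, tokenizer, and encoded_input from the steps above) might look like:
import torch

# Use CUDA when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Move the tokenized inputs onto the same device before calling the model
encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
output = model(**encoded_input)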
License
XLM-RoBERTa is released under the MIT License, allowing for wide usage and distribution, including modification and commercial use.