xlm-roberta-large

FacebookAI

Introduction

XLM-RoBERTa is a multilingual variant of the RoBERTa model, pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages. Introduced in the paper "Unsupervised Cross-lingual Representation Learning at Scale" by Conneau et al., it is pre-trained with a masked language modeling objective and is primarily intended to be fine-tuned for downstream applications such as sequence classification and token classification.
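Because the checkpoint ships with a masked-language-modeling head, the quickest way to try it is the `fill-mask` pipeline. The snippet below is a minimal sketch; the example sentences are illustrative, and `<mask>` is the model's mask token.

    from transformers import pipeline
    
    # Load the fill-mask pipeline with the xlm-roberta-large checkpoint.
    unmasker = pipeline("fill-mask", model="xlm-roberta-large")
    
    # The same checkpoint handles many languages, so prompts in different languages work.
    print(unmasker("Hello, I'm a <mask> model."))
    print(unmasker("Bonjour, je suis un modèle <mask>."))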

Architecture

XLM-RoBERTa employs a Transformer-based architecture, utilizing masked language modeling (MLM) as its pre-training objective. This approach masks 15% of the input words and requires the model to predict the masked words, enabling it to learn bidirectional representations. The model is suitable for extracting multilingual features from text data.
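For feature extraction, the final hidden states can be used directly as multilingual token or sentence representations. The sketch below loads the encoder without the MLM head via `AutoModel`; the mean-pooling step is an illustrative choice, not something prescribed by the model card.

    import torch
    from transformers import AutoTokenizer, AutoModel
    
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
    model = AutoModel.from_pretrained("xlm-roberta-large")
    
    # Sentences in different languages go through the same encoder.
    sentences = ["This is a sentence in English.", "Dies ist ein Satz auf Deutsch."]
    inputs = tokenizer(sentences, padding=True, return_tensors="pt")
    
    with torch.no_grad():
        hidden_states = model(**inputs).last_hidden_state  # (batch, seq_len, 1024)
    
    # Mean-pool over non-padding tokens to get one vector per sentence (illustrative).
    mask = inputs["attention_mask"].unsqueeze(-1)
    sentence_embeddings = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
    print(sentence_embeddings.shape)  # torch.Size([2, 1024])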

Training

The model was trained in a self-supervised manner using a substantial multilingual dataset from CommonCrawl. Its training involves the MLM task, allowing it to learn contextual embeddings that are effective for various languages and tasks without requiring manually labeled data.
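To make the objective concrete, the sketch below uses `DataCollatorForLanguageModeling` to reproduce the random 15% masking on a toy example; it only illustrates how MLM inputs and labels are constructed and is not the original pre-training pipeline.

    from transformers import AutoTokenizer, DataCollatorForLanguageModeling
    
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
    
    # mlm_probability=0.15 mirrors the 15% masking rate described above.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
    batch = collator([tokenizer("The quick brown fox jumps over the lazy dog.")])
    
    # Positions selected for prediction keep their original ids in `labels`;
    # everything else is set to -100 and ignored by the loss during pre-training.
    print(tokenizer.decode(batch["input_ids"][0]))
    print(batch["labels"][0])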

Guide: Running Locally

To run XLM-RoBERTa locally, follow these steps:

  1. Install Dependencies: Ensure you have the Hugging Face Transformers library and PyTorch (needed for the `return_tensors='pt'` calls below) installed.

    pip install transformers torch
    
  2. Load the Model and Tokenizer:

    from transformers import AutoTokenizer, AutoModelForMaskedLM
    
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
    model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")
    
  3. Prepare Input Data:

    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    
  4. Perform Inference (see the decoding sketch after this list):

    output = model(**encoded_input)
    

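To turn the raw output into an actual prediction, the input needs to contain the tokenizer's mask token. The continuation below is a sketch of decoding the top candidates for that position; the example sentence is illustrative.

    import torch
    
    # Illustrative masked sentence; <mask> is XLM-RoBERTa's mask token.
    encoded = tokenizer("The capital of France is <mask>.", return_tensors="pt")
    
    with torch.no_grad():
        logits = model(**encoded).logits
    
    # Find the mask position and list the five most likely replacement tokens.
    mask_index = (encoded["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    top_ids = logits[0, mask_index].topk(5).indices[0]
    print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))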
For optimal performance, consider using cloud-based GPUs such as those offered by AWS, GCP, or Azure.

License

XLM-RoBERTa is released under the MIT License, allowing for wide usage and distribution, including modification and commercial use.
