xlm-roberta-large-finetuned-conll03-english

FacebookAI

Introduction

The XLM-RoBERTa-large model, fine-tuned on the English portion of the CoNLL-2003 dataset, is a multilingual language model developed by Facebook AI. It is based on RoBERTa, pre-trained on a large CommonCrawl corpus, and supports token classification tasks such as Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging.

Architecture

XLM-RoBERTa is a multilingual transformer-based language model pre-trained on text in 100 languages, drawn from 2.5 TB of filtered CommonCrawl data, to learn cross-lingual representations. This specific checkpoint is the XLM-RoBERTa-large variant fine-tuned for English NER.
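
To see the base model's masked-language-modeling objective in action before fine-tuning, here is a minimal sketch using the publicly available FacebookAI/xlm-roberta-large checkpoint (the fine-tuned NER checkpoint itself is covered in the guide below):

  from transformers import pipeline

  # Load the base (pre-trained, not NER-fine-tuned) XLM-RoBERTa-large checkpoint.
  unmasker = pipeline("fill-mask", model="FacebookAI/xlm-roberta-large")

  # XLM-RoBERTa uses <mask> as its mask token; candidate fillers can come
  # from any of the 100 languages seen during pre-training.
  print(unmasker("Hello, I'm a <mask> model."))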

Training

The model was fine-tuned on the CoNLL-2003 dataset, a standard benchmark for NER. For further details about the pre-training data and procedure, refer to the XLM-RoBERTa-large model card and the associated research paper.
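
To inspect the fine-tuning data yourself, here is a minimal sketch using the Hugging Face datasets library, assuming the CoNLL-2003 dataset remains published on the Hub under the conll2003 identifier:

  from datasets import load_dataset

  # Load the CoNLL-2003 benchmark (identifier assumed per the public Hub listing).
  dataset = load_dataset("conll2003")

  # Each example pairs whitespace-split tokens with integer NER tags
  # (mapping to labels such as B-PER, I-ORG, B-LOC, O).
  example = dataset["train"][0]
  print(example["tokens"])
  print(example["ner_tags"])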

Guide: Running Locally

  1. Setup Environment

    • Install the transformers library together with a PyTorch backend:
      pip install transformers torch
      
  2. Load Model and Tokenizer

    • Use the following code to load the fine-tuned model and run NER:
      from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
      
      # Load the fine-tuned checkpoint from the Hugging Face Hub.
      tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large-finetuned-conll03-english")
      model = AutoModelForTokenClassification.from_pretrained("FacebookAI/xlm-roberta-large-finetuned-conll03-english")
      
      # Build a token-classification pipeline for NER.
      classifier = pipeline("ner", model=model, tokenizer=tokenizer)
      
      # Each prediction includes the entity label, confidence score, and character offsets.
      result = classifier("Hello I'm Omar and I live in Zürich.")
      print(result)
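
    • The pipeline above reports one prediction per subword token. To merge subwords into whole entity spans, transformers exposes an aggregation_strategy parameter; a minimal sketch reusing the objects loaded above:
      # "simple" merges adjacent subword predictions that share an entity label.
      grouped = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
      print(grouped("Hello I'm Omar and I live in Zürich."))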
      
  3. Consider Cloud GPUs

    • For efficient inference, especially on large datasets, consider cloud services such as AWS, GCP, or Azure that offer NVIDIA V100 or A100 GPUs; a device-placement sketch follows below.
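
    • A minimal sketch for placing the pipeline on the first CUDA device, assuming a GPU-enabled PyTorch installation:
      # device=0 selects the first GPU; omit it (or pass device=-1) to stay on CPU.
      classifier = pipeline("ner", model=model, tokenizer=tokenizer, device=0)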

License

The license for this model is not clearly stated. Review the Hugging Face model card and associated resources for licensing information before use.
