XLM-RoBERTa Base Language Detection

Introduction

The XLM-RoBERTa-Base-Language-Detection model is a fine-tuned version of the XLM-RoBERTa transformer with a sequence-classification head for language identification across 20 languages. Given a piece of text, it predicts which of the supported languages the text is written in, making it well suited to multilingual datasets.

Architecture

The model uses the XLM-RoBERTa transformer architecture with an additional linear classification layer on top of the pooled output. This head maps the pooled sentence representation to the 20 supported language classes, so language detection reduces to a standard sequence-classification forward pass.
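
As a quick way to inspect this setup (a sketch, not part of the original card), you can load the checkpoint and print its configuration and classification head:

    from transformers import AutoModelForSequenceClassification
    
    model = AutoModelForSequenceClassification.from_pretrained(
        "papluca/xlm-roberta-base-language-detection"
    )
    
    print(model.config.num_labels)  # 20 language classes
    print(model.classifier)         # classification head applied to the pooled output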

Training

The model was fine-tuned on the Language Identification dataset, which comprises 70k training samples plus 10k samples each for validation and testing. It achieves 99.6% accuracy on the test set, with per-language precision and recall reported in the model card. Fine-tuning used the Trainer API with a learning rate of 2e-05 and a train batch size of 64, running for two epochs with the Adam optimizer.
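
For orientation, a fine-tuning run with these hyperparameters might look roughly like the following. This is a minimal sketch rather than the original training script: the dataset id and its column names ("text", "labels") are assumptions, and evaluation metrics are omitted.

    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)
    
    # Dataset id and column names are assumptions; adjust them to match
    # your copy of the Language Identification dataset.
    ds = load_dataset("papluca/language-identification")
    
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    
    # Build the label vocabulary from the string language codes in the data.
    label_names = sorted(set(ds["train"]["labels"]))
    label2id = {name: i for i, name in enumerate(label_names)}
    id2label = {i: name for name, i in label2id.items()}
    
    def preprocess(batch):
        enc = tokenizer(batch["text"], truncation=True)
        enc["labels"] = [label2id[code] for code in batch["labels"]]
        return enc
    
    ds = ds.map(preprocess, batched=True, remove_columns=ds["train"].column_names)
    
    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base",
        num_labels=len(label_names),
        id2label=id2label,
        label2id=label2id,
    )
    
    # Hyperparameters reported above: lr 2e-05, batch size 64, 2 epochs, Adam.
    args = TrainingArguments(
        output_dir="xlm-roberta-base-language-detection",
        learning_rate=2e-5,
        per_device_train_batch_size=64,
        num_train_epochs=2,
    )
    
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=ds["train"],
        eval_dataset=ds["validation"],
        tokenizer=tokenizer,  # enables dynamic padding via the default collator
    )
    trainer.train()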

Guide: Running Locally

To run the model locally, you can use either the high-level pipeline API or handle the tokenizer and model separately. Here are the steps:

  1. Install the Transformers library and PyTorch:

    pip install transformers torch
    
  2. Using the Pipeline API:

    from transformers import pipeline
    
    text = ["Brevity is the soul of wit.", "Amor, ch'a nullo amato amar perdona."]
    model_ckpt = "papluca/xlm-roberta-base-language-detection"
    pipe = pipeline("text-classification", model=model_ckpt)
    
    # Each input yields a list like [{"label": "en", "score": ...}] containing
    # the top-scoring language code.
    results = pipe(text, top_k=1, truncation=True)
    
  3. Using Tokenizer and Model Separately:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    
    text = ["Brevity is the soul of wit.", "Amor, ch'a nullo amato amar perdona."]
    model_ckpt = "papluca/xlm-roberta-base-language-detection"
    tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(model_ckpt)
    
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
    
    # Inference only, so gradient tracking is disabled.
    with torch.no_grad():
        logits = model(**inputs).logits
    
    # Convert logits to per-language probabilities, then map each top class
    # id back to its language code.
    probs = torch.softmax(logits, dim=-1)
    preds = [model.config.id2label[i] for i in probs.argmax(dim=-1).tolist()]
    
  4. Cloud GPUs: For faster inference on large batches, consider running the model on a cloud GPU service such as AWS EC2, Google Cloud Platform, or Azure; a minimal GPU sketch follows this list.
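
As an illustration of step 4 (a sketch, assuming a machine with a single CUDA GPU), the pipeline from step 2 can be moved to the GPU by passing a device index:

    import torch
    from transformers import pipeline
    
    model_ckpt = "papluca/xlm-roberta-base-language-detection"
    # device=0 selects the first CUDA GPU; -1 falls back to CPU.
    device = 0 if torch.cuda.is_available() else -1
    pipe = pipeline("text-classification", model=model_ckpt, device=device)
    
    results = pipe(["Brevity is the soul of wit."], top_k=1, truncation=True)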

License

This model is released under the MIT License, which permits free use, modification, and distribution.
