XLM-RoBERTa-Base-Language-Detection
papluca

Introduction
The XLM-RoBERTa-Base-Language-Detection model is a fine-tuned version of the XLM-RoBERTa transformer with a classification head for language identification. It is designed as a sequence classifier for detecting which of 20 supported languages a text is written in.
Architecture
The model uses the XLM-RoBERTa transformer architecture with an additional linear classification layer on top of the pooled output. This head maps the pooled sequence representation to one of the 20 supported language labels.
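As a minimal sketch of this setup, the snippet below loads the checkpoint with AutoModelForSequenceClassification and inspects the classification head's configuration; the exact contents of the label mapping shown in the comment are illustrative, not guaranteed.

```python
from transformers import AutoModelForSequenceClassification

# Load the fine-tuned checkpoint; the sequence-classification class attaches
# the linear head described above on top of the XLM-RoBERTa encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    "papluca/xlm-roberta-base-language-detection"
)

print(model.config.num_labels)  # 20 language labels
print(model.config.id2label)    # e.g. {0: 'ar', 1: 'bg', ...} (ordering illustrative)
```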
Training
The model was fine-tuned on the Language Identification dataset, comprising 70k training samples and 10k samples each for validation and testing. It achieved 99.6% accuracy on the test set, with per-language precision and recall reported in the model card. Fine-tuning was conducted with the Trainer API using a learning rate of 2e-05, a training batch size of 64, two epochs, and the Adam optimizer.
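The sketch below shows one plausible way the stated hyperparameters map onto the Trainer API. The dataset loading, evaluation settings, and any arguments not listed above are assumptions of this sketch, not the authors' actual training script.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base_ckpt = "xlm-roberta-base"  # fine-tuning starts from the base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_ckpt)
model = AutoModelForSequenceClassification.from_pretrained(base_ckpt, num_labels=20)

# Hyperparameters stated above; everything else is left at Trainer defaults
# (which use an Adam-style optimizer) and is an assumption of this sketch.
args = TrainingArguments(
    output_dir="xlm-roberta-base-language-detection",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    num_train_epochs=2,
)

# train_dataset / eval_dataset stand in for the tokenized Language
# Identification splits described above.
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```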
Guide: Running Locally
To run the model locally, you can use either the high-level pipeline API or handle the tokenizer and model separately. Here are the steps:
- Install the Transformers Library:

```bash
pip install transformers torch
```
- Using the Pipeline API:

```python
from transformers import pipeline

text = ["Brevity is the soul of wit.", "Amor, ch'a nullo amato amar perdona."]
model_ckpt = "papluca/xlm-roberta-base-language-detection"

pipe = pipeline("text-classification", model=model_ckpt)
results = pipe(text, top_k=1, truncation=True)
```
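With top_k=1, results contains one {'label': ..., 'score': ...} entry per input (depending on the transformers version, each may be wrapped in a one-element list); for the two example sentences above, the predicted labels should be the ISO codes en and it respectively.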
- Using Tokenizer and Model Separately:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

text = ["Brevity is the soul of wit.", "Amor, ch'a nullo amato amar perdona."]
model_ckpt = "papluca/xlm-roberta-base-language-detection"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt)

inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to per-language probabilities
preds = torch.softmax(logits, dim=-1)
```
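To turn the probability matrix into language labels, take the argmax of each row and look the indices up in the model's id2label mapping. A short continuation of the snippet above (the printed labels are the expected, not guaranteed, output):

```python
# Map each row's highest-probability index to its language code.
ids = preds.argmax(dim=-1)
labels = [model.config.id2label[i.item()] for i in ids]
print(labels)  # expected: ['en', 'it'] for the two example sentences
```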
- Cloud GPUs: For faster inference on large batches, consider running the model on cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.
License
This model is released under the MIT License, which permits broad use, modification, and distribution.