XLM-RoBERTa-Base Multilingual Text Genre Classifier
Introduction
The XLM-RoBERTa-Base Multilingual Text Genre Classifier is a text classification model based on the xlm-roberta-base architecture and fine-tuned on the multilingual X-GENRE genre dataset. It performs automatic genre identification across 94 languages and is particularly suited to enriching large text collections with genre labels.
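Although the guide below uses simpletransformers, the checkpoint is a standard Hugging Face model, so a quick illustrative sketch with the transformers pipeline also works; the example texts here are invented:

```python
from transformers import pipeline

# Load the genre classifier; inputs may be in any language covered by xlm-roberta-base.
classifier = pipeline(
    "text-classification",
    model="classla/xlm-roberta-base-multilingual-text-genre-classifier",
)

# Invented inputs in two languages; each result pairs a genre label with a confidence score.
print(classifier([
    "First, preheat the oven to 180 degrees and grease the baking tin.",
    "Das Gericht entschied am Montag zugunsten des Klägers.",
]))
```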
Architecture
The model uses the XLM-RoBERTa architecture developed by Facebook AI, a transformer-based model pre-trained on a large multilingual corpus. The X-GENRE classifier was fine-tuned specifically for genre detection on a manually annotated dataset.
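To make this concrete, here is a short sketch (not part of the original card) that inspects the checkpoint configuration with the transformers AutoConfig API, showing the base XLM-RoBERTa dimensions and the genre labels added by fine-tuning:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "classla/xlm-roberta-base-multilingual-text-genre-classifier"
)

# Encoder depth and hidden size inherited from xlm-roberta-base.
print(config.num_hidden_layers, config.hidden_size)

# Mapping from class ids to the genre labels learned during fine-tuning.
print(config.id2label)
```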
Training
The model was fine-tuned on the X-GENRE dataset, with a focus on performance in both in-dataset and out-of-dataset scenarios. It was benchmarked against GPT-4, GPT-3.5-Turbo, SVM, and logistic regression baselines, among others, and achieved superior results, particularly in the out-of-dataset setting.
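The card does not reproduce the training code, but a rough sketch of what such a fine-tuning run looks like in simpletransformers, using the hyperparameters quoted in the guide below; the two-row training frame and the label count are illustrative assumptions, not the actual X-GENRE data:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Hypothetical training data; simpletransformers expects "text" and "labels" columns.
train_df = pd.DataFrame({
    "text": ["First, preheat the oven.", "The court ruled on Monday."],
    "labels": [0, 1],  # numeric genre ids
})

model = ClassificationModel(
    "xlmroberta",
    "xlm-roberta-base",
    num_labels=9,  # assumption: size of the X-GENRE label set; verify against the dataset
    use_cuda=False,
    args={
        "num_train_epochs": 15,
        "learning_rate": 1e-5,
        "max_seq_length": 512,
        "overwrite_output_dir": True,
    },
)

model.train_model(train_df)
```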
Guide: Running Locally
To run the model locally with PyTorch and the simpletransformers library, follow these steps:

- Install Dependencies: Ensure Python and PyTorch are installed, then install simpletransformers with pip:

```bash
pip install simpletransformers
```
- Load the Model: Use the following Python code to initialize the model and run predictions (a note on mapping the numeric predictions back to genre labels follows this list):

```python
from simpletransformers.classification import ClassificationModel

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "silent": True,
}

model = ClassificationModel(
    "xlmroberta",
    "classla/xlm-roberta-base-multilingual-text-genre-classifier",
    use_cuda=True,
    args=model_args,
)

predictions, _ = model.predict(["Your text here", "Another text here"])
```
- GPU Recommendation: For better performance, especially on large datasets, consider cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
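The predict call above returns numeric class ids. Below is a minimal sketch, assuming the label mapping ships in the checkpoint config (as it does for typical Hugging Face models), of turning those ids into genre names; it also picks use_cuda from the runtime so the snippet runs on CPU-only machines:

```python
import torch
from simpletransformers.classification import ClassificationModel

# Fall back to CPU when no GPU is present instead of hard-coding use_cuda=True.
model = ClassificationModel(
    "xlmroberta",
    "classla/xlm-roberta-base-multilingual-text-genre-classifier",
    use_cuda=torch.cuda.is_available(),
)

predictions, _ = model.predict(["How do I install Python on Windows?"])

# id2label maps each numeric class id to its genre name.
print([model.config.id2label[p] for p in predictions])
```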
License
The XLM-RoBERTa-Base Multilingual Text Genre Classifier is licensed under the CC-BY-SA-4.0 license. This license allows for adaptation and sharing under similar terms, with appropriate credit given to the original authors.