XLM-RoBERTa-Base Multilingual Text Genre Classifier

classla

Introduction

The XLM-RoBERTa-Base Multilingual Text Genre Classifier is a text classification model based on xlm-roberta-base and fine-tuned on the multilingual, manually annotated X-GENRE dataset. It performs automatic genre identification across 94 languages and is well suited to enriching large text collections with genre labels.

Architecture

The model uses the XLM-RoBERTa architecture, developed by Facebook AI: a transformer-based model pre-trained on a large multilingual corpus. The X-GENRE classifier adds a classification layer on top of this pre-trained encoder and was fine-tuned specifically for genre detection on a manually annotated dataset.

Training

The model was fine-tuned on the X-GENRE dataset with the aim of performing well in both in-dataset and cross-dataset scenarios. Compared with GPT-4, GPT-3.5-turbo, SVM, logistic regression, and other baselines, it showed superior results, particularly in out-of-dataset scenarios.

Guide: Running Locally

To run the model locally using PyTorch and the simpletransformers library, follow these steps:

  1. Install Dependencies: Ensure Python and PyTorch are installed. Use pip to install simpletransformers.

    pip install simpletransformers
    
  2. Load the Model: Use the following Python code to initialize and run predictions.

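    import torch
    from simpletransformers.classification import ClassificationModel

    # num_train_epochs and learning_rate only matter for fine-tuning;
    # max_seq_length and silent also apply at prediction time.
    model_args = {
        "num_train_epochs": 15,
        "learning_rate": 1e-5,
        "max_seq_length": 512,
        "silent": True
    }

    model = ClassificationModel(
        "xlmroberta",
        "classla/xlm-roberta-base-multilingual-text-genre-classifier",
        use_cuda=torch.cuda.is_available(),  # fall back to CPU if no GPU is present
        args=model_args,
    )

    # predict() returns the predicted class indices and the raw model outputs
    predictions, raw_outputs = model.predict(["Your text here", "Another text here"])

    # Map class indices to the genre names stored in the model configuration
    labels = [model.config.id2label[i] for i in predictions]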
    
  3. GPU Recommendation: For better performance, especially with large datasets, consider using cloud GPUs such as those available on AWS, Google Cloud, or Azure.
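
The prediction call in step 2 returns class indices together with raw model outputs. When labeling a large collection, it can help to attach a confidence score to each label so that low-confidence predictions can be reviewed or filtered. The following is a minimal sketch, assuming the raw outputs are per-class logits (one row per input text). The helper names softmax and label_predictions and the three-entry EXAMPLE_ID2LABEL mapping are hypothetical placeholders for illustration, not the model's actual label set; in practice the mapping would be read from the loaded model's configuration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_predictions(raw_outputs, id2label):
    """Return a (label, confidence) pair for each row of logits."""
    results = []
    for logits in raw_outputs:
        probs = softmax([float(x) for x in logits])
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((id2label[best], probs[best]))
    return results


# Hypothetical mapping and logits, used only to make the sketch runnable.
EXAMPLE_ID2LABEL = {0: "News", 1: "Forum", 2: "Legal"}
example_outputs = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]

labeled = label_predictions(example_outputs, EXAMPLE_ID2LABEL)
for label, confidence in labeled:
    print(f"{label}: {confidence:.2f}")
```

With a real model, the second value returned by the prediction call would take the place of example_outputs. The softmax score is only a rough indicator of certainty, so any filtering threshold should be checked against a manually inspected sample.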

License

The XLM-RoBERTa-Base Multilingual Text Genre Classifier is licensed under the CC-BY-SA-4.0 license. This license allows for adaptation and sharing under similar terms, with appropriate credit given to the original authors.
