twitter_sexismo-finetuned-exist2021-metwo

somosnlp-hackathon-2022

Introduction

The twitter_sexismo-finetuned-exist2021-metwo model is a text classifier that detects sexism in Spanish tweets. It is a fine-tuned version of the pysentimiento/robertuito-hate-speech model, developed by a team of contributors during the 'Somos NLP' Hackathon. The model achieves an accuracy of 0.83 on the evaluation data.

Architecture

The model is based on the RoBERTa architecture and is fine-tuned to detect sexist content in tweets. It is compatible with PyTorch and can be deployed through Hugging Face's inference endpoints.
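Because the model is PyTorch-compatible, it can also be loaded without the pipeline wrapper. The following is a minimal sketch, assuming the checkpoint name given in the local-running guide and a standard sequence-classification head; the helper names `softmax` and `predict` are illustrative, and the label names come from whatever the checkpoint's own configuration defines.

```python
import math
from typing import Dict, List

# Checkpoint name as given in the "Running Locally" guide.
CHECKPOINT = "hackathon-pln-es/twitter_sexismo-finetuned-exist2021-metwo"


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits into probabilities (numerically stable)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(text: str) -> Dict[str, float]:
    """Hypothetical helper: return a probability per label for one tweet.

    Downloads the model on first use, so it needs network access.
    """
    # Imported lazily so that `softmax` stays usable without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()

    probabilities = softmax(logits)
    # Label names are read from the checkpoint config, not assumed here.
    return {model.config.id2label[i]: p for i, p in enumerate(probabilities)}
```

Calling `predict("mujer al volante peligro!")` would return one probability per label; for repeated calls, loading the tokenizer and model once outside the function avoids reloading them each time.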

Training

Training Data

The model was trained on the EXIST dataset and the MeTwo Machismo and Sexism Twitter Identification dataset. These datasets focus on identifying sexist expressions or related phenomena in Spanish-language tweets.

Training Procedure

The training process utilized the following hyperparameters:

  • Learning rate: 5e-5
  • Adam epsilon: 1e-8
  • Number of epochs: 8
  • Warmup steps: 3
  • Mini-batch size: 32
  • Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-8
  • Learning rate scheduler type: linear

Training reached a loss of 0.54 and an accuracy of 0.83. The framework versions used were Transformers 4.17.0, PyTorch 1.10.0+cu111, and Tokenizers 0.11.6.
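To make the learning rate and warmup settings concrete, here is a small sketch of how the rate would evolve per step, assuming the "linear" scheduler follows the usual pattern of linear warmup followed by linear decay to zero. The total number of training steps is not stated in this document, so `total_steps` is a hypothetical parameter the caller must supply.

```python
# Values taken from the hyperparameter list above.
LEARNING_RATE = 5e-5
WARMUP_STEPS = 3


def linear_lr(step: int, total_steps: int,
              base_lr: float = LEARNING_RATE,
              warmup_steps: int = WARMUP_STEPS) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero.

    `total_steps` is an assumption supplied by the caller; it depends on the
    dataset size, which is not given here.
    """
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr at the end of warmup down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

With only 3 warmup steps, the rate reaches its peak almost immediately, so under this assumption nearly the whole run is spent in the decay phase.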

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install the required libraries:

    pip install transformers torch
    
  2. Load the model using the Transformers library:

    from transformers import pipeline
    
    model_checkpoint = "hackathon-pln-es/twitter_sexismo-finetuned-exist2021-metwo" 
    pipeline_nlp = pipeline("text-classification", model=model_checkpoint)
    
    # Example usage
    print(pipeline_nlp("mujer al volante peligro!"))
    
  3. For large datasets or faster inference, consider running the model on a cloud GPU, such as those offered by Google Cloud, AWS, or Azure.
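When classifying many tweets, passing a list to the pipeline is usually more convenient than calling it once per text. The sketch below assumes the pipeline returns one dictionary with `label` and `score` keys per input; `load_classifier` and `classify_tweets` are illustrative helper names, not part of the model's own API.

```python
from typing import Callable, Dict, List

MODEL_CHECKPOINT = "hackathon-pln-es/twitter_sexismo-finetuned-exist2021-metwo"


def load_classifier():
    """Build the text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so classify_tweets can be used with any compatible callable.
    from transformers import pipeline
    return pipeline("text-classification", model=MODEL_CHECKPOINT)


def classify_tweets(classifier: Callable[[List[str]], List[Dict]],
                    tweets: List[str]) -> List[Dict]:
    """Run the classifier on a list of tweets and pair each tweet with its prediction."""
    predictions = classifier(tweets)
    return [
        {"text": tweet, "label": pred["label"], "score": pred["score"]}
        for tweet, pred in zip(tweets, predictions)
    ]
```

A typical call would be `classify_tweets(load_classifier(), ["mujer al volante peligro!"])`. The label strings in the result are whatever the checkpoint defines, so inspect them before mapping them to "sexist" or "not sexist" in downstream code.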

License

The model and associated resources are released under the Apache 2.0 license, allowing for both commercial and non-commercial use, with attribution.
