twitter_sexismo-finetuned-exist2021-metwo
Introduction
The twitter_sexismo-finetuned-exist2021-metwo model is a text-classification model for detecting sexism in Spanish tweets. It is a fine-tuned version of the pysentimiento/robertuito-hate-speech model. The project was developed during the Somos NLP hackathon by a team of contributors, and the model achieves an accuracy of 0.83 on the evaluation data.
Architecture
The model uses the RoBERTa architecture, specifically fine-tuned for detecting sexist content in tweets. It is compatible with PyTorch and can be deployed using Hugging Face's inference endpoints.
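For remote deployment, the snippet below is a minimal sketch of querying the hosted checkpoint through the huggingface_hub InferenceClient; the use of this client and the need for an access token are assumptions about the standard Hugging Face inference workflow, not details taken from the model card.

```python
from huggingface_hub import InferenceClient

# Assumes a Hugging Face access token is available (e.g. via the HF_TOKEN
# environment variable); the model ID matches the checkpoint used later in this guide.
client = InferenceClient(model="hackathon-pln-es/twitter_sexismo-finetuned-exist2021-metwo")

# text_classification returns the predicted labels with their scores.
result = client.text_classification("mujer al volante peligro!")
print(result)
```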
Training
Training Data
The model was trained on the EXIST dataset and the MeTwo Machismo and Sexism Twitter Identification dataset. These datasets focus on identifying sexist expressions or related phenomena in Spanish-language tweets.
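As an illustration of the kind of tweet/label pairs involved, the sketch below builds a tiny in-memory dataset with the datasets library; the example tweets and the label mapping are hypothetical stand-ins, and the real EXIST and MeTwo data must be obtained from their respective distributions.

```python
from datasets import Dataset

# Hand-written placeholder examples standing in for the EXIST / MeTwo tweets.
train_data = Dataset.from_dict({
    "text": [
        "mujer al volante peligro!",            # sexist example
        "hoy hace un día precioso en Madrid",   # non-sexist example
    ],
    "label": [1, 0],  # 1 = sexist, 0 = non-sexist (this mapping is an assumption)
})

print(train_data[0])
```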
Training Procedure
The training process utilized the following hyperparameters:
- Learning rate: 5e-5
- Adam epsilon: 1e-8
- Number of epochs: 8
- Warmup steps: 3
- Mini-batch size: 32
- Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-8
- Learning rate scheduler: linear
The training reached a final loss of 0.54 and an accuracy of 0.83. The framework versions used were Transformers 4.17.0, PyTorch 1.10.0+cu111, and Tokenizers 0.11.6.
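For readers who want to reproduce a comparable setup, the following is a minimal sketch of how the listed hyperparameters map onto Hugging Face TrainingArguments; the output directory, the dataset variables, and the model/tokenizer loading are assumptions for illustration, not the original training script.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Base model named in this card; the binary head size may differ from the
# checkpoint's, hence ignore_mismatched_sizes=True.
model_checkpoint = "pysentimiento/robertuito-hate-speech"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=2, ignore_mismatched_sizes=True
)

# Hyperparameters mirrored from the list above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="twitter_sexismo-finetuned",
    learning_rate=5e-5,
    adam_epsilon=1e-8,
    adam_beta1=0.9,
    adam_beta2=0.999,
    num_train_epochs=8,
    warmup_steps=3,
    per_device_train_batch_size=32,
    lr_scheduler_type="linear",
)

# train_dataset / eval_dataset are assumed to be pre-tokenized datasets built
# from the EXIST and MeTwo tweets; they are not shown here.
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```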
Guide: Running Locally
To run the model locally, follow these steps:
- Install the required libraries:

```bash
pip install transformers
```
- Load the model with the Transformers pipeline and classify a tweet:

```python
from transformers import pipeline

model_checkpoint = "hackathon-pln-es/twitter_sexismo-finetuned-exist2021-metwo"
pipeline_nlp = pipeline("text-classification", model=model_checkpoint)

# Example usage
print(pipeline_nlp("mujer al volante peligro!"))
```
- Consider using cloud GPUs, such as those offered by Google Cloud, AWS, or Azure, for faster inference, especially when classifying large volumes of tweets; a sketch of running the pipeline on a GPU follows this list.
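As a minimal sketch of GPU usage (the device handling and batching shown here are typical Transformers usage, not part of the original guide), the pipeline can be placed on a CUDA device when one is available:

```python
import torch
from transformers import pipeline

model_checkpoint = "hackathon-pln-es/twitter_sexismo-finetuned-exist2021-metwo"

# Use the first CUDA device if available, otherwise fall back to CPU.
device = 0 if torch.cuda.is_available() else -1
pipeline_nlp = pipeline("text-classification", model=model_checkpoint, device=device)

# Batching several tweets in one call makes better use of the GPU.
tweets = ["mujer al volante peligro!", "hoy hace un día precioso"]
print(pipeline_nlp(tweets))
```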
License
The model and associated resources are released under the Apache 2.0 license, allowing for both commercial and non-commercial use, with attribution.