multilingual-toxic-xlm-roberta
Introduction
The multilingual-toxic-xlm-roberta model by Unitary is designed for text classification, specifically toxic comment classification across multiple languages. Built with PyTorch and the Transformers library, it targets three related challenges: detecting different types of toxicity, limiting unintended bias, and classifying toxic comments written in many languages.
Architecture
The model is based on the XLM-RoBERTa architecture, a transformer pretrained on text in roughly 100 languages, which makes it well suited to multilingual tasks. The pretrained encoder is fine-tuned to identify and classify toxic content in text, covering the toxic comment classification challenges described below.
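As a sketch of how the model can be used directly, assuming the checkpoint is published on the Hugging Face Hub as unitary/multilingual-toxic-xlm-roberta and exposes a standard sequence-classification head, it can be loaded with the Transformers library; the sigmoid reflects the multi-label framing of toxicity, where labels are scored independently rather than with a softmax:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Hub ID of the published checkpoint (assumed here).
    MODEL_ID = "unitary/multilingual-toxic-xlm-roberta"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

    # A single tokenizer and encoder cover all supported languages.
    inputs = tokenizer("un exemple de commentaire", return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits

    # Toxicity labels are independent, so each logit gets its own sigmoid.
    scores = torch.sigmoid(logits)
    print(scores)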
Training
Training involves datasets from three Jigsaw challenges:
- Toxic Comment Classification (2018): Aimed at detecting different types of toxicity.
- Unintended Bias in Toxicity Classification (2019): Focused on minimizing unintended bias toward comments that mention identities.
- Multilingual Toxic Comment Classification (2020): Targets multilingual toxic comments using combined datasets.
Training requires the Kaggle API for dataset downloads and follows a multi-stage process, especially for multilingual tasks.
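As an illustration of the download step (the repository documents its own procedure), the three datasets can be fetched with the official Kaggle Python API, assuming valid credentials in ~/.kaggle/kaggle.json; the slugs below are Kaggle's public competition IDs:

    import kaggle

    # Requires Kaggle API credentials in ~/.kaggle/kaggle.json.
    kaggle.api.authenticate()

    # Public Kaggle competition slugs for the three Jigsaw challenges.
    competitions = [
        "jigsaw-toxic-comment-classification-challenge",      # 2018
        "jigsaw-unintended-bias-in-toxicity-classification",  # 2019
        "jigsaw-multilingual-toxic-comment-classification",   # 2020
    ]
    for slug in competitions:
        kaggle.api.competition_download_files(slug, path="jigsaw_data")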
Guide: Running Locally
- Clone the Project:
  git clone https://github.com/unitaryai/detoxify
  cd detoxify
- Set Up Environment:
  python3 -m venv toxic-env
  source toxic-env/bin/activate
- Install Dependencies (run from inside the cloned directory):
  pip install -e .
  pip install -r requirements.txt
- Run Predictions (a library-level Python equivalent is sketched after this list):
  python run_prediction.py --input 'example' --model_name original
- Training (Optional): Download the datasets using the Kaggle API and run the training scripts with the desired configuration.
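Besides the CLI above, predictions can be made directly from Python via the detoxify package; a minimal sketch following the project README, where 'multilingual' selects the XLM-RoBERTa checkpoint:

    from detoxify import Detoxify

    # 'multilingual' loads the XLM-RoBERTa model; 'original' and 'unbiased'
    # correspond to the 2018 and 2019 challenge models.
    model = Detoxify('multilingual')

    # predict() accepts a string or a list of strings and returns a dict
    # mapping each toxicity label to its score(s).
    results = model.predict(['example text', 'ejemplo de texto'])
    print(results)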
Cloud GPUs: To expedite training and inference, consider using cloud services like AWS, GCP, or Azure for GPU resources.
License
The model is licensed under the Apache 2.0 License, a permissive license that allows broad use, modification, and redistribution, provided the license text and attribution notices are retained.