Multilingual Sarcasm Detector (helinivan)
Introduction
The Multilingual Sarcasm Detector is a text classification model designed to detect sarcasm in news article titles. It is fine-tuned from bert-base-multilingual-uncased on datasets drawn from Kaggle and from various newspapers in English, Dutch, and Italian. The labels are 0 for "Not Sarcastic" and 1 for "Sarcastic."
Architecture
The model is based on the bert-base-multilingual-uncased architecture, leveraging its capacity to handle multiple languages for the task of sarcasm detection. This architecture allows the model to process inputs in English, Dutch, and Italian effectively.
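As an illustration of the shared multilingual vocabulary, the same tokenizer handles all three target languages. This is a sketch only; the Dutch and Italian headlines below are invented placeholders, not examples from the training data.

```python
from transformers import AutoTokenizer

# The multilingual WordPiece vocabulary covers English, Dutch, and Italian.
tokenizer = AutoTokenizer.from_pretrained("helinivan/multilingual-sarcasm-detector")

for headline in [
    "cia realizes its been using black highlighters all these years",  # English
    "man blijft rustig tijdens urenlange file",                        # Dutch (placeholder)
    "scienziato scopre l'acqua calda",                                 # Italian (placeholder)
]:
    print(tokenizer.tokenize(headline))
```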
Training
The training data is sourced from various datasets:
- English: Kaggle's News Headlines Dataset For Sarcasm Detection.
- Dutch: Non-sarcastic data from Kaggle's Dutch News Articles and sarcastic news from De Speld.
- Italian: Non-sarcastic news from Il Giornale and sarcastic news from Lercio.
The consolidated dataset for multilingual sarcasm detection can be found at helinivan/sarcasm_headlines_multilingual.
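To inspect that consolidated dataset, it can presumably be loaded with the datasets library. This is a sketch: the split name and column layout are assumptions, not details stated in this card.

```python
from datasets import load_dataset

# Assumes the dataset is hosted on the Hugging Face Hub under the path above.
ds = load_dataset("helinivan/sarcasm_headlines_multilingual")
print(ds)              # available splits and columns
print(ds["train"][0])  # peek at one example (split name assumed)
```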
Guide: Running Locally
To run the model locally, follow these steps:
- Install the Transformers library (PyTorch is also required for the tensors used below):

```bash
pip install transformers torch
```
- Load the model and tokenizer:

```python
import string

from transformers import AutoModelForSequenceClassification, AutoTokenizer


def preprocess_data(text: str) -> str:
    # Lowercase and strip punctuation so inputs match the uncased checkpoint.
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()


MODEL_PATH = "helinivan/multilingual-sarcasm-detector"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
```
- Prepare and classify your text:

```python
text = "CIA Realizes It's Been Using Black Highlighters All These Years."

tokenized_text = tokenizer(
    [preprocess_data(text)],
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors="pt",
)
output = model(**tokenized_text)

# Softmax over the two logits gives [P(not sarcastic), P(sarcastic)].
probs = output.logits.softmax(dim=-1).tolist()[0]
confidence = max(probs)
prediction = probs.index(confidence)
results = {"is_sarcastic": prediction, "confidence": confidence}
```
- Output the results:

```python
print(results)
```
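The printed dictionary has the form {"is_sarcastic": 0 or 1, "confidence": float}, following the label convention described in the introduction.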
For large-scale inference, consider using cloud GPU instances, such as AWS EC2, Google Cloud Platform, or Azure, to speed up processing.
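A rough sketch of batched GPU inference, reusing the tokenizer, model, and preprocess_data objects from the steps above; the device handling and batch size are illustrative choices, not part of the original guide.

```python
import torch

headlines = ["example headline one", "example headline two"]  # placeholder input list

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

texts = [preprocess_data(t) for t in headlines]
predictions = []
with torch.no_grad():
    for i in range(0, len(texts), 64):  # batch size of 64 is an arbitrary choice
        batch = tokenizer(
            texts[i : i + 64],
            padding=True,
            truncation=True,
            max_length=256,
            return_tensors="pt",
        ).to(device)
        probs = model(**batch).logits.softmax(dim=-1)
        predictions.extend(probs.argmax(dim=-1).tolist())
```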
License
The model and its codebase are available in the official repository here. Please refer to the repository for specific licensing terms.