Multilingual Sarcasm Detector

helinivan

Introduction

The Multilingual Sarcasm Detector is a text classification model designed to detect sarcasm in news article titles. It is fine-tuned from bert-base-multilingual-uncased on headline datasets drawn from Kaggle and from newspapers in English, Dutch, and Italian. The labels are 0 for "Not Sarcastic" and 1 for "Sarcastic."
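
For a quick check, the model can also be loaded through the Transformers pipeline API. This is a minimal sketch: by default the pipeline reports generic labels such as LABEL_0 and LABEL_1 (corresponding to "Not Sarcastic" and "Sarcastic") unless the model config maps them to readable names, and it does not apply the punctuation-stripping preprocessing shown in the guide below.

    from transformers import pipeline

    # Load the published checkpoint as a text-classification pipeline
    detector = pipeline("text-classification", model="helinivan/multilingual-sarcasm-detector")

    # Returns a list of {"label": ..., "score": ...} dicts, one per input
    print(detector("CIA Realizes It's Been Using Black Highlighters All These Years."))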

Architecture

The model is based on the bert-base-multilingual-uncased architecture, leveraging its capacity to handle multiple languages for the task of sarcasm detection. This architecture allows the model to process inputs in English, Dutch, and Italian effectively.
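
As an illustration, the shared multilingual WordPiece vocabulary lets a single tokenizer segment headlines in all three languages. This is a sketch; the Dutch and Italian sentences are invented examples, not taken from the training data.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("helinivan/multilingual-sarcasm-detector")

    # The same uncased tokenizer handles all three input languages
    for headline in [
        "scientists discover sarcasm gene",              # English
        "wetenschappers ontdekken het sarcasme-gen",     # Dutch (invented example)
        "gli scienziati scoprono il gene del sarcasmo",  # Italian (invented example)
    ]:
        print(tokenizer.tokenize(headline))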

Training

The training data is sourced from various datasets:

  • English: Kaggle's News Headlines Dataset For Sarcasm Detection.
  • Dutch: Non-sarcastic data from Kaggle's Dutch News Articles and sarcastic news from De Speld.
  • Italian: Non-sarcastic news from Il Giornale and sarcastic news from Lercio.

The consolidated dataset for multilingual sarcasm detection can be found at helinivan/sarcasm_headlines_multilingual.
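
To inspect or retrain on the consolidated data, it can be pulled with the datasets library. This is a sketch, assuming the dataset is hosted in a standard Hub format; split and column names may differ.

    from datasets import load_dataset

    # Download the consolidated multilingual headline dataset from the Hugging Face Hub
    dataset = load_dataset("helinivan/sarcasm_headlines_multilingual")
    print(dataset)  # inspect the available splits and columns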

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install the Transformers library and PyTorch (the example below returns PyTorch tensors):

    pip install transformers torch
    
  2. Load the model and tokenizer:

    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    import string
    import torch
    
    def preprocess_data(text: str) -> str:
        # Lowercase, strip punctuation, and trim whitespace before tokenization
        return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()
    
    MODEL_PATH = "helinivan/multilingual-sarcasm-detector"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
    
  3. Prepare and classify your text:

    text = "CIA Realizes It's Been Using Black Highlighters All These Years."
    # Tokenize the cleaned headline, padding/truncating to at most 256 tokens
    tokenized_text = tokenizer([preprocess_data(text)], padding=True, truncation=True, max_length=256, return_tensors="pt")
    # Run inference without tracking gradients
    with torch.no_grad():
        output = model(**tokenized_text)
    # Softmax over the two logits gives class probabilities; index 1 means "Sarcastic"
    probs = output.logits.softmax(dim=-1).tolist()[0]
    confidence = max(probs)
    prediction = probs.index(confidence)
    results = {"is_sarcastic": prediction, "confidence": confidence}
    
  4. Output the results:

    print(results)
    

For large-scale inference, consider cloud GPUs such as AWS EC2 GPU instances, Google Cloud Platform, or Azure to speed up processing.
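
As a rough sketch of batched GPU inference, reusing the model, tokenizer, and preprocess_data from the guide above (the device selection and sample headlines are illustrative assumptions, not part of the original guide):

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()

    headlines = ["first headline", "second headline"]  # replace with your own batch
    batch = tokenizer(
        [preprocess_data(t) for t in headlines],
        padding=True, truncation=True, max_length=256, return_tensors="pt",
    ).to(device)

    # One forward pass scores the whole batch; each row is [p_not_sarcastic, p_sarcastic]
    with torch.no_grad():
        probs = model(**batch).logits.softmax(dim=-1)
    print(probs.tolist())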

License

The model and its codebase are available in the official repository (helinivan/multilingual-sarcasm-detector on Hugging Face). Please refer to the repository for specific licensing terms.
