mT5_multilingual_XLSum

Developed by csebuetnlp

Introduction

The mT5_multilingual_XLSum model is a multilingual summarization model based on the mT5 architecture. It has been fine-tuned on the XL-Sum dataset, which covers 45 languages, to perform abstractive text summarization.

Architecture

The model is a variant of mT5, the multilingual version of Google's T5 (Text-to-Text Transfer Transformer). mT5 casts every task as a text-to-text transformation across multiple languages, making it well suited to tasks like summarization in diverse linguistic contexts.

Training

The model was fine-tuned on the XL-Sum dataset. Detailed training scripts and methodology are documented in the associated research paper and the official GitHub repository. Model performance is evaluated with standard metrics such as ROUGE scores, reported per language.
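To make the evaluation metric concrete, here is a deliberately simplified ROUGE-1 F1 sketch in plain Python. It only counts unigram overlap after naive whitespace tokenization; the actual XL-Sum evaluation uses the official ROUGE tooling with language-aware tokenization, so treat this as an illustration of the idea, not the reported metric.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between reference and candidate.
    Illustration only; official evaluations use proper ROUGE packages."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat on the mat", "the cat is on the mat"), 3))
```

A higher score means more word overlap with the reference summary; ROUGE-2 and ROUGE-L, also commonly reported, extend the same idea to bigrams and longest common subsequences.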

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Transformers Library: Ensure you have the transformers library installed.

pip install transformers sentencepiece
    
  2. Load the Model and Tokenizer: Use the transformers library to load the model and tokenizer.

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    
    model_name = "csebuetnlp/mT5_multilingual_XLSum"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    
  3. Prepare Input Text: Handle whitespace and prepare your input text.

    import re
    
    # Collapse newlines and runs of whitespace into single spaces.
    WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))
    article_text = "Your text here..."
    
  4. Generate Summary: Tokenize the input and generate a summary.

    input_ids = tokenizer(
        [WHITESPACE_HANDLER(article_text)],
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=512
    )["input_ids"]
    
    output_ids = model.generate(
        input_ids=input_ids,
        max_length=84,
        no_repeat_ngram_size=2,
        num_beams=4
    )[0]
    
    summary = tokenizer.decode(
        output_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )
    
    print(summary)
    
  5. Cloud GPUs: For faster inference, especially on large batches of documents, consider cloud GPU platforms such as AWS EC2, Google Cloud Platform, or Azure.
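Before feeding long documents to the model, it can be worth sanity-checking the whitespace handler from step 3 on its own. This standalone snippet uses the same function and needs no model download; the sample string is an arbitrary example.

```python
import re

# Same whitespace handler as in the guide: strip the ends, replace newline
# runs with spaces, then collapse any remaining whitespace runs.
WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))

raw = "  First line.\n\nSecond   line.\tThird line.  "
print(WHITESPACE_HANDLER(raw))  # -> "First line. Second line. Third line."
```

Normalizing whitespace this way keeps the tokenizer from spending part of the 512-token input budget on newline and spacing artifacts.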

License

The model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). This license allows sharing and adapting the model for non-commercial purposes, provided appropriate credit is given and adaptations are shared under similar terms.
