mT5_m2o_chinese_simplified_crossSum

Maintained by csebuetnlp

Introduction

The mT5_m2o_chinese_simplified_crossSum model is a many-to-one (m2o) multilingual T5 (mT5) checkpoint finetuned on cross-lingual pairs from the CrossSum dataset. It summarizes text written in any of the supported source languages into Simplified Chinese.

Architecture

The model is based on the mT5 architecture, which is a multilingual variant of the T5 model designed for text-to-text tasks. It can handle text inputs in 43 different languages, summarizing them into Simplified Chinese.
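
The key architectural details can be read directly from the checkpoint's configuration. A minimal sketch, assuming the transformers library is installed (only the configuration file is downloaded, not the model weights):

    from transformers import AutoConfig

    # Inspect the configuration published with the checkpoint
    config = AutoConfig.from_pretrained("csebuetnlp/mT5_m2o_chinese_simplified_crossSum")

    print(config.model_type)  # expected to report "mt5"
    print(config.vocab_size)  # shared multilingual SentencePiece vocabulary size
    print(config.d_model)     # hidden size of the encoder/decoder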

Training

The model was finetuned using the CrossSum dataset, which includes cross-lingual pairs with target summaries in Simplified Chinese. Detailed training scripts and methodologies can be found in the associated research paper and the official repository linked in the documentation.
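
To get a feel for the article-summary pairs used during finetuning, the CrossSum dataset can be inspected with the datasets library. A minimal sketch, assuming the dataset is hosted on the Hub as csebuetnlp/CrossSum with per-pair configurations such as english-chinese_simplified and XL-Sum-style text/summary fields; these names are assumptions and should be checked against the official repository:

    from datasets import load_dataset

    # Assumed repository id and pair-configuration name; verify against the official repo
    pair = load_dataset("csebuetnlp/CrossSum", "english-chinese_simplified", split="train")

    sample = pair[0]
    print(sample["text"])     # source-language article (field name assumed)
    print(sample["summary"])  # Simplified Chinese reference summary (field name assumed)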

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install the Transformers Library: Ensure you have the transformers library installed; the example code was tested with version 4.11.0.dev0.

    pip install transformers
    
  2. Import Required Modules:

    import re
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    
  3. Define a Whitespace Handler:

    # Collapse newlines and runs of whitespace into single spaces before tokenization
    WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))
    
  4. Prepare Text for Summarization:

    article_text = """Your text here"""
    
  5. Load the Model and Tokenizer:

    model_name = "csebuetnlp/mT5_m2o_chinese_simplified_crossSum"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    
  6. Tokenize and Generate Summary:

    # Tokenize the cleaned article, truncating/padding it to 512 tokens
    input_ids = tokenizer(
        [WHITESPACE_HANDLER(article_text)],
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=512
    )["input_ids"]
    
    # Generate the summary with beam search (4 beams), blocking repeated bigrams
    output_ids = model.generate(
        input_ids=input_ids,
        max_length=84,
        no_repeat_ngram_size=2,
        num_beams=4
    )[0]
    
    summary = tokenizer.decode(
        output_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )
    
    print(summary)
    

For faster inference, consider running the model on a GPU, for example a cloud GPU instance from AWS, GCP, or Azure.
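
The steps above can be combined into a single helper and moved onto a GPU when one is available. A minimal sketch, assuming PyTorch is installed; the function name summarize_to_chinese is illustrative and not part of the original model card:

    import re
    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Collapse newlines and whitespace runs, as in the steps above
    WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))

    model_name = "csebuetnlp/mT5_m2o_chinese_simplified_crossSum"
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Use a GPU if available, otherwise fall back to CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

    def summarize_to_chinese(text):
        """Summarize text in any supported source language into Simplified Chinese."""
        input_ids = tokenizer(
            [WHITESPACE_HANDLER(text)],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=512
        )["input_ids"].to(device)

        output_ids = model.generate(
            input_ids=input_ids,
            max_length=84,
            no_repeat_ngram_size=2,
            num_beams=4
        )[0]

        return tokenizer.decode(
            output_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )

    print(summarize_to_chinese("Your text here"))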

License

This model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (cc-by-nc-sa-4.0).
