mbart_ru_sum_gazeta

IlyaGusev

Introduction

mbart_ru_sum_gazeta is a model for summarizing Russian news articles, particularly those from Gazeta.ru. It is built on the mBART architecture and performs text-to-text generation: it takes an article as input and generates a summary.

Architecture

The model is a port of a fairseq model trained for summarization. It uses mBART, a multilingual sequence-to-sequence (encoder-decoder) architecture.

Training

Training Data

  • Dataset: The model was trained on the Gazeta dataset, a collection of Russian news articles.

Training Procedure

  • Script: The model was trained with a fairseq training script available in the repository.
  • Porting: The fairseq model was ported using a Colab notebook, which makes it available for further experimentation and deployment.

Evaluation

The model was evaluated with ROUGE (R-1-f, R-2-f, and R-L-f, the F-scores for unigram, bigram, and longest-common-subsequence overlap), chrF, METEOR, and BLEU. The evaluation showed competitive performance, particularly on Gazeta.ru articles.
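To make the R-1-f metric concrete, here is a minimal sketch of a unigram-overlap F-score. It is an illustration only, not the evaluation script used for this model: the function name `rouge_1_f`, the lowercasing, and the whitespace tokenization are assumptions, and real ROUGE implementations typically apply their own tokenization and normalization.

```python
from collections import Counter


def rouge_1_f(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F-score over whitespace tokens (illustrative only)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and generated summaries.
reference = "суд признал решение законным"
candidate = "суд признал решение"
score = rouge_1_f(reference, candidate)
print(round(score, 3))
```

Here the candidate shares three of the reference's four tokens, so precision is 1.0, recall is 0.75, and the F-score is their harmonic mean.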

Guide: Running Locally

  1. Setup:

    • Install the transformers library from Hugging Face.
    • Install PyTorch if it is not already installed.
  2. Code Example:

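    from transformers import MBartTokenizer, MBartForConditionalGeneration
    
    # Load the tokenizer and model weights from the Hugging Face Hub.
    model_name = "IlyaGusev/mbart_ru_sum_gazeta"
    tokenizer = MBartTokenizer.from_pretrained(model_name)
    model = MBartForConditionalGeneration.from_pretrained(model_name)
    
    # Replace with the text of the article to summarize.
    article_text = "..."
    
    # Tokenize the article, truncating or padding it to 600 tokens.
    input_ids = tokenizer(
        [article_text],
        max_length=600,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )["input_ids"]
    
    # Generate the summary; no_repeat_ngram_size=4 prevents any 4-gram from repeating.
    output_ids = model.generate(
        input_ids=input_ids,
        no_repeat_ngram_size=4
    )[0]
    
    # Convert the generated token IDs back to text, dropping special tokens.
    summary = tokenizer.decode(output_ids, skip_special_tokens=True)
    print(summary)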
    
  3. Hardware Recommendation:

    • For faster inference, a GPU is recommended. Cloud GPU instances from AWS, GCP, or Azure are one option.
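As a rough sketch of how a GPU could be used with the code example above, the helper below picks a device string and falls back to the CPU when PyTorch is missing or no GPU is visible. The helper name `pick_device` is hypothetical and not part of the model's repository; the commented lines assume the `model` and `input_ids` objects from the code example.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)

# With the objects from the code example, move both to the same device
# before calling generate:
# model = model.to(device)
# input_ids = input_ids.to(device)
```

The model and its inputs must be on the same device, otherwise generation fails with a device mismatch error.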

License

The mbart_ru_sum_gazeta model is released under the Apache-2.0 license, which permits both personal and commercial use.
