mbart_ru_sum_gazeta
Introduction
mbart_ru_sum_gazeta is a model tailored for summarizing Russian news articles, particularly those from Gazeta.ru. It is built on the mBART architecture and performs text-to-text generation.
Architecture
The model is a ported version of a fairseq model specifically designed for summarization tasks. It utilizes the mBART architecture, which is well-suited for multilingual text processing.
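To see which mBART hyperparameters the checkpoint ships with, its configuration can be inspected directly. Below is a minimal sketch, assuming only that the checkpoint exposes the standard MBartConfig fields:

```python
# Inspect the checkpoint's configuration; encoder_layers, decoder_layers,
# and d_model are standard MBartConfig fields.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("IlyaGusev/mbart_ru_sum_gazeta")
print(config.model_type)      # expected: "mbart"
print(config.encoder_layers)  # number of encoder layers
print(config.decoder_layers)  # number of decoder layers
print(config.d_model)         # hidden size
```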
Training
Training Data
- Dataset: The model was trained on the Gazeta dataset, a collection of Russian news articles (see the loading sketch below).
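For a quick look at the training data, the Gazeta dataset can be loaded with the `datasets` library. This is a sketch only: the Hub dataset ID `IlyaGusev/gazeta` and the `text`/`summary` field names are assumptions, not stated above.

```python
# Load and inspect the Gazeta dataset; the dataset ID and the
# "text"/"summary" field names are assumptions, not confirmed above.
from datasets import load_dataset

dataset = load_dataset("IlyaGusev/gazeta", split="train")
print(dataset)                # row count and column names
example = dataset[0]
print(example["text"][:200])  # start of a source article
print(example["summary"])     # its reference summary
```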
Training Procedure
- Script: Training used the fairseq training script available in the original repository.
- Porting: The fairseq checkpoint was ported to Transformers with a Colab notebook, making the model easy to experiment with and deploy.
Evaluation
The model was evaluated with ROUGE (R-1-f, R-2-f, R-L-f), chrF, METEOR, and BLEU, showing competitive performance, particularly on Gazeta.ru articles.
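As one way to reproduce this kind of scoring, ROUGE F-measures can be computed over generated and reference summaries with the Hugging Face `evaluate` library. This is a sketch of the scoring step only, not the original evaluation pipeline; the default English-oriented tokenization may need adjusting for Russian text.

```python
# Score model summaries against references with ROUGE F-measures
# (rouge1/rouge2/rougeL correspond to R-1-f/R-2-f/R-L-f above).
import evaluate

rouge = evaluate.load("rouge")
predictions = ["сгенерированное краткое содержание"]  # model outputs
references = ["эталонное краткое содержание"]         # gold summaries
print(rouge.compute(predictions=predictions, references=references))
```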
Guide: Running Locally
- Setup:
  - Install the `transformers` library from Hugging Face.
  - Install PyTorch if not already done.
- Code Example:
```python
from transformers import MBartTokenizer, MBartForConditionalGeneration

model_name = "IlyaGusev/mbart_ru_sum_gazeta"
tokenizer = MBartTokenizer.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

article_text = "..."  # the Russian article to summarize

# Tokenize the article, truncating to the model's 600-token input budget.
input_ids = tokenizer(
    [article_text],
    max_length=600,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)["input_ids"]

# Generate the summary, blocking repeated 4-grams.
output_ids = model.generate(
    input_ids=input_ids,
    no_repeat_ngram_size=4,
)[0]

summary = tokenizer.decode(output_ids, skip_special_tokens=True)
print(summary)
```
- Hardware Recommendation:
  - For efficient processing, cloud GPUs such as those from AWS, GCP, or Azure are recommended; see the device-placement sketch below.
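As a minimal sketch of GPU usage, the `model` and `input_ids` from the code example above can be moved onto a CUDA device before generation; the snippet assumes a CUDA-capable GPU is present and falls back to CPU otherwise.

```python
# Move the model and inputs from the earlier example onto a GPU
# (falls back to CPU if CUDA is unavailable).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
input_ids = input_ids.to(device)

output_ids = model.generate(input_ids=input_ids, no_repeat_ngram_size=4)[0]
summary = tokenizer.decode(output_ids, skip_special_tokens=True)
print(summary)
```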
License
The mbart_ru_sum_gazeta model is released under the Apache-2.0 license, allowing for both personal and commercial use.