AraT5-base-title-generation

UBC-NLP

Introduction

The AraT5-base-title-generation model is part of a suite of Arabic-specific text-to-text Transformer models developed for enhanced Arabic language understanding and generation. These models are designed to tackle various Arabic linguistic tasks, such as machine translation, text summarization, title and question generation, paraphrasing, and transliteration.

Architecture

AraT5 models use the T5 (Text-to-Text Transfer Transformer) framework, which casts every task as mapping an input text to an output text. The models specifically target Modern Standard Arabic (MSA), Arabic dialects, and social media content such as tweets.
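
For illustration, every task in this setup reduces to mapping one string to another. The task prefixes below are hypothetical placeholders, not the exact strings AraT5 was trained with:

    # Hypothetical text-to-text pairs: every task maps an input string to an
    # output string (prefixes are illustrative, not AraT5's actual format)
    examples = [
        ("translate English to Arabic: How are you?", "كيف حالك؟"),
        ("summarize: <Arabic news article>", "<one-sentence Arabic summary>"),
        ("generate title: <Arabic document body>", "<short Arabic title>"),
    ]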

Training

AraT5 models were pre-trained on large Arabic corpora to build robust representations for a range of linguistic tasks. They outperformed the multilingual T5 (mT5) on a wide range of Arabic language tasks, despite being trained on significantly less data.

Guide: Running Locally

To run the AraT5-base-title-generation model locally, follow these steps (a complete script combining them appears after the list):

  1. Install Dependencies: Ensure that Python, PyTorch, and the Hugging Face Transformers library are installed (for example: pip install transformers torch).
  2. Load the Model:
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    
    tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-base-title-generation")
    model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5-base-title-generation")
    
  3. Prepare Input: Tokenize your document and return PyTorch tensors.
    # encode_plus and pad_to_max_length are deprecated; calling the tokenizer
    # directly with truncation handles long documents safely
    encoding = tokenizer("Your Arabic text here", truncation=True, return_tensors="pt")
    input_ids, attention_masks = encoding["input_ids"], encoding["attention_mask"]
    
  4. Generate Output: Sample several candidate titles.
    # top-k/top-p sampling produces diverse candidates; early_stopping is
    # omitted because it only applies to beam search
    outputs = model.generate(
        input_ids=input_ids, attention_mask=attention_masks,
        max_length=256, do_sample=True, top_k=120, top_p=0.95,
        num_return_sequences=5
    )
    
  5. Decode and Display Titles:
    # use i rather than id, which shadows a Python built-in
    for i, output in enumerate(outputs):
        title = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        print("title#" + str(i), title)
    
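Putting the steps together, a minimal end-to-end script looks like this (the placeholder string stands in for your own Arabic document):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_name = "UBC-NLP/AraT5-base-title-generation"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # placeholder document; substitute real Arabic text
    encoding = tokenizer("Your Arabic text here", truncation=True, return_tensors="pt")

    outputs = model.generate(
        input_ids=encoding["input_ids"],
        attention_mask=encoding["attention_mask"],
        max_length=256, do_sample=True, top_k=120, top_p=0.95,
        num_return_sequences=5,
    )

    for i, output in enumerate(outputs):
        print("title#" + str(i),
              tokenizer.decode(output, skip_special_tokens=True,
                               clean_up_tokenization_spaces=True))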

For optimal performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure, especially for large-scale inference.
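
On a machine with an NVIDIA GPU, moving the model and the encoded inputs onto the device before generating is usually all that is needed. A minimal sketch, assuming the tokenizer and model from the guide above are already loaded:

    import torch

    # pick the GPU when available, otherwise fall back to the CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    encoding = tokenizer("Your Arabic text here", truncation=True, return_tensors="pt").to(device)
    outputs = model.generate(**encoding, max_length=256, do_sample=True,
                             top_k=120, top_p=0.95, num_return_sequences=5)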

License

The AraT5 models are available on the Hugging Face platform under a research-friendly license that encourages academic and non-commercial use. For commercial use, contact the authors for explicit permission.
