Erlangshen-Longformer-110M

IDEA-CCNL

Introduction

Erlangshen-Longformer-110M is a 110-million-parameter Chinese language model based on the Longformer-base architecture, designed for handling long text sequences. It incorporates rotary position embedding (RoPE) and is specifically tailored to processing Chinese text.

Architecture

The model follows the Longformer-base architecture and is built on the chinese_roformer_L-12_H-768_A-12 foundation. It uses rotary position embedding (RoPE) to address the issue of uneven sequence lengths in the pre-training corpus, and it was continually pre-trained on the 180 GB WuDao corpus.
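
As a rough illustration of the rotary position embedding idea (a generic sketch, not the model's actual implementation), the snippet below rotates each pair of query/key dimensions by a position-dependent angle, so that the resulting attention scores depend on relative rather than absolute position:

    import torch

    def apply_rope(x, base=10000.0):
        # x: (seq_len, dim) query or key vectors; dim must be even.
        seq_len, dim = x.shape
        # One rotation frequency per dimension pair, as in the RoPE formulation.
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
        # Rotation angle for every (position, pair): shape (seq_len, dim // 2).
        angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[:, 0::2], x[:, 1::2]
        # Rotate each (x1, x2) pair by its position-dependent angle.
        out = torch.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin
        out[:, 1::2] = x1 * sin + x2 * cos
        return out

    # Attention scores between rotated queries and keys encode relative position.
    q = apply_rope(torch.randn(8, 64))
    k = apply_rope(torch.randn(8, 64))
    scores = q @ k.T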

Training

The model was trained using the WuDao corpus with a focus on natural language understanding tasks in Chinese. The RoPE method was particularly employed to enhance the model's efficiency in handling long text sequences and to mitigate issues arising from variable-length inputs.

Guide: Running Locally

To run the Erlangshen-Longformer-110M model locally, follow these steps:

  1. Clone the Repository:

    git clone https://github.com/IDEA-CCNL/Fengshenbang-LM.git
    
  2. Install Dependencies: Ensure that the transformers library and PyTorch are installed. The fengshen package used below is provided by the cloned Fengshenbang-LM repository.

  3. Load the Model (a short inference sketch follows this list):

    # LongformerModel and LongformerConfig come from the fengshen package
    # shipped with the cloned Fengshenbang-LM repository.
    from fengshen import LongformerModel
    from fengshen import LongformerConfig
    from transformers import BertTokenizer

    # Load the tokenizer, configuration, and pre-trained weights from the Hub.
    tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-Longformer-110M")
    config = LongformerConfig.from_pretrained("IDEA-CCNL/Erlangshen-Longformer-110M")
    model = LongformerModel.from_pretrained("IDEA-CCNL/Erlangshen-Longformer-110M")
    
  4. Optional: Use a cloud GPU service such as AWS, Google Cloud, or Azure to speed up inference, especially for large-scale data processing.
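
Once the model is loaded, it can produce contextual representations for Chinese text. Below is a minimal inference sketch; it assumes the fengshen LongformerModel follows the standard Hugging Face forward interface (input_ids / attention_mask), and the sample sentence is purely illustrative.

    import torch

    # Tokenize a sample Chinese sentence ("The weather is nice today.").
    text = "今天天气真好。"
    inputs = tokenizer(text, return_tensors="pt")

    # Forward pass without gradient tracking; assumes the standard
    # Hugging Face call signature on the fengshen LongformerModel.
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)

    # First output: contextual token embeddings of shape (batch, seq_len, hidden_size).
    print(outputs[0].shape)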

License

The Erlangshen-Longformer-110M model is licensed under the Apache-2.0 License, allowing for both personal and commercial use with appropriate attribution.