Erlangshen-Longformer-110M
IDEA-CCNL

Introduction
Erlangshen-Longformer-110M is a Chinese language model based on Longformer-base and designed for handling long text sequences. It uses rotary position embedding (RoPE), has 110 million parameters, and is specifically tailored for processing Chinese text.
Architecture
The model follows the Longformer-base architecture and is built on top of the chinese_roformer_L-12_H-768_A-12 checkpoint. It uses rotary position embedding (RoPE) to handle the varying sequence lengths in the pre-training corpus, and it was continually pre-trained on the 180 GB WuDao corpus.
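To illustrate the idea behind rotary position embedding, here is a minimal sketch (not the implementation used in fengshen; the base frequency of 10000 and the pairwise rotation scheme follow the common RoPE formulation): each pair of query/key dimensions is rotated by an angle proportional to the token position, so relative offsets are encoded directly in the attention dot products.

    import torch

    def apply_rope(x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim) query or key vectors, with dim even.
        seq_len, dim = x.shape
        # Common RoPE frequencies: theta_i = 10000^(-2i/dim).
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        pos = torch.arange(seq_len).float()
        angles = torch.outer(pos, inv_freq)       # (seq_len, dim/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[:, 0::2], x[:, 1::2]           # split dimensions into pairs
        # Rotate each pair; concatenating (rather than re-interleaving) is fine
        # as long as the same layout is applied to both queries and keys.
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    # Example: rotate random "query" vectors for a sequence of 8 tokens.
    q = torch.randn(8, 64)
    print(apply_rope(q).shape)  # torch.Size([8, 64])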
Training
The model was trained on the WuDao corpus with a focus on Chinese natural language understanding tasks. RoPE was employed in particular to improve the model's efficiency on long text sequences and to mitigate issues arising from variable-length inputs.
Guide: Running Locally
To run the Erlangshen-Longformer-110M model locally, follow these steps:
- Clone the repository:

      git clone https://github.com/IDEA-CCNL/Fengshenbang-LM.git

- Install dependencies: make sure the transformers library is installed, and install PyTorch if it is not already available.

- Load the model (a short usage sketch follows this list):

      from fengshen import LongformerModel
      from fengshen import LongformerConfig
      from transformers import BertTokenizer

      tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-Longformer-110M")
      config = LongformerConfig.from_pretrained("IDEA-CCNL/Erlangshen-Longformer-110M")
      model = LongformerModel.from_pretrained("IDEA-CCNL/Erlangshen-Longformer-110M")

- Optional: use a cloud GPU service such as AWS, Google Cloud, or Azure for better performance, especially for large-scale data processing.
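After loading, a minimal forward pass might look like the sketch below. Note that the fengshen package comes from the cloned Fengshenbang-LM repository, so the code should be run where that package is importable. The example sentence and max_length here are illustrative assumptions, and the output handling assumes a Hugging Face-style return object.

    import torch

    text = "模型可以处理很长的中文文本。"  # illustrative input (assumption)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)

    # Final-layer hidden states, shape (batch, seq_len, hidden_size); the attribute
    # name assumes a Hugging Face-style output, so fall back to indexing if needed.
    hidden = outputs.last_hidden_state if hasattr(outputs, "last_hidden_state") else outputs[0]
    print(hidden.shape)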
License
The Erlangshen-Longformer-110M model is licensed under the Apache-2.0 License, allowing for both personal and commercial use with appropriate attribution.