Alibaba-NLP/gte-en-mlm-large
Introduction
The GTE-EN-MLM-LARGE model is part of the GTE-v1.5 series, designed as a generalized text encoder for embedding and reranking tasks. Developed by Alibaba's Institute for Intelligent Computing, it supports context lengths of up to 8192 tokens using a transformer++ encoder backbone. The model is described in detail in the paper "mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval."
Architecture
GTE-EN-MLM-LARGE is built on a transformer++ encoder backbone: a BERT-style encoder augmented with rotary position embeddings (RoPE) and gated linear units (GLU). It uses the vocabulary of the bert-base-uncased model and is optimized for long-sequence processing.
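As a quick check of the long-context setup, the model configuration can be inspected directly. This is a minimal sketch that assumes the Hugging Face repo id Alibaba-NLP/gte-en-mlm-large; trust_remote_code=True is passed because the backbone is a custom architecture rather than a stock BERT class.

```python
# Minimal sketch: print the model configuration (vocabulary size, maximum
# position embeddings, RoPE settings). The repo id is an assumption.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "Alibaba-NLP/gte-en-mlm-large",
    trust_remote_code=True,
)
print(config)
```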
Training
Training Data
The model was trained using masked language modeling (MLM) on the c4-en dataset.
Training Procedure
The training involved a multi-stage approach to extend the model's context length capability:
- MLM-512: Learning rate of 2e-4, MLM probability of 0.3, batch size of 4096, 300,000 steps, RoPE base of 10,000.
- MLM-2048: Learning rate of 5e-5, MLM probability of 0.3, batch size of 4096, 30,000 steps, RoPE base of 10,000.
- MLM-8192: Learning rate of 5e-5, MLM probability of 0.3, batch size of 1024, 30,000 steps, RoPE base of 160,000.
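For reference, the schedule above can be restated as a simple configuration list. The sketch below is purely illustrative: the field names are hypothetical, and the sequence lengths are inferred from the stage names rather than taken from an official training config.

```python
# Illustrative restatement of the staged MLM pretraining schedule above.
# Field names are hypothetical; sequence lengths are inferred from stage names.
mlm_stages = [
    {"stage": "MLM-512",  "seq_len": 512,  "lr": 2e-4, "mlm_prob": 0.3,
     "batch_size": 4096, "steps": 300_000, "rope_base": 10_000},
    {"stage": "MLM-2048", "seq_len": 2048, "lr": 5e-5, "mlm_prob": 0.3,
     "batch_size": 4096, "steps": 30_000, "rope_base": 10_000},
    {"stage": "MLM-8192", "seq_len": 8192, "lr": 5e-5, "mlm_prob": 0.3,
     "batch_size": 1024, "steps": 30_000, "rope_base": 160_000},
]

for stage in mlm_stages:
    print(stage)
```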
Guide: Running Locally
- Setup Environment: Install the necessary libraries, such as transformers and torch.
- Download Model: Use Hugging Face's transformers library to download the GTE-EN-MLM-LARGE model.
- Load Model: Load the model in your script for inference or fine-tuning tasks, as shown in the sketch after this list.
For optimal performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
License
The GTE-EN-MLM-LARGE model is licensed under the Apache 2.0 License, allowing for both personal and commercial use with appropriate attribution.