gte-en-mlm-large

Alibaba-NLP

Introduction

The GTE-EN-MLM-LARGE model is part of the GTE-v1.5 series, designed as a generalized text encoder for embedding and reranking tasks. Developed by Alibaba's Institute for Intelligent Computing, it supports context lengths of up to 8192 tokens using a transformer++ encoder backbone. The model is described in detail in the paper "mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval."

Architecture

GTE-EN-MLM-LARGE is based on a BERT-style transformer architecture augmented with rotary position embeddings (RoPE) and gated linear units (GLU). It uses the vocabulary of the bert-base-uncased model and is optimized for long-sequence processing.
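
The GLU component replaces the standard feed-forward layer with a gated variant. The PyTorch snippet below is only an illustrative sketch of such a gated feed-forward block with generic dimension names; it is not the model's exact implementation.

```python
import torch
import torch.nn as nn

class GatedFeedForward(nn.Module):
    """Illustrative GLU-style feed-forward block (a sketch, not the GTE implementation)."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size)  # gating branch
        self.up_proj = nn.Linear(hidden_size, intermediate_size)    # value branch
        self.down_proj = nn.Linear(intermediate_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The activated gate multiplies the value branch element-wise,
        # then the result is projected back to the hidden size.
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))
```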

Training

Training Data

The model was trained using masked language modeling (MLM) on the c4-en dataset.

Training Procedure

The training involved a multi-stage approach to extend the model's context length capability:

  • MLM-512: Learning rate of 2e-4, MLM probability of 0.3, batch size of 4096, 300,000 steps, RoPE base of 10,000.
  • MLM-2048: Learning rate of 5e-5, MLM probability of 0.3, batch size of 4096, 30,000 steps, RoPE base of 10,000.
  • MLM-8192: Learning rate of 5e-5, MLM probability of 0.3, batch size of 1024, 30,000 steps, RoPE base of 160,000.
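
As an illustration of the MLM objective used in every stage, the sketch below builds a masking collator with transformers' DataCollatorForLanguageModeling at the 0.3 masking probability listed above. The bert-base-uncased tokenizer (matching the vocabulary noted in the Architecture section) and the example sentence are assumptions for demonstration only.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# bert-base-uncased vocabulary, as noted in the Architecture section.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 30% of tokens are masked, matching the MLM probability used in all three stages.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.3
)

batch = collator([tokenizer("A short example sentence for masking.")])
print(batch["input_ids"])  # some tokens replaced by [MASK]
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```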

Guide: Running Locally

  1. Set up the environment: Install the necessary libraries, such as transformers and torch.
  2. Download the model: Use Hugging Face's transformers library to download the GTE-EN-MLM-LARGE model.
  3. Load the model: Load the model in your script for inference or fine-tuning, for example with the fill-mask sketch below.
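
A minimal inference sketch, assuming the checkpoint is published as Alibaba-NLP/gte-en-mlm-large and that trust_remote_code is needed for its custom encoder implementation:

```python
from transformers import pipeline

# Model id assumed to be Alibaba-NLP/gte-en-mlm-large; trust_remote_code is set
# because the GTE-v1.5 series ships a custom encoder implementation.
fill_mask = pipeline(
    "fill-mask",
    model="Alibaba-NLP/gte-en-mlm-large",
    trust_remote_code=True,
)

# The tokenizer follows bert-base-uncased, so [MASK] is the mask token.
for prediction in fill_mask("Paris is the [MASK] of France."):
    print(prediction["token_str"], prediction["score"])
```

The fill-mask pipeline returns the highest-scoring replacements for the [MASK] token; the same checkpoint can also serve as a backbone for fine-tuning on embedding or reranking tasks.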

For optimal performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.

License

The GTE-EN-MLM-LARGE model is licensed under the Apache 2.0 License, allowing for both personal and commercial use with appropriate attribution.
