flax-sentence-embeddings/all_datasets_v3_roberta-large
Introduction
The project focuses on training sentence embedding models with a self-supervised contrastive learning objective on large datasets. It fine-tunes the pretrained roberta-large model on a dataset comprising 1 billion sentence pairs. The resulting model serves as a sentence encoder: given an input sentence, it produces a vector that captures the sentence's semantic information and can be used for tasks such as information retrieval, clustering, or sentence similarity.
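As an illustration of the information-retrieval use case, the sketch below ranks a small corpus against a query by cosine similarity of the embeddings. The corpus sentences and the use of sentence_transformers.util.cos_sim are illustrative assumptions, not part of the original project.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_roberta-large')

    # Hypothetical corpus and query, used only to illustrate retrieval by embedding similarity
    corpus = [
        "A man is eating food.",
        "A monkey is playing drums.",
        "The new movie is awesome.",
    ]
    query = "Someone is playing an instrument."

    corpus_embeddings = model.encode(corpus)   # shape: (len(corpus), hidden_dim)
    query_embedding = model.encode(query)      # shape: (hidden_dim,)

    # Rank corpus sentences by cosine similarity to the query
    scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    best = scores.argmax().item()
    print(corpus[best], float(scores[best]))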
Architecture
The pretrained roberta-large model serves as the base architecture and is fine-tuned with a contrastive learning objective on a large dataset of sentence pairs. Training and fine-tuning are implemented with the JAX/Flax frameworks and leverage efficient hardware.
Training
Pre-training
The first stage uses the existing pretrained roberta-large model; details of the pre-training process can be found in that model's card.
Fine-tuning
Fine-tuning is conducted with a contrastive learning approach: for each sentence in a batch, the cosine similarity with every other sentence in the batch is computed, and a cross-entropy loss is applied so that each sentence's true pair scores highest among the in-batch candidates. The model was trained on a TPU v3-8 for 540,000 steps with a batch size of 1024, using the AdamW optimizer with a learning rate of 2e-5.
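A minimal sketch of this in-batch contrastive objective is shown below, written with jax.numpy since the project uses JAX/Flax; the similarity scale factor is an assumed hyperparameter, not taken from the original training setup.

    import jax
    import jax.numpy as jnp

    def contrastive_loss(anchor_emb, positive_emb, scale=20.0):
        # anchor_emb, positive_emb: (batch, dim) embeddings of paired sentences
        a = anchor_emb / jnp.linalg.norm(anchor_emb, axis=1, keepdims=True)
        b = positive_emb / jnp.linalg.norm(positive_emb, axis=1, keepdims=True)
        # (batch, batch) matrix of scaled cosine similarities: entry [i, j] compares
        # sentence i with candidate j; the true pair sits on the diagonal
        logits = scale * a @ b.T
        labels = jnp.arange(a.shape[0])
        log_probs = jax.nn.log_softmax(logits, axis=1)
        # Cross-entropy that pushes each diagonal (true-pair) score above the rest
        return -jnp.mean(log_probs[labels, labels])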
Training Data
The model was fine-tuned on a diverse collection of datasets totaling over 1 billion sentence pairs, including GOOAQ, Stack Exchange, Flickr 30k, COCO 2020, and many others. During fine-tuning, batches are drawn from these datasets according to weighted sampling probabilities, so each dataset's contribution is determined by its assigned weight.
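The sketch below illustrates weighted dataset sampling under stated assumptions: the dataset names are taken from the list above, but the weights and the per-dataset pair lists are hypothetical placeholders, not the actual training configuration.

    import random

    # Hypothetical sentence-pair pools; in practice these would be the real datasets
    datasets = {
        "gooaq": [("what is jax?", "JAX is a numerical computing library.")],
        "stackexchange": [("How do I reverse a list?", "Use list.reverse() or slicing.")],
        "flickr30k": [("A dog runs on the beach.", "A dog is running along the shore.")],
    }
    # Hypothetical sampling weights; the real weights are part of the training config
    weights = {"gooaq": 3.0, "stackexchange": 5.0, "flickr30k": 1.0}

    def sample_pair():
        # Pick a dataset with probability proportional to its weight, then a pair from it
        names = list(datasets)
        name = random.choices(names, weights=[weights[n] for n in names])[0]
        return random.choice(datasets[name])

    batch = [sample_pair() for _ in range(8)]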
Guide: Running Locally
To run this model locally, follow these steps:
- Set up environment: Ensure you have Python and the necessary libraries installed. The recommended library is sentence-transformers.
- Install SentenceTransformers:

    pip install -U sentence-transformers

- Load and use the model (a similarity example follows this list):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_roberta-large')
    text = "Replace me by any text you'd like."
    text_embedding = model.encode(text)

- Hardware recommendation: For efficient performance, especially on large datasets, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
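As a follow-up to the loading step above, the sketch below compares two sentences using the embeddings produced by model.encode; the example sentences are illustrative, and the cosine-similarity helper from sentence_transformers.util is assumed to be available in the installed version.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_roberta-large')

    # Encode two sentences and compare them with cosine similarity
    embeddings = model.encode([
        "The cat sits on the mat.",
        "A cat is resting on a rug.",
    ])
    score = util.cos_sim(embeddings[0], embeddings[1])
    print(float(score))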
License
The model and its associated resources are subject to the licensing agreements as detailed on the Hugging Face platform. Ensure compliance with all licensing terms when using, modifying, or distributing the model and its outputs.