Fin-ModernBERT-RAG-embed-base
sujet-ai

Introduction
The Fin-ModernBERT-RAG-embed-base model by Sujet-AI is a Sentence Transformer model fine-tuned on financial datasets for semantic textual similarity and feature extraction tasks. It is designed to map sentences to a 768-dimensional dense vector space, enabling semantic search, paraphrase mining, and text classification.
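As a quick illustration of the similarity use case, the sketch below embeds a financial query and two candidate passages and ranks them by cosine similarity. The repo id is assumed to match the model name above, and the query and passage strings are invented for the example.

from sentence_transformers import SentenceTransformer, util

# Repo id assumed to match the model name above.
model = SentenceTransformer("sujet-ai/Fin-ModernBERT-RAG-embed-base")

query = "What drove the company's revenue growth last quarter?"
passages = [
    "Quarterly revenue rose 12% year over year, driven by subscription renewals.",
    "The board approved a new share buyback program in March.",
]

# Encode the query and passages into 768-dimensional vectors.
query_emb = model.encode(query)
passage_embs = model.encode(passages)

# Higher cosine similarity means a more relevant passage.
scores = util.cos_sim(query_emb, passage_embs)  # shape (1, 2)
print(scores)

Because the output embeddings are normalized (see the Architecture section), dot-product and cosine scores rank passages identically.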
Architecture
The model is based on nomic-ai/modernbert-embed-base and supports a maximum sequence length of 8192 tokens. Its architecture includes the following modules (a short inspection sketch follows the list):
- Transformer: A ModernBertModel backbone that processes the tokenized input and is designed to handle long sequences.
- Pooling Layer: Mean-pools the token embeddings into a single 768-dimensional vector per input sentence.
- Normalization Layer: L2-normalizes each sentence embedding to unit length, so similarity scores remain comparable across inputs of different lengths.
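The module stack can be verified at load time. The following sketch (again assuming the repo id used in the guide below) lists the modules and prints the embedding size and maximum sequence length reported above.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sujet-ai/Fin-ModernBERT-RAG-embed-base")

# A SentenceTransformer is an ordered stack of modules:
# Transformer -> Pooling -> Normalize.
for name, module in model.named_children():
    print(name, module.__class__.__name__)

print(model.get_sentence_embedding_dimension())  # expected: 768
print(model.max_seq_length)                      # expected: 8192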
Training
The model was fine-tuned on the sujet-financial-rag-en-dataset, which consists of 104,601 training samples, for sentence-similarity tasks. Training used the MultipleNegativesRankingLoss objective with a learning rate of 0.0002 and a batch size of 64, running for two epochs with a cosine learning rate scheduler and the AdamW optimizer.
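A minimal sketch of how a comparable fine-tuning run can be set up with the Sentence Transformers v3 trainer is shown below, plugging in the hyperparameters reported above. The dataset repo id, its column layout, and details such as warmup or mixed precision are not stated here, so treat those parts as placeholders rather than the authors' exact configuration.

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Start from the base embedding model named above.
model = SentenceTransformer("nomic-ai/modernbert-embed-base")

# Dataset repo id and column names are assumptions; MultipleNegativesRankingLoss
# expects (anchor, positive) text pairs, e.g. (question, relevant passage).
train_dataset = load_dataset("sujet-ai/sujet-financial-rag-en-dataset", split="train")

loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="fin-modernbert-rag-embed-base",
    num_train_epochs=2,
    per_device_train_batch_size=64,
    learning_rate=2e-4,            # 0.0002, as reported
    lr_scheduler_type="cosine",
    # AdamW is the trainer's default optimizer, so no extra setting is needed.
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()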
Guide: Running Locally
- Install the Sentence Transformers library:
pip install -U sentence-transformers
- Load and use the model:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sujet-ai/Fin-ModernBERT-RAG-embed-base")
sentences = ["Example sentence 1", "Example sentence 2"]
embeddings = model.encode(sentences)
print(embeddings.shape)
- Cloud GPU: For enhanced performance, consider using cloud services like AWS, Google Cloud, or Azure, where GPUs such as NVIDIA's Tesla V100 or A100 can significantly accelerate model inference (see the GPU sketch after this list).
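If a GPU is available, locally or on one of the cloud instances mentioned above, Sentence Transformers can run on it by passing a device argument (it also auto-detects CUDA by default). The snippet below is a small sketch of that, with the repo id assumed as before.

import torch
from sentence_transformers import SentenceTransformer

# Pick a GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("sujet-ai/Fin-ModernBERT-RAG-embed-base", device=device)

# Larger batches keep the GPU busy when encoding many documents.
embeddings = model.encode(
    ["Example sentence 1", "Example sentence 2"],
    batch_size=128,
    show_progress_bar=True,
)
print(embeddings.shape)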
License
The model and dataset are available under the terms specified on their Hugging Face pages. Ensure compliance with those terms when using the model in applications.