finance embeddings investopedia
FinLang
Introduction
The FINANCE-EMBEDDINGS-INVESTOPEDIA model by FinLang is an embedding model designed for finance applications. It is trained on a finance dataset from Hugging Face and maps sentences and paragraphs to a 768-dimensional dense vector space. This makes it suitable for tasks such as clustering and semantic search in Retrieval-Augmented Generation (RAG) applications.
Architecture
The model is a fine-tuned version of BAAI/bge-base-en-v1.5. It produces sentence and paragraph embeddings and is optimized for finance-related tasks.
Training
This model is trained on an open-source finance dataset from Hugging Face. It is fine-tuned for proficiency in finance-related tasks while retaining the ability to generalize to other domains. The team plans to release a v2 with a larger corpus and improved training techniques.
Guide: Running Locally
Requirements:
- Libraries: Ensure you have sentence-transformers installed:
    pip install -U sentence-transformers
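As a quick sanity check after installation, the installed version can be read from Python. This is only a sketch: `installed_version` is a hypothetical helper name, and it relies on the standard-library `importlib.metadata` module (Python 3.8 or later).

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("sentence-transformers")
if version is None:
    print("sentence-transformers is not installed")
else:
    print(f"sentence-transformers {version} is installed")
```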
Example Usage:
- Import the SentenceTransformer class and load the model:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('FinLang/investopedia_embedding')
- Encode sentences to obtain embeddings:
    sentences = ["This is an example sentence", "Each sentence is converted"]
    embeddings = model.encode(sentences)
    print(embeddings)
- To compare sentence similarity:
    query_1 = "What is a potential concern with allowing someone else to store your cryptocurrency keys?"
    query_2 = "A potential concern is that the entity holding your keys has control over your cryptocurrency."
    embedding_1 = model.encode(query_1)
    embedding_2 = model.encode(query_2)
    # Dot product of the two embeddings; a higher score means more similar
    scores = (embedding_1 * embedding_2).sum()
    print(scores)
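The comparison above scores a single pair of texts. For semantic search in a RAG setting, the same idea extends to ranking several passages against one query. The following is a minimal sketch, not FinLang's own code: `cosine_similarity`, `rank_passages`, `embed`, and `demo` are hypothetical helper names, and the passages in `demo` are made up for illustration. Cosine similarity is computed explicitly so the ranking does not depend on whether the embeddings are unit length.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs, top_k: int = 3) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def embed(texts):
    """Encode texts with the model (imported lazily; downloads on first use)."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('FinLang/investopedia_embedding')
    return model.encode(texts)


def demo() -> None:
    """Rank a few illustrative passages against a query."""
    passages = [
        "A custodial wallet means a third party holds your private keys.",
        "A bond's coupon rate is the annual interest paid on its face value.",
        "Diversification spreads investments across different assets.",
    ]
    query = "Who controls my crypto if an exchange stores my keys?"
    vectors = embed([query] + passages)
    for index, score in rank_passages(vectors[0], vectors[1:], top_k=2):
        print(f"{score:.3f}  {passages[index]}")
```

Calling `demo()` loads the model, so it needs network access the first time it runs. The two ranking helpers use only the standard library and can be reused with embeddings from any source.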
Cloud GPUs:
For faster encoding, particularly of large document collections, consider running the model on a cloud GPU from a provider such as AWS, Google Cloud, or Azure.
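As a rough sketch of how a GPU could be used when one is present: the `SentenceTransformer` constructor accepts a `device` argument, and `encode` accepts a `batch_size`. The helpers `pick_device` and `encode_corpus` are hypothetical names, and the batch size of 64 is an arbitrary starting point rather than a tuned value.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is missing, so fall back to the CPU label
    return "cuda" if torch.cuda.is_available() else "cpu"


def encode_corpus(texts, batch_size: int = 64):
    """Encode texts on the selected device (downloads the model on first use)."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('FinLang/investopedia_embedding',
                                device=pick_device())
    # Larger batches usually help GPU throughput, limited by available memory.
    return model.encode(texts, batch_size=batch_size, show_progress_bar=False)


print(f"Selected device: {pick_device()}")
```

On a machine without a GPU this falls back to the CPU, so the same script can run locally and on a cloud instance.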
License
This model is released under the cc-by-nc-4.0 license. It is intended for research and may not be used for commercial purposes. Additional terms may apply to third-party datasets.