Finance Embeddings Investopedia Model

Introduction

The FINANCE-EMBEDDINGS-INVESTOPEDIA model by FinLang is an embedding model for finance applications. It was trained on a finance dataset hosted on Hugging Face and maps sentences and paragraphs to a 768-dimensional dense vector space. It is suited to tasks such as clustering and semantic search, including the retrieval step of Retrieval-Augmented Generation (RAG) applications.

Architecture

The model is fine-tuned from BAAI/bge-base-en-v1.5. It produces sentence and paragraph embeddings and is optimized for finance-related tasks.

Training

The model was fine-tuned on an open-source finance dataset from Hugging Face, with the aim of performing well on finance-related tasks while still generalizing to other domains. The team plans to release a v2 trained on a larger corpus with improved training techniques.

Guide: Running Locally

Requirements:

Libraries: Ensure you have sentence-transformers installed.
```
pip install -U sentence-transformers
```

Example Usage:

Import the SentenceTransformer class and load the model:

```
from sentence_transformers import SentenceTransformer

# Downloads the model from the Hugging Face Hub on first use
model = SentenceTransformer('FinLang/investopedia_embedding')
```

Encode sentences to obtain embeddings:

```
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings)
```

To compare sentence similarity:

```
query_1 = "What is a potential concern with allowing someone else to store your cryptocurrency keys?"
query_2 = "A potential concern is that the entity holding your keys has control over your cryptocurrency."
embedding_1 = model.encode(query_1)
embedding_2 = model.encode(query_2)
# Dot product of the two embeddings; this equals cosine similarity
# only when the embeddings are unit-normalized.
scores = (embedding_1 * embedding_2).sum()
print(scores)
```
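
The pairwise comparison above extends naturally to semantic search, where one query is scored against many passages and the best matches are returned. The following is a minimal sketch of that ranking step, not the model authors' method. It uses only the standard library and small toy vectors in place of real `model.encode(...)` output, and the helper names `cosine` and `rank_passages` are hypothetical. It normalizes explicitly, so it does not assume the embeddings are already unit length.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs, top_k=2):
    """Return (index, score) pairs for the top_k most similar passages."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors standing in for model.encode(...) output,
# which would be 768-dimensional for this model.
query_vec = [1.0, 0.0, 0.0]
passage_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query, different length
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]

ranking = rank_passages(query_vec, passage_vecs)
print(ranking)
```

With real embeddings, `query_vec` would come from `model.encode(query)` and `passage_vecs` from `model.encode(passages)`. For large collections, a vectorized or indexed search would likely be preferable to this pure-Python loop.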

Cloud GPUs:

Encoding large collections of text is considerably faster on a GPU. If no local GPU is available, cloud GPU instances from providers such as AWS, Google Cloud, or Azure are an option.
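
Whether the model runs on a local or a cloud GPU, it has to be placed on that device. A minimal sketch of one way to do this follows, assuming PyTorch is the backend. The helper names `pick_device` and `load_model` are hypothetical, and the sketch falls back to the CPU when PyTorch or CUDA is not available.

```python
def pick_device():
    """Return "cuda" if a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to select.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_model(name="FinLang/investopedia_embedding"):
    """Load the embedding model on the selected device."""
    # Imported lazily so that pick_device() works without the library.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name, device=pick_device())


print(pick_device())
```

After `model = load_model()`, calls such as `model.encode(sentences, batch_size=64)` run on the selected device. The batch size shown here is only an illustrative value and should be tuned to the available memory.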

License

This model is released under the cc-by-nc-4.0 license. It is intended for research use and may not be used for commercial purposes. Additional terms may apply to third-party datasets.
