finance embeddings investopedia

FinLang

Introduction

FINANCE-EMBEDDINGS-INVESTOPEDIA, by FinLang, is an embedding model for finance applications. It is trained on a finance dataset hosted on Hugging Face and maps sentences and paragraphs to a 768-dimensional dense vector space. It is suited to tasks such as clustering and to semantic search in Retrieval-Augmented Generation (RAG) applications.

Architecture

The model is fine-tuned from BAAI/bge-base-en-v1.5. It produces sentence and paragraph embeddings and is optimized for finance-related tasks.

Training

The model is trained on an open-source finance dataset from Hugging Face. Fine-tuning targets finance-related tasks while aiming to preserve the ability to generalize to other domains. The team plans to release a v2 with a larger corpus and improved training techniques.

Guide: Running Locally

Requirements:

  • Libraries: Ensure you have sentence-transformers installed.
    pip install -U sentence-transformers
    

Example Usage:

  1. Import the SentenceTransformer class and load the model:

    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('FinLang/investopedia_embedding')
    
  2. Encode sentences to obtain embeddings:

    sentences = ["This is an example sentence", "Each sentence is converted"]
    embeddings = model.encode(sentences)
    print(embeddings)
    
  3. Score the similarity of two texts:

    query_1 = "What is a potential concern with allowing someone else to store your cryptocurrency keys?"
    query_2 = "A potential concern is that the entity holding your keys has control over your cryptocurrency."
    embedding_1 = model.encode(query_1)
    embedding_2 = model.encode(query_2)
    # Dot product of the two vectors; this equals cosine similarity
    # only when both embeddings are unit-length.
    scores = (embedding_1 * embedding_2).sum()
    print(scores)
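
The pairwise comparison in step 3 extends naturally to the semantic search use case mentioned in the introduction. The following is a minimal sketch, not FinLang's own retrieval code: it assumes NumPy is available, and the helper names cosine_scores, top_k, and search are hypothetical. It normalizes vectors explicitly, so it does not depend on the model returning unit-length embeddings.

```python
import numpy as np


def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between one query vector and each row of doc_vecs."""
    query_vec = np.asarray(query_vec, dtype=np.float64)
    doc_vecs = np.asarray(doc_vecs, dtype=np.float64)
    # Normalize explicitly so the result does not depend on whether the
    # embeddings are already unit-length.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar rows, best first."""
    scores = cosine_scores(query_vec, doc_vecs)
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


def search(model, query, passages, k=3):
    """Rank passages against a query using any object that exposes encode()."""
    doc_vecs = model.encode(passages)   # 2-D array, one row per passage
    query_vec = model.encode(query)     # 1-D array for a single string
    return [(passages[i], score) for i, score in top_k(query_vec, doc_vecs, k)]
```

With the model loaded as in step 1, a call might look like this (the passages are made-up examples):

    passages = [
        "Custodial wallets keep your private keys on your behalf.",
        "A bond's coupon is the interest it pays each year.",
    ]
    for text, score in search(model, "Who controls my crypto keys?", passages, k=1):
        print(round(score, 3), text)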
    

Cloud GPUs:

Encoding large collections of text is faster on a GPU. If none is available locally, consider cloud GPUs from providers such as AWS, Google Cloud, or Azure.
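
Whether the GPU is local or rented, the model has to be placed on it to benefit. The sketch below shows one way to do that; the helper names pick_device, load_model, and encode_corpus are hypothetical, and the batch size of 64 is an arbitrary starting point rather than a recommendation from the model authors. It uses the device argument of SentenceTransformer and the batch_size argument of encode.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # installed as a dependency of sentence-transformers
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_model(name="FinLang/investopedia_embedding"):
    """Load the model onto the device chosen by pick_device()."""
    # Imported lazily so the helpers above work without the library installed.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(name, device=pick_device())


def encode_corpus(model, sentences, batch_size=64):
    """Encode many sentences in batches; tune batch_size to fit GPU memory."""
    return model.encode(sentences, batch_size=batch_size, show_progress_bar=False)
```

Larger batches generally use the GPU more efficiently but need more memory, so reduce batch_size if encoding fails with an out-of-memory error.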

License

This model is released under the cc-by-nc-4.0 license. It is intended for research use and may not be used commercially. Additional terms may apply to third-party datasets.
