PubMedBERT Base Embeddings 500K

NeuML

Introduction

PUBMEDBERT-BASE-EMBEDDINGS-500K is a pruned version of the PubMedBERT Embeddings model, designed for efficient embedding generation and semantic search applications. It retains the top 25% of the most frequently used tokens, making it a compact choice for tasks such as retrieval-augmented generation (RAG).

Architecture

This model uses a vocabulary-pruned version of PubMedBERT, reducing the vocabulary to the most frequently used tokens. The embedding weights are stored at int16 precision, which keeps the model small enough for smaller or lower-powered devices and can speed up vectorization.
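
To check these properties locally, the model can be loaded with model2vec and used to produce a vector. The sketch below is a minimal illustration, assuming a recent model2vec release; the example sentence is hypothetical, and the vocabulary check assumes StaticModel exposes a tokenizer attribute backed by the tokenizers library.

  from model2vec import StaticModel

  # Load the pruned model and embed a sample sentence
  model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-500K")
  vector = model.encode(["recommended aspirin dose for adults"])

  print(vector.shape)  # (1, embedding dimensions after PCA)
  print(vector.dtype)  # runtime dtype; the weights are stored as int16 on disk

  # Pruned vocabulary size (assumption: tokenizer is a tokenizers.Tokenizer)
  print(model.tokenizer.get_vocab_size())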

Training

The model was created by pruning the vocabulary of a larger PubMedBERT embeddings model: the most frequent tokens are selected and their embeddings re-weighted. PCA is applied to reduce the dimensionality of the token embeddings, which are then normalized. The model is built using tools such as Model2Vec and Tokenlearn.
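
For illustration only, the sketch below shows the general shape of such a pruning pipeline in NumPy: keep the most frequent tokens, project the remaining embeddings with a PCA computed via SVD, and L2-normalize the result. It is a simplified stand-in rather than NeuML's actual training code; the helper name and toy data are hypothetical, and the frequency re-weighting step is omitted.

  import numpy as np

  def prune_and_project(embeddings, frequencies, keep, pca_dims):
      """Illustrative vocabulary pruning + PCA + normalization (hypothetical helper)."""
      # 1. Keep the `keep` most frequent tokens
      top = np.argsort(frequencies)[::-1][:keep]
      pruned = embeddings[top]

      # 2. PCA via SVD on the mean-centered embeddings
      centered = pruned - pruned.mean(axis=0)
      _, _, vt = np.linalg.svd(centered, full_matrices=False)
      projected = centered @ vt[:pca_dims].T

      # 3. L2-normalize each token vector
      norms = np.linalg.norm(projected, axis=1, keepdims=True)
      return top, projected / np.clip(norms, 1e-12, None)

  # Toy example: 1,000 tokens with 768-dim vectors pruned to 500 tokens at 256 dims
  rng = np.random.default_rng(0)
  kept_ids, vectors = prune_and_project(
      rng.normal(size=(1000, 768)),
      rng.integers(1, 10_000, size=1000),
      keep=500,
      pca_dims=256,
  )
  print(vectors.shape)  # (500, 256)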

Guide: Running Locally

  1. Install Dependencies: Ensure Python is available and install the library you plan to use, such as txtai, sentence-transformers, or model2vec (each is available via pip).

  2. Load the Model:

    • Using txtai:
      import txtai
      # Build an embeddings index backed by this model; content=True stores text with the vectors
      embeddings = txtai.Embeddings(path="neuml/pubmedbert-base-embeddings-500K", content=True)
      
    • Using sentence-transformers:
      from sentence_transformers import SentenceTransformer
      from sentence_transformers.models import StaticEmbedding

      # Wrap the Model2Vec weights in a static embedding module
      static = StaticEmbedding.from_model2vec("neuml/pubmedbert-base-embeddings-500K")
      model = SentenceTransformer(modules=[static])
      
    • Using model2vec:
      from model2vec import StaticModel
      # Load the static embedding model directly
      model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-500K")
      
  3. Create Embeddings: Index documents or encode sentences to generate embeddings (see the end-to-end sketch after this list).

  4. GPU Recommendation: For large datasets, a GPU such as an NVIDIA RTX 3090 (local or cloud) can speed up encoding, although the model is compact enough to run well on CPU-only or lower-powered machines.
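
Putting the steps together, the sketch below indexes a few documents and runs a semantic search with txtai. It assumes a recent txtai version installed via "pip install txtai"; the documents and query are hypothetical.

  import txtai

  # 1. Load the model into a txtai embeddings index (content=True keeps the original text)
  embeddings = txtai.Embeddings(path="neuml/pubmedbert-base-embeddings-500K", content=True)

  # 2. Index a few example biomedical sentences
  embeddings.index([
      "Metformin is a first-line treatment for type 2 diabetes.",
      "ACE inhibitors are commonly prescribed for hypertension.",
      "Statins lower LDL cholesterol levels.",
  ])

  # 3. Run a semantic search over the index
  print(embeddings.search("drugs for high blood pressure", 1))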

License

The PUBMEDBERT-BASE-EMBEDDINGS-500K model is licensed under the Apache-2.0 License.
