PubMedBERT Base Embeddings 500K

NeuML

Introduction

PUBMEDBERT-BASE-EMBEDDINGS-500K is a pruned version of the PubMedBERT Embeddings model, designed for efficient embedding generation and semantic search applications. It retains the top 25% of the most frequently used tokens, making it a compact choice for tasks such as retrieval-augmented generation (RAG).

Architecture

This model uses a vocabulary-pruned version of PubMedBERT, reducing the vocabulary to the most frequently used tokens. The embedding weights are stored at int16 precision, which keeps the model small enough for smaller or lower-powered devices and can speed up vectorization.
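
To check these properties locally, the model can be loaded with model2vec and used to produce a vector. The sketch below is a minimal illustration, assuming a recent model2vec release; the example sentence is hypothetical, and the vocabulary check assumes StaticModel exposes a tokenizer attribute backed by the tokenizers library.

  from model2vec import StaticModel

  # Load the pruned model and embed a sample sentence
  model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-500K")
  vector = model.encode(["recommended aspirin dose for adults"])

  print(vector.shape)  # (1, embedding dimensions after PCA)
  print(vector.dtype)  # runtime dtype; the weights are stored as int16 on disk

  # Pruned vocabulary size (assumption: tokenizer is a tokenizers.Tokenizer)
  print(model.tokenizer.get_vocab_size())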

Training

The model was created by pruning the vocabulary of a larger PubMedBERT embeddings model: the most frequent tokens are selected and their embeddings re-weighted. PCA is applied to reduce the dimensionality of the token embeddings, which are then normalized. The model is built using tools such as Model2Vec and Tokenlearn.
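
For illustration only, the sketch below shows the general shape of such a pruning pipeline in NumPy: keep the most frequent tokens, project the remaining embeddings with a PCA computed via SVD, and L2-normalize the result. It is a simplified stand-in rather than NeuML's actual training code; the helper name and toy data are hypothetical, and the frequency re-weighting step is omitted.

  import numpy as np

  def prune_and_project(embeddings, frequencies, keep, pca_dims):
      """Illustrative vocabulary pruning + PCA + normalization (hypothetical helper)."""
      # 1. Keep the `keep` most frequent tokens
      top = np.argsort(frequencies)[::-1][:keep]
      pruned = embeddings[top]

      # 2. PCA via SVD on the mean-centered embeddings
      centered = pruned - pruned.mean(axis=0)
      _, _, vt = np.linalg.svd(centered, full_matrices=False)
      projected = centered @ vt[:pca_dims].T

      # 3. L2-normalize each token vector
      norms = np.linalg.norm(projected, axis=1, keepdims=True)
      return top, projected / np.clip(norms, 1e-12, None)

  # Toy example: 1,000 tokens with 768-dim vectors pruned to 500 tokens at 256 dims
  rng = np.random.default_rng(0)
  kept_ids, vectors = prune_and_project(
      rng.normal(size=(1000, 768)),
      rng.integers(1, 10_000, size=1000),
      keep=500,
      pca_dims=256,
  )
  print(vectors.shape)  # (500, 256)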

Guide: Running Locally

  1. Install Dependencies: Ensure Python is available and install the library you plan to use, such as txtai, sentence-transformers, or model2vec (each is available via pip).

  2. Load the Model:

    • Using txtai:
      import txtai
      # Build an embeddings index backed by this model; content=True stores text with the vectors
      embeddings = txtai.Embeddings(path="neuml/pubmedbert-base-embeddings-500K", content=True)
      
    • Using sentence-transformers:
      from sentence_transformers import SentenceTransformer
      from sentence_transformers.models import StaticEmbedding

      # Wrap the Model2Vec weights in a static embedding module
      static = StaticEmbedding.from_model2vec("neuml/pubmedbert-base-embeddings-500K")
      model = SentenceTransformer(modules=[static])
      
    • Using model2vec:
      from model2vec import StaticModel
      # Load the static embedding model directly
      model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-500K")
      
  3. Create Embeddings: Index documents or encode sentences to generate embeddings (see the end-to-end sketch after this list).

  4. GPU Recommendation: For large datasets, a GPU such as an NVIDIA RTX 3090 (local or cloud) can speed up encoding, although the model is compact enough to run well on CPU-only or lower-powered machines.
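
Putting the steps together, the sketch below indexes a few documents and runs a semantic search with txtai. It assumes a recent txtai version installed via "pip install txtai"; the documents and query are hypothetical.

  import txtai

  # 1. Load the model into a txtai embeddings index (content=True keeps the original text)
  embeddings = txtai.Embeddings(path="neuml/pubmedbert-base-embeddings-500K", content=True)

  # 2. Index a few example biomedical sentences
  embeddings.index([
      "Metformin is a first-line treatment for type 2 diabetes.",
      "ACE inhibitors are commonly prescribed for hypertension.",
      "Statins lower LDL cholesterol levels.",
  ])

  # 3. Run a semantic search over the index
  print(embeddings.search("drugs for high blood pressure", 1))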

License

The PUBMEDBERT-BASE-EMBEDDINGS-500K model is licensed under the Apache-2.0 License.
