pubmedbert-base-embeddings-100K

Introduction

pubmedbert-base-embeddings-100K is a pruned version of the larger PubMedBERT Embeddings 2M model. Pruning keeps only the top 5% most frequently used tokens in the vocabulary. The model is intended for tasks such as semantic search and retrieval-augmented generation (RAG).
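
To make the "top 5%" idea concrete, here is a minimal sketch of frequency-based vocabulary pruning. It is not the script used to build this model: the function name prune_vocabulary, the keep_fraction parameter, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def prune_vocabulary(token_stream, keep_fraction=0.05):
    """Return the most frequent tokens, keeping `keep_fraction` of the distinct vocabulary."""
    counts = Counter(token_stream)
    # Always keep at least one token, even for tiny vocabularies
    keep = max(1, math.ceil(len(counts) * keep_fraction))
    # most_common() sorts by descending frequency
    return [token for token, _ in counts.most_common(keep)]


# Toy corpus (hypothetical): 40 distinct tokens, "tok0" is the most frequent
corpus = []
for i in range(40):
    corpus.extend([f"tok{i}"] * (40 - i))

kept = prune_vocabulary(corpus)
print(kept)
```

With 40 distinct tokens and a 5% cut, only the two most frequent tokens survive. A real pruning run would count tokens produced by the model's own tokenizer over a large domain corpus.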

Architecture

This model derives from PubMedBERT Embeddings and uses vocabulary pruning to reduce model size while aiming to preserve performance. It can be loaded with txtai, sentence-transformers, or Model2Vec, and it targets English text.

Training

Vocabulary pruning was done with a script that tokenizes data, calculates token weights using a scoring index, and applies PCA for dimensionality reduction. The model's embeddings are re-weighted and normalized during training. The process builds on work from the Minish Lab team, using their Model2Vec and Tokenlearn tools.
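
The token-weighting step can be sketched roughly as follows. The source does not say which scoring method the index uses, so this example assumes a BM25-style IDF score. The whitespace tokenizer, the token_weights helper, and the toy vectors are hypothetical stand-ins. PCA and normalization are not shown.

```python
import math
from collections import Counter


def token_weights(documents):
    """Assumed BM25-style IDF weight per token: rarer tokens score higher."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Whitespace split stands in for the real tokenizer
        doc_freq.update(set(doc.lower().split()))
    return {
        token: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for token, df in doc_freq.items()
    }


documents = [
    "aspirin reduces fever",
    "aspirin reduces pain",
    "ibuprofen reduces inflammation",
]
weights = token_weights(documents)

# Toy 2-d token vectors, made up for illustration
vectors = {"aspirin": [1.0, 0.0], "ibuprofen": [0.0, 1.0], "reduces": [1.0, 1.0]}

# Re-weighting: scale each token vector by its weight
reweighted = {tok: [weights[tok] * x for x in vec] for tok, vec in vectors.items()}
```

In this toy corpus "reduces" appears in every document and gets the smallest weight, so its vector contributes least once vectors are scaled.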

Guide: Running Locally

Basic Steps

Install Dependencies: Ensure you have the necessary Python libraries installed, such as txtai, sentence-transformers, or Model2Vec.

Load the Model:

For txtai:

import txtai
embeddings = txtai.Embeddings(path="neuml/pubmedbert-base-embeddings-100K", content=True)

For sentence-transformers:

from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
static = StaticEmbedding.from_model2vec("neuml/pubmedbert-base-embeddings-100K")
model = SentenceTransformer(modules=[static])

For Model2Vec:

from model2vec import StaticModel
model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-100K")

Run Queries: Use the model to encode sentences or perform searches according to your needs.
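
As a rough illustration of the query step, the sketch below ranks documents by cosine similarity to a query. The search and cosine helpers are hypothetical, not part of any of the libraries above. The encode argument is assumed to be any callable that maps a list of strings to a list of vectors, such as the encode method of a loaded Model2Vec or sentence-transformers model. A toy bag-of-words encoder stands in here so the example runs without downloading anything.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, documents, encode, limit=3):
    """Rank documents by cosine similarity to the query, best match first."""
    vectors = encode([query] + documents)
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(doc_vecs, documents)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:limit]


# Toy stand-in encoder (hypothetical): word counts over a tiny fixed vocabulary
VOCAB = ["aspirin", "fever", "insulin", "glucose"]


def toy_encode(texts):
    return [[float(t.lower().split().count(w)) for w in VOCAB] for t in texts]


documents = ["Aspirin lowers fever", "Insulin regulates glucose"]
results = search("fever treatment", documents, toy_encode)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

To use the real model, pass model.encode in place of toy_encode. txtai users would normally rely on the library's own index and search methods instead of a hand-written ranking loop.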

Suggested Cloud GPUs

For improved performance, especially with larger datasets, consider using cloud GPU services such as AWS EC2 with GPU instances, Google Cloud Platform with NVIDIA GPUs, or Azure's GPU-optimized VM instances.

License

The pubmedbert-base-embeddings-100K model is available under the Apache 2.0 License, allowing for free use, modification, and distribution of the software.
