pubmedbert-base-embeddings-500K
by NeuML

Introduction
PUBMEDBERT-BASE-EMBEDDINGS-500K is a pruned version of the PubMedBERT Embeddings model, designed for efficient embedding generation and semantic search. It retains the top 25% of the most frequently used tokens, making it a compact choice for tasks such as retrieval-augmented generation (RAG).
Architecture
This model uses a vocabulary-pruned version of PubMedBERT, reducing the vocabulary to the most frequently used tokens. The embeddings are stored at int16 precision, which shrinks the model's footprint and can speed up vectorization, making it well suited to smaller or lower-powered devices.
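The model card does not detail the exact quantization scheme, but the idea behind int16 storage can be illustrated with a minimal sketch of symmetric quantization, assuming L2-normalized vectors whose components lie in [-1, 1] (the function names and scale factor here are illustrative, not the library's API):

```python
import numpy as np

SCALE = 32767.0  # int16 range is [-32768, 32767]

def to_int16(embeddings: np.ndarray) -> np.ndarray:
    # Assumes L2-normalized inputs, so every component is in [-1, 1]
    return np.round(embeddings * SCALE).astype(np.int16)

def to_float32(quantized: np.ndarray) -> np.ndarray:
    # Inverse mapping back to approximate float32 values
    return quantized.astype(np.float32) / SCALE

# Toy embeddings: 4 vectors of dimension 256, L2-normalized
vecs = np.random.rand(4, 256).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

q = to_int16(vecs)          # half the bytes of float32 storage
restored = to_float32(q)    # quantization error is at most 0.5 / SCALE per component
```

Halving the bytes per component is what makes the int16 representation attractive on lower-powered devices, at the cost of a tiny, bounded rounding error.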
Training
The model was trained by pruning the vocabulary of a larger PubMedBERT model: the most common tokens are selected by frequency and re-weighted. PCA is then applied to reduce the dimensionality of the embeddings, which are adjusted and normalized. The model was built using tools such as Model2Vec and Tokenlearn.
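The dimensionality-reduction step described above can be sketched without any special tooling; the following is a minimal, library-free illustration of PCA via SVD followed by L2 normalization, on random stand-in data (the shapes and component count are illustrative, not the model's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in token embedding table: 1000 tokens, 768 dimensions
token_embeddings = rng.normal(size=(1000, 768)).astype(np.float32)

# PCA via SVD: center the data, then project onto the top principal components
mean = token_embeddings.mean(axis=0)
centered = token_embeddings - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:256].T  # keep the top 256 components

# L2-normalize each embedding so cosine similarity becomes a dot product
normalized = reduced / np.linalg.norm(reduced, axis=1, keepdims=True)
```

In the actual pipeline, Model2Vec and Tokenlearn handle this reduction and re-weighting; the sketch only shows why the resulting vectors are both smaller and unit-length.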
Guide: Running Locally
1. Install dependencies: ensure you have Python and at least one of the supported libraries installed: txtai, sentence-transformers, or model2vec.

2. Load the model:

   Using txtai:

   ```python
   import txtai

   embeddings = txtai.Embeddings(path="neuml/pubmedbert-base-embeddings-500K", content=True)
   ```

   Using sentence-transformers:

   ```python
   from sentence_transformers import SentenceTransformer
   from sentence_transformers.models import StaticEmbedding

   static = StaticEmbedding.from_model2vec("neuml/pubmedbert-base-embeddings-500K")
   model = SentenceTransformer(modules=[static])
   ```

   Using model2vec:

   ```python
   from model2vec import StaticModel

   model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-500K")
   ```

3. Create embeddings: index documents or encode sentences to generate embeddings.

4. GPU recommendation: for optimal performance on large datasets, consider a cloud GPU such as an NVIDIA RTX 3090.
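Once documents are encoded, semantic search reduces to nearest-neighbor lookup by cosine similarity. A library-free sketch with toy vectors (the `search` helper and the data are illustrative, not part of any of the libraries above):

```python
import numpy as np

def search(query_vec: np.ndarray, doc_vecs: np.ndarray, topk: int = 3):
    # Normalize so that the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # Return the top-k (index, score) pairs, highest score first
    idx = np.argsort(-scores)[:topk]
    return list(zip(idx.tolist(), scores[idx].tolist()))

rng = np.random.default_rng(42)
docs = rng.normal(size=(10, 64))               # 10 toy document vectors
query = docs[3] + 0.01 * rng.normal(size=64)   # near-duplicate of document 3

results = search(query, docs, topk=3)
```

In practice, txtai's `embeddings.search(...)` or the model's `encode(...)` output plugged into a vector index performs this same ranking at scale.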
License
The PUBMEDBERT-BASE-EMBEDDINGS-500K model is licensed under the Apache-2.0 License.