pubmedbert-base-embeddings-100K
NeuML
Introduction
The pubmedbert-base-embeddings-100K model is a pruned version of the larger PubMedBERT Embeddings 2M model that retains only the top 5% most frequently used tokens. It is intended for tasks such as semantic search and retrieval-augmented generation (RAG).
Architecture
This model is based on the PubMedBERT architecture and uses vocabulary pruning to reduce model size while attempting to preserve performance. It is compatible with several libraries, including txtai, sentence-transformers, and Model2Vec, and supports English language processing.
Training
The vocabulary pruning process was conducted using a script that tokenizes data, calculates token weights using a scoring index, and applies PCA for dimensionality reduction. The model's embeddings are re-weighted and normalized during training. The training process builds on work from the Minish Lab team, using tools such as Model2Vec and Tokenlearn.
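The frequency-based selection behind the pruning can be illustrated with a minimal sketch. This is not the script used to build the model: the function name `prune_vocabulary`, the whitespace tokenization, and the toy corpus are all assumptions made for illustration, and the token weighting and PCA steps are left out.

```python
from collections import Counter


def prune_vocabulary(token_sequences, keep_fraction=0.05):
    """Return the most frequent tokens, keeping `keep_fraction` of the vocabulary.

    Hypothetical helper for illustration; not part of Model2Vec or Tokenlearn.
    """
    counts = Counter(token for seq in token_sequences for token in seq)
    # Always keep at least one token so the pruned vocabulary is never empty.
    keep = max(1, int(len(counts) * keep_fraction))
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:keep]]


# Toy corpus; whitespace splitting stands in for a real tokenizer (assumption).
corpus = [
    "gene expression in tumor cells",
    "tumor cells and gene regulation",
    "gene therapy for tumor growth",
]
tokens = [text.split() for text in corpus]

# A larger fraction than 5% is used here only because the toy vocabulary is tiny.
kept = prune_vocabulary(tokens, keep_fraction=0.25)
print(kept)
```

With a real corpus, the default `keep_fraction=0.05` would correspond to the top 5% of tokens described in the introduction.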
Guide: Running Locally
Basic Steps
- Install Dependencies: Ensure you have the necessary Python libraries installed, such as txtai, sentence-transformers, or Model2Vec.
- Load the Model:
  - For txtai:
        import txtai
        embeddings = txtai.Embeddings(path="neuml/pubmedbert-base-embeddings-100K", content=True)
  - For sentence-transformers:
        from sentence_transformers import SentenceTransformer
        from sentence_transformers.models import StaticEmbedding
        # Wrap the Model2Vec weights in a StaticEmbedding module
        static = StaticEmbedding.from_model2vec("neuml/pubmedbert-base-embeddings-100K")
        model = SentenceTransformer(modules=[static])
  - For Model2Vec:
        from model2vec import StaticModel
        model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-100K")
- Run Queries: Use the model to encode sentences or perform searches according to your needs.
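Once sentences are encoded, a semantic search comes down to ranking document vectors by similarity to a query vector. The sketch below shows that ranking step in plain Python with toy three-dimensional vectors. In practice the vectors would come from the loaded model (for example, its `encode` method in sentence-transformers or Model2Vec). The helper names `cosine_similarity` and `rank_documents` are hypothetical and not part of any of the libraries above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as having no similarity to anything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, document_vectors):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [
        (index, cosine_similarity(query_vector, vector))
        for index, vector in enumerate(document_vectors)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for model output (assumption for illustration).
document_vectors = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_vector = [0.0, 1.0, 0.0]

ranking = rank_documents(query_vector, document_vectors)
best_index, best_score = ranking[0]
print(best_index, round(best_score, 3))
```

txtai wraps indexing and ranking behind its own search interface, so a helper like this is mainly relevant when working with raw vectors.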
Suggested Cloud GPUs
For improved performance, especially with larger datasets, consider using cloud GPU services such as AWS EC2 with GPU instances, Google Cloud Platform with NVIDIA GPUs, or Azure's GPU-optimized VM instances.
License
The pubmedbert-base-embeddings-100K model is available under the Apache 2.0 License, allowing for free use, modification, and distribution of the software.