PathologyBERT
Introduction
PathologyBERT is a pre-trained masked language model developed specifically for the pathology domain, with a focus on breast pathology specimens. It addresses the limitations of general-domain language models such as BERT in handling domain-specific terminology by building a specialized vocabulary.
Architecture
PathologyBERT is based on the BERT architecture and uses masked language modeling to predict masked words in a sentence. It relies on WordPiece input tokenization, though the authors note that this method struggles with specialized pathology vocabulary, which a general-domain vocabulary tends to fragment into uninformative subwords.
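To see why a specialized vocabulary matters, the short sketch below compares how a general-domain BERT tokenizer and the PathologyBERT tokenizer segment a pathology phrase. The exact subword splits depend on each tokenizer's learned vocabulary, so the comparison is only illustrative.

from transformers import AutoTokenizer

# General-domain vocabulary vs. the specialized pathology vocabulary.
general = AutoTokenizer.from_pretrained("bert-base-uncased")
domain = AutoTokenizer.from_pretrained("tsantos/PathologyBERT")

# Phrase taken from the fill-mask example later in this card.
phrase = "intraductal papilloma with dcis and micro calcifications"
print("general :", general.tokenize(phrase))  # domain terms typically split into many subwords
print("domain  :", domain.tokenize(phrase))   # the specialized vocabulary keeps them more intact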
Training
The model was pre-trained with a batch size of 32, a maximum sequence length of 64, a masked language modeling probability of 0.15, and a learning rate of 2e-5. Training ran for 300,000 steps, with BERT's remaining hyperparameters left at their defaults.
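The original pre-training scripts are not included here; the following is a minimal sketch of how those hyperparameters could be wired into a Hugging Face Trainer for masked language modeling. The toy corpus, fresh model initialization, and output directory are illustrative assumptions, not the authors' actual setup.

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Tokenizer carrying the specialized pathology vocabulary.
tokenizer = AutoTokenizer.from_pretrained("tsantos/PathologyBERT")

# A fresh BERT model sized to that vocabulary (assumption: the original corpus
# and initialization are not reproduced in this sketch).
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# Toy corpus standing in for the breast pathology report corpus.
texts = ["invasive ductal carcinoma identified in the left breast specimen"]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=64),  # max sequence length 64
    remove_columns=["text"],
)

# Mask 15% of input tokens, matching the reported MLM probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="pathology-bert-mlm",   # placeholder output directory
    per_device_train_batch_size=32,    # batch size 32
    learning_rate=2e-5,                # learning rate 2e-5
    max_steps=300_000,                 # 300,000 training steps
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()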
Guide: Running Locally
To run PathologyBERT locally, you can use the Hugging Face Transformers library. Here's a basic guide:
- Install the Transformers library:
pip install transformers
- Use the model with a pipeline for masked language modeling:
from transformers import pipeline

language_model = pipeline('fill-mask', model='tsantos/PathologyBERT')
result = language_model("intraductal papilloma with [MASK] AND MICRO calcifications")
- Analyze the output to interpret the model's predictions, as shown in the sketch after this list.
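Reusing the pipeline call from the previous step, each prediction returned by the fill-mask pipeline is a dictionary containing the candidate token, its score, and the completed sequence, so the top suggestions can be inspected as follows:

# Print the top candidates proposed for the [MASK] slot.
for prediction in result:
    print(f"{prediction['token_str']:>15}  score={prediction['score']:.3f}")
    # prediction['sequence'] holds the input sentence with the mask filled in.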
For better performance, consider using cloud GPU services such as AWS, Google Cloud, or Azure for computation-intensive tasks.
License
For licensing information, please refer to the original Hugging Face repository or contact the author via email at thiagogyn.maia@gmail.com.