medbert base chinese
trueto
Introduction
MEDBERT is an open-source project that explores the application of BERT models in Chinese clinical natural language processing. It involves the development and evaluation of models like MedBERT and MedAlbert, trained on extensive Chinese clinical text datasets.
Architecture
MedBERT and MedAlbert are built on the BERT and ALBERT architectures, respectively. Both are pre-trained on 650 million characters of Chinese clinical text. Four datasets (CEMRNER, CMTNER, CMedQQ, and CCTC) are used with the models, covering named entity recognition, sentence pair recognition, and sentence classification.
Training
The models are trained on the following four datasets:
- CEMRNER: Chinese Electronic Medical Record Named Entity Recognition
- CMTNER: Chinese Medical Text Named Entity Recognition
- CMedQQ: Chinese Medical Question-Question recognition
- CCTC: Chinese Clinical Text Classification
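The four datasets above fall into three task types, and each type usually pairs with a different model head when fine-tuning with the Transformers library. The sketch below records that pairing. The `TaskSpec` type, the `TASKS` table, and the `load_head` helper are hypothetical names introduced for illustration, and the choice of Auto classes is an assumption about typical usage, not the project's own training code.

```python
from typing import Dict, NamedTuple


class TaskSpec(NamedTuple):
    """Illustrative description of one downstream dataset (hypothetical type)."""
    task_type: str       # kind of task the dataset represents
    auto_class: str      # name of a Transformers Auto class with a suitable head
    sentence_pair: bool  # True when each example consists of two sentences


# Dataset abbreviations come from the list above; the head choices are assumptions.
TASKS: Dict[str, TaskSpec] = {
    "CEMRNER": TaskSpec("named entity recognition", "AutoModelForTokenClassification", False),
    "CMTNER": TaskSpec("named entity recognition", "AutoModelForTokenClassification", False),
    "CMedQQ": TaskSpec("sentence pair recognition", "AutoModelForSequenceClassification", True),
    "CCTC": TaskSpec("sentence classification", "AutoModelForSequenceClassification", False),
}


def load_head(dataset: str, checkpoint: str, num_labels: int):
    """Load `checkpoint` with the head assumed for `dataset` (needs transformers)."""
    # Imported lazily so the TASKS table can be inspected without the library.
    import transformers

    model_cls = getattr(transformers, TASKS[dataset].auto_class)
    # num_labels sizes the newly initialised classification layer.
    return model_cls.from_pretrained(checkpoint, num_labels=num_labels)
```

For the sentence pair dataset, the tokenizer would be called with two text arguments so that both sentences end up in a single input sequence.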
According to the project's performance comparisons, MedBERT and its variants outperform other models on several of these tasks, notably named entity recognition and sentence pair recognition.
Guide: Running Locally
- Clone the Repository: Clone the MEDBERT repository from GitHub.
- Install Dependencies: Make sure the PyTorch and Transformers libraries are installed.
- Download Model Weights: Obtain the pre-trained MedBERT model weights from the repository.
- Run the Model: Use the model on your local machine for inference or further training.
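As a rough illustration of the last two steps, the sketch below loads the weights with the Transformers library and extracts one vector per sentence. The identifier in `MODEL_ID` is an assumption inferred from the title and author name at the top of this card; if you downloaded the weights from the repository, pass the local directory instead. The `embed_sentences` helper is a hypothetical name, and taking the [CLS] vector is just one simple way to get a sentence representation.

```python
from typing import List

# Assumption: Hugging Face Hub identifier inferred from this card's header.
# Replace it with a local directory that holds the downloaded weights if needed.
MODEL_ID = "trueto/medbert-base-chinese"


def embed_sentences(texts: List[str], model_id: str = MODEL_ID, device: str = "cpu"):
    """Return one [CLS] vector per input text as a (len(texts), hidden_size) tensor."""
    # Imported lazily so this file can be imported before the dependencies exist.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).to(device)
    model.eval()  # switch off dropout for inference

    # Pad to the longest text in the batch and cut inputs at BERT's 512-token limit.
    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    batch = {name: tensor.to(device) for name, tensor in batch.items()}

    with torch.no_grad():  # no gradients are needed for inference
        outputs = model(**batch)

    # last_hidden_state has shape (batch, seq_len, hidden); position 0 is [CLS].
    return outputs.last_hidden_state[:, 0, :].cpu()
```

Calling `embed_sentences(["患者主诉头痛三天。"])` would fetch the weights on first use and return a tensor with one row for the single input sentence.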
For efficient training and inference, consider using cloud GPUs such as those provided by AWS EC2, Google Cloud, or Azure.
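Whether the code runs on a cloud GPU or a laptop, it helps to pick the device at run time. The following is a minimal sketch of that choice; `pick_device` is a hypothetical helper, and the optional `cuda_available` argument exists only so the logic can be exercised without PyTorch or a GPU present.

```python
from typing import Optional


def pick_device(cuda_available: Optional[bool] = None) -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    if cuda_available is None:
        # Probe the real hardware only when the caller did not supply an answer.
        import torch

        cuda_available = torch.cuda.is_available()
    return "cuda" if cuda_available else "cpu"
```

The returned string can be passed to a model's `.to(...)` method so the same script works on both GPU and CPU machines.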
License
The MEDBERT project is open-source and available on GitHub. Review the repository for the specific license terms.