Chemical-BERT-Uncased
Introduction
Chemical-BERT-Uncased is a BERT-based language model tailored to the chemical industry. It continues pre-training from a SciBERT checkpoint on a corpus of over 40,000 technical documents and 13,000 chemistry-related Wikipedia articles. The model is intended to improve understanding and processing of chemical-domain texts such as Safety Data Sheets and Product Information Documents.
Architecture
Chemical-BERT-Uncased uses the standard BERT architecture, adapted to the chemical sector through continued pre-training. It is trained with masked language modeling (MLM), in which the model predicts masked words in a sentence and thereby learns bidirectional representations. Unlike RNNs and autoregressive models, which read text in a single direction, MLM lets the model condition on context from both the left and the right of each masked token.
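To make this bidirectional prediction concrete, the sketch below masks one word in a sentence and asks the model for its most likely replacements. The example sentence is an illustrative assumption, not taken from the model card.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("recobo/chemical-bert-uncased")
model = AutoModelForMaskedLM.from_pretrained("recobo/chemical-bert-uncased")

# Mask one token; the model sees context on both sides of [MASK].
text = f"Sodium chloride is a common {tokenizer.mask_token} used in many industrial processes."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and take the five highest-scoring vocabulary ids.
mask_position = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_position].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))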
Training
The model was trained on over 250,000 chemical-domain tokens, drawn from a training dataset of more than 9.2 million paragraphs. The MLM objective masks 15% of the tokens in each input sentence and requires the model to predict them, which strengthens its grasp of context.
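For readers who want to see how 15% masking is set up in practice, the sketch below uses the standard Hugging Face data collator. It illustrates the general MLM recipe rather than the exact training script behind this model, and the example sentence is an assumption.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("recobo/chemical-bert-uncased")

# Standard MLM collator: masks 15% of tokens, as in the original BERT recipe.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

# Tokenize an illustrative sentence and build one masked training batch.
encoding = tokenizer("Sulfuric acid is a highly corrosive mineral acid.")
batch = collator([encoding["input_ids"]])

# 'labels' keeps the original ids at masked positions and -100 everywhere else.
print(batch["input_ids"])
print(batch["labels"])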
Guide: Running Locally
To use Chemical-BERT-Uncased locally, follow these steps:
- Install the Transformers library:

  pip install transformers

- Load the model and pipeline:

  from transformers import pipeline

  fill_mask = pipeline(
      "fill-mask",
      model="recobo/chemical-bert-uncased",
      tokenizer="recobo/chemical-bert-uncased"
  )

- Run a sample inference:

  fill_mask("we create [MASK]")
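The fill-mask pipeline returns a list of candidate completions, each a dictionary containing the predicted token (token and token_str), its probability (score), and the completed sentence (sequence), ordered from most to least likely.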
For efficient processing, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure.
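If a GPU is available, locally or on one of the cloud services above, the pipeline can be placed on it via the device argument; the index 0 below assumes the first CUDA device.

from transformers import pipeline

# device=0 places the model on the first GPU; omit it (or use device=-1) for CPU.
fill_mask = pipeline(
    "fill-mask",
    model="recobo/chemical-bert-uncased",
    tokenizer="recobo/chemical-bert-uncased",
    device=0,
)
print(fill_mask("we create [MASK]"))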
License
Chemical-BERT-Uncased is distributed under the license terms published alongside the model on Hugging Face, which govern usage and distribution. Always ensure compliance with those terms when using the model.