seyonec/ChemBERTa-zinc-base-v1
Introduction
ChemBERTa is a BERT-like transformer model designed for masked language modeling of chemical SMILES strings. The effort is an early application to chemistry and materials science of the transfer-learning techniques popular in NLP and computer vision. The model was trained with Hugging Face's libraries and a byte-level BPE tokenizer on a dataset of 100,000 SMILES strings from the ZINC database.
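As an illustration of this tokenizer setup, the sketch below trains a byte-level BPE tokenizer on a plain-text file of SMILES strings with Hugging Face's tokenizers library. The file name, vocabulary size, and special tokens are assumptions for the example, not the exact values used for this checkpoint.

```python
# Illustrative sketch: fit a byte-level BPE tokenizer on SMILES strings.
# "zinc_smiles.txt" (one SMILES per line) and the vocab size are placeholders.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["zinc_smiles.txt"],      # hypothetical corpus file
    vocab_size=52_000,              # assumed vocabulary size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("chemberta_tokenizer", exist_ok=True)
tokenizer.save_model("chemberta_tokenizer")  # writes vocab.json and merges.txt
```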
Architecture
ChemBERTa employs the RoBERTa architecture, a robustly optimized BERT pretraining approach, tailored here to chemical SMILES sequences. Given a sequence with masked positions, the model predicts the missing tokens, which can be read as proposing plausible variants of a molecule within a defined chemical space.
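A minimal way to exercise this masked-token prediction is the transformers fill-mask pipeline. The sketch below assumes the checkpoint is published on the Hugging Face Hub as seyonec/ChemBERTa-zinc-base-v1; the masked SMILES string is only an illustration.

```python
# Minimal sketch: masked-token prediction over a SMILES string with the
# fill-mask pipeline. The model id and the example SMILES are assumptions.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="seyonec/ChemBERTa-zinc-base-v1",
    tokenizer="seyonec/ChemBERTa-zinc-base-v1",
)

# Mask one token of a benzene-like SMILES and ask for likely completions,
# i.e. plausible local variants of the molecule.
smiles = "C1=CC=CC<mask>C1"
for candidate in fill_mask(smiles):
    print(candidate["sequence"], candidate["score"])
```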
Training
ChemBERTa was trained with the RoBERTa architecture for five epochs, reaching a masked-language-modeling loss of 0.398; extending the training duration would likely reduce the loss further. Learning representations of functional groups and atoms is what makes the model useful for problems such as toxicity, solubility, drug-likeness, and synthetic accessibility. These learned representations can be combined with graph convolution and attention models over molecular graph structures, or the BERT-style model can be fine-tuned directly on downstream tasks.
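One way to reuse the learned representations downstream is to extract hidden states from the pre-trained encoder and feed them to a separate property predictor. The sketch below assumes the seyonec/ChemBERTa-zinc-base-v1 model id and uses the first-token hidden state as a molecule-level embedding; both are illustrative choices rather than a prescribed recipe.

```python
# Sketch: turn SMILES strings into fixed-size embeddings for a downstream
# property model (toxicity, solubility, ...). Pooling choice is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "seyonec/ChemBERTa-zinc-base-v1"   # assumed Hub model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
inputs = tokenizer(smiles, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Use the <s> (CLS-equivalent) token's hidden state as a molecule-level embedding.
embeddings = outputs.last_hidden_state[:, 0, :]
print(embeddings.shape)  # (3, hidden_size)
```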
Guide: Running Locally
- Clone the repository to access training, uploading, and evaluation notebooks.
- Transfer the notebooks to a Colab runtime for seamless execution.
- Train and evaluate the model on your own dataset, or use the pre-trained model for predictions (a fine-tuning sketch follows this list).
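For the train-and-evaluate path, a fine-tuning setup along the lines below can serve as a starting point. The CSV files, column names, label count, and hyperparameters are placeholders, not values taken from this repository's notebooks.

```python
# Hypothetical fine-tuning sketch: adapt the pre-trained checkpoint to a
# binary molecular-property label with the Trainer API.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "seyonec/ChemBERTa-zinc-base-v1"   # assumed Hub model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Placeholder CSVs with "smiles" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
dataset = dataset.map(
    lambda batch: tokenizer(
        batch["smiles"], truncation=True, padding="max_length", max_length=128
    ),
    batched=True,
)
dataset = dataset.rename_column("label", "labels")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chemberta-finetuned", num_train_epochs=3),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```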
Suggested Cloud GPUs
- Google Colab
- NVIDIA GPUs on AWS
- Google Cloud Platform's AI Platform
License
The model and associated resources are released under open-source licensing to encourage further research and application in computational chemistry.