chEMBL_smiles_v1
mrm8488
Introduction
This project trains a masked language model (MLM), similar to RoBERTa, for de novo drug design. The model learns to generate plausible SMILES strings, which encode molecular structures and can be proposed as candidate drugs.
Architecture
The model uses a RoBERTa-style architecture trained with a masked language modeling (MLM) objective. Its input is SMILES strings, which are textual representations of chemical molecules; from these it learns patterns that can be used to propose new molecular structures.
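To make the masked-language-modeling idea concrete, here is a minimal sketch of how a SMILES string could be split into tokens and partially masked so that a model has to recover the hidden pieces. The regex tokenizer, the `mask_tokens` helper, and the masking rate are illustrative assumptions; the source does not describe the tokenizer or masking scheme actually used, and RoBERTa-style training applies a more elaborate replacement rule than the plain substitution shown here.

```python
import random
import re

MASK = "<mask>"

# Assumption: a simple atom-level split (bracket atoms, Cl/Br, then single
# characters). The published model may tokenize SMILES differently.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into coarse atom/bond/ring tokens."""
    return TOKEN_PATTERN.findall(smiles)


def mask_tokens(tokens, mask_prob, rng):
    """Replace each token with MASK with probability mask_prob.

    Returns (inputs, labels): labels hold the original token at masked
    positions and None elsewhere, so a loss would only be computed on
    the masked positions.
    """
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


tokens = tokenize("CC(=O)Cl")
# mask_prob=0.15 is an illustrative value, not one stated in the source.
inputs, labels = mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0))
```

During training, the model sees `inputs` and is scored on how well it predicts the non-`None` entries of `labels`.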
Training
The MLM is trained from scratch on 438,552 cleaned SMILES strings. A dedicated cleaning script removes duplicates, salts, and stereochemical information before training. The goal is for the model to learn molecular patterns that it can then use to generate new SMILES sequences.
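The cleaning script itself is not shown in the source. As a rough illustration of the three steps it names, the sketch below works purely at the string level: it keeps the longest dot-separated fragment as a stand-in for salt removal, deletes stereo markers, and drops exact duplicates. All function names are hypothetical, and this is not the author's script. A real pipeline would normally use a cheminformatics toolkit, since string edits cannot canonicalize molecules (for example, `[C@@H]` becomes `[CH]` rather than `C`).

```python
import re


def strip_salts(smiles):
    """Keep the longest dot-separated fragment (crude salt removal)."""
    return max(smiles.split("."), key=len)


def strip_stereo(smiles):
    """Delete tetrahedral (@, @@) and double-bond (/, \\) stereo markers."""
    return re.sub(r"[@/\\]", "", smiles)


def clean_smiles(raw):
    """Apply salt and stereo stripping, then drop exact duplicates.

    Order of first appearance is preserved.
    """
    seen = set()
    cleaned = []
    for s in raw:
        s = strip_stereo(strip_salts(s.strip()))
        if s and s not in seen:
            seen.add(s)
            cleaned.append(s)
    return cleaned


cleaned = clean_smiles(["CCO.Cl", "CCO", "F/C=C/F", "FC=CF"])
```

Because duplicates are detected by exact string match, two different spellings of the same molecule would both survive; canonicalizing SMILES before deduplication avoids that.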
Guide: Running Locally
- Install Dependencies: Ensure you have the transformers library installed:

    pip install transformers

- Initialize the Pipeline: Use the Hugging Face pipeline for fill-mask tasks.

    from transformers import pipeline

    fill_mask = pipeline(
        "fill-mask",
        model='mrm8488/chEMBL_smiles_v1',
        tokenizer='mrm8488/chEMBL_smiles_v1'
    )

- Generate SMILES: Input a SMILES string with a masked token to get predictions.

    smile1 = "CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)<mask>"
    result = fill_mask(smile1)
Cloud GPUs: For efficient training and inference, consider using cloud GPU services like AWS EC2, Google Cloud, or Azure.
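For an input with a single masked token, the fill-mask pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to rank and print those candidates. `rank_predictions` and `demo` are hypothetical helper names, not part of the model card, and `demo` downloads the model when called, so it is defined but not run here.

```python
def rank_predictions(results, top_k=3):
    """Return (token_str, sequence, score) for the top_k highest-scoring fills."""
    ordered = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], r["sequence"], r["score"]) for r in ordered[:top_k]]


def demo():
    """Run the model on the example from the guide and print ranked candidates."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import pipeline

    fill_mask = pipeline(
        "fill-mask",
        model="mrm8488/chEMBL_smiles_v1",
        tokenizer="mrm8488/chEMBL_smiles_v1",
    )
    smile1 = (
        "CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)"
        "S(=O)(=O)c1ccc(N)<mask>"
    )
    for token, sequence, score in rank_predictions(fill_mask(smile1)):
        print(f"{score:.4f}  {token!r}  {sequence}")
```

Calling `demo()` prints one line per candidate. A completed string is not guaranteed to be a valid molecule, so checking each candidate with a cheminformatics toolkit before using it further is a sensible next step.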
License
The model and code are shared by Manuel Romero under terms specified on the Hugging Face platform. For detailed licensing information, consult the Hugging Face website or repository.