chEMBL_smiles_v1

mrm8488

Introduction

This project trains a Masked Language Model (MLM) similar to RoBERTa for de novo drug design. The model learns to generate plausible SMILES strings representing molecular structures, which can then be proposed as candidate drugs.

Architecture

The model is built on a Masked Language Model (MLM) architecture akin to RoBERTa. It takes SMILES strings, textual representations of chemical molecules, as input and learns to propose new molecular combinations.
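As a quick sketch (assuming the checkpoint follows the standard RoBERTa layout on the Hugging Face Hub), the weights and tokenizer can be loaded with the Auto classes to inspect how a SMILES string is split into tokens; the aspirin SMILES below is just an illustrative input:

    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # Load the tokenizer and masked-LM head from the Hub
    tokenizer = AutoTokenizer.from_pretrained("mrm8488/chEMBL_smiles_v1")
    model = AutoModelForMaskedLM.from_pretrained("mrm8488/chEMBL_smiles_v1")

    # Inspect how the vocabulary tokenizes a SMILES string (aspirin)
    print(tokenizer.tokenize("CC(=O)Oc1ccccc1C(=O)O"))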

Training

The MLM is trained from scratch on 438,552 cleaned SMILES strings. The cleaning step, performed with a dedicated script, removes duplicates, salts, and stereochemical information so that the model learns molecular patterns it can use to generate new SMILES sequences.
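The exact cleaning script is not reproduced here, but a minimal sketch of the described steps (salt removal, stereochemistry stripping, and de-duplication) using RDKit could look like the following; the function name and its list-based I/O are illustrative assumptions:

    from rdkit import Chem
    from rdkit.Chem.SaltRemover import SaltRemover

    def clean_smiles(smiles_list):
        """Illustrative cleaning pass: strip salts and stereochemistry, drop duplicates."""
        remover = SaltRemover()              # RDKit's default salt definitions
        seen, cleaned = set(), []
        for smi in smiles_list:
            mol = Chem.MolFromSmiles(smi)
            if mol is None:                  # skip unparseable entries
                continue
            mol = remover.StripMol(mol)          # remove salt fragments
            Chem.RemoveStereochemistry(mol)      # drop stereochemical annotations in place
            canonical = Chem.MolToSmiles(mol)    # canonical form makes duplicates comparable
            if canonical and canonical not in seen:
                seen.add(canonical)
                cleaned.append(canonical)
        return cleaned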

Guide: Running Locally

  1. Install Dependencies: Ensure you have the transformers library installed:

    pip install transformers
    
  2. Initialize the Pipeline: Use the Hugging Face pipeline for fill-mask tasks.

    from transformers import pipeline
    
    fill_mask = pipeline(
        "fill-mask",
        model='mrm8488/chEMBL_smiles_v1',
        tokenizer='mrm8488/chEMBL_smiles_v1'
    )
    
  3. Generate SMILES: Input a SMILES string containing a <mask> token to get predictions; the result format is shown after this list.

    smile1 = "CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)<mask>"
    result = fill_mask(smile1)
    

Cloud GPUs: For efficient training and inference, consider using cloud GPU services like AWS EC2, Google Cloud, or Azure.

License

The model and code are shared by Manuel Romero under terms specified on the Hugging Face platform. For detailed licensing information, consult the Hugging Face website or repository.
