ibm/materials.mhg-ged

Introduction

We present MHG-GNN, an autoencoder architecture with an encoder based on Graph Neural Networks (GNN) and a decoder based on a sequential model with Molecular Hypergraph Grammar (MHG). This design allows MHG-GNN to accept any molecule as input and offers high predictive performance on molecular graph data, while the grammar-based decoder guarantees that every generated molecule is structurally valid.

Architecture

MHG-GNN consists of two main components:

  • Encoder: Utilizes a variant of GNN to process molecular graph data.
  • Decoder: Based on MHG, it ensures the output is always a structurally valid molecule.
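As a toy illustration of the GNN-encoder idea only (this is not the MHG-GNN implementation; the layer, features, and graph below are made up for exposition), a single message-passing step over a molecular graph's adjacency matrix can be sketched in PyTorch:

```python
import torch

# Toy message-passing layer: each atom's feature vector is updated by
# averaging its neighbors' features and applying a learned linear map.
# Illustrative only; the actual MHG-GNN encoder is a GNN variant whose
# details live in the repository.
class ToyGNNLayer(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin = torch.nn.Linear(dim, dim)

    def forward(self, x, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        msg = adj @ x / deg          # mean over neighboring atoms
        return torch.relu(self.lin(msg))

# Toy 3-atom chain (C-C-O, ethanol's heavy atoms) as an adjacency matrix.
adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])
x = torch.randn(3, 8)                # 8-dim node features, one row per atom
h = ToyGNNLayer(8)(x, adj)           # one updated vector per atom
print(h.shape)
```

In MHG-GNN, node representations like these are pooled into a single molecule-level embedding, which the MHG decoder then maps back to a valid molecular structure.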

Training

Pre-trained models of MHG-GNN are available, trained on a dataset of approximately 1.34 million molecules from PubChem. The training environment has been tested on Intel E5-2667 CPUs and NVIDIA A100 Tensor Core GPUs.

Guide: Running Locally

  1. Installation:

    • Create and activate a virtual environment:
      python3 -m venv .venv
      . .venv/bin/activate
      
    • Clone the repository and install dependencies:
      git clone git@github.ibm.com:CMD-TRL/mhg-gnn.git
      cd ./mhg-gnn
      pip install .
      
  2. Feature Extraction:

    • Use the example notebook mhg-gnn_encoder_decoder_example.ipynb for loading checkpoints and using the model.
    • Load the model with:
      import torch
      import load  # module provided by the mhg-gnn repository
      
      model = load.load()
      
    • Encode SMILES strings into embeddings (reprs is used instead of repr to avoid shadowing the Python builtin):
      with torch.no_grad():
          reprs = model.encode(["CCO", "O=C=O", "OC(=O)c1ccccc1C(=O)O"])
      
    • Decode embeddings back into SMILES strings:
      orig = model.decode(reprs)
      
  3. Suggested Cloud GPUs:

    • Consider using NVIDIA A100 Tensor Core GPUs for optimal performance during training and inference.
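The embeddings returned by the encoder in step 2 are plain tensors and can feed downstream feature-extraction tasks directly. A minimal sketch using stand-in vectors (real embeddings would come from model.encode; the values below are illustrative, not model outputs):

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings; in practice these come from model.encode([...]).
# Each row plays the role of one molecule's embedding vector.
emb = torch.tensor([
    [1.0, 0.1, 0.0],   # stand-in for "CCO"
    [0.9, 0.2, 0.1],   # stand-in for a structurally similar molecule
    [0.0, 0.0, 1.0],   # stand-in for a dissimilar molecule
])

# Cosine similarity in embedding space is a common downstream use,
# e.g. for nearest-neighbor retrieval of candidate materials.
sim_close = F.cosine_similarity(emb[0], emb[1], dim=0)
sim_far = F.cosine_similarity(emb[0], emb[2], dim=0)
print(float(sim_close), float(sim_far))
```

The same tensors can also serve as input features for property-prediction models such as regressors or classifiers.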

License

This project is licensed under the Apache License 2.0.
