SmilesTokenizer_PubChem_1M
DeepChem
Introduction
The SmilesTokenizer_PubChem_1M model is a RoBERTa-based model developed for feature extraction from SMILES (Simplified Molecular Input Line Entry System) strings, a line notation for representing chemical compounds. It is trained on a subset of 1 million SMILES strings drawn from the PubChem 77M dataset within MoleculeNet.
Architecture
This model utilizes the RoBERTa architecture, a robustly optimized BERT variant that improves on masked language modeling pre-training. It is specifically tailored to handle SMILES strings as input.
Training
The model was trained on 1 million SMILES strings sourced from PubChem, a dataset large and varied enough for it to learn a broad range of chemical structure representations.
Guide: Running Locally
- Setup Environment: Ensure Python and PyTorch are installed on your machine. Install the Hugging Face Transformers library using pip:
pip install transformers
- Download the Model: Clone the model repository or download it directly from Hugging Face's model hub.
- Run Inference: Utilize the model in your Python script to perform feature extraction on SMILES strings:
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('DeepChem/SmilesTokenizer_PubChem_1M')
model = RobertaModel.from_pretrained('DeepChem/SmilesTokenizer_PubChem_1M')

inputs = tokenizer("CCO", return_tensors='pt')  # "CCO" is the SMILES string for ethanol
outputs = model(**inputs)
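The model returns one hidden-state vector per token, so a common follow-up step is to pool them into a single fixed-size feature vector per molecule. The model card does not prescribe a pooling strategy; the sketch below uses masked mean pooling, with dummy tensors standing in for the tokenizer and model outputs (assuming the RoBERTa-base hidden size of 768):

```python
import torch

# Dummy stand-ins for tokenizer/model output: a batch of 2 sequences,
# 5 token positions each, hidden size 768 (RoBERTa-base default).
# In practice these come from inputs['attention_mask'] and
# outputs.last_hidden_state in the snippet above.
last_hidden_state = torch.randn(2, 5, 768)
attention_mask = torch.tensor([[1, 1, 1, 0, 0],   # 3 real tokens, 2 padding
                               [1, 1, 1, 1, 1]])  # 5 real tokens

# Masked mean pooling: average only over real (non-padding) positions
mask = attention_mask.unsqueeze(-1).float()        # (2, 5, 1)
features = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(features.shape)  # torch.Size([2, 768])
```

The resulting `features` tensor can be fed to a downstream classifier or regressor for molecular property prediction.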
- Cloud GPU Recommendation: For more efficient processing, consider using cloud-based GPU services like AWS EC2, Google Cloud, or Azure.
License
The SmilesTokenizer_PubChem_1M model is available under the MIT License, which permits use, distribution, and modification without restriction, provided the copyright and license notice are retained.