SmilesTokenizer_PubChem_1M
DeepChem
Introduction
The SmilesTokenizer_PubChem_1M model is a RoBERTa-based model developed for feature extraction from SMILES (Simplified Molecular Input Line Entry System) strings, a line notation for representing chemical compounds. It is trained on a subset of 1 million SMILES strings drawn from the PubChem 77M dataset within MoleculeNet.
Architecture
This model utilizes the RoBERTa architecture, a robustly optimized BERT variant that improves on masked language modeling pre-training. It is specifically tailored to handle SMILES strings as input.
Training
The model was trained on 1 million SMILES strings sourced from PubChem, a dataset large and varied enough for it to learn a broad range of chemical structure representations.
Guide: Running Locally
- Setup Environment: Ensure Python and PyTorch are installed on your machine. Install the Hugging Face Transformers library using pip:
pip install transformers
- Download the Model: Clone the model repository or download it directly from Hugging Face's model hub.
- Run Inference: Utilize the model in your Python script to perform feature extraction on SMILES strings:
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('DeepChem/SmilesTokenizer_PubChem_1M')
model = RobertaModel.from_pretrained('DeepChem/SmilesTokenizer_PubChem_1M')

inputs = tokenizer("CCO", return_tensors='pt')  # "CCO" is the SMILES string for ethanol
outputs = model(**inputs)
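The model returns one hidden-state vector per token, so a common follow-up step is to pool them into a single fixed-size feature vector per molecule. The model card does not prescribe a pooling strategy; the sketch below uses masked mean pooling, with dummy tensors standing in for the tokenizer and model outputs (assuming the RoBERTa-base hidden size of 768):

```python
import torch

# Dummy stand-ins for tokenizer/model output: a batch of 2 sequences,
# 5 token positions each, hidden size 768 (RoBERTa-base default).
# In practice these come from inputs['attention_mask'] and
# outputs.last_hidden_state in the snippet above.
last_hidden_state = torch.randn(2, 5, 768)
attention_mask = torch.tensor([[1, 1, 1, 0, 0],   # 3 real tokens, 2 padding
                               [1, 1, 1, 1, 1]])  # 5 real tokens

# Masked mean pooling: average only over real (non-padding) positions
mask = attention_mask.unsqueeze(-1).float()        # (2, 5, 1)
features = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(features.shape)  # torch.Size([2, 768])
```

The resulting `features` tensor can be fed to a downstream classifier or regressor for molecular property prediction.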
- Cloud GPU Recommendation: For more efficient processing, consider using cloud-based GPU services like AWS EC2, Google Cloud, or Azure.
License
The SmilesTokenizer_PubChem_1M model is available under the MIT License, which permits use, distribution, and modification without restriction, provided the copyright and license notice are retained.