SmilesTokenizer_PubChem_1M

DeepChem

Introduction
The SmilesTokenizer_PubChem_1M model is a RoBERTa-based model developed for feature extraction from SMILES strings, a line notation for representing chemical structures. It is trained on a subset of 1 million SMILES strings drawn from the PubChem 77M dataset within MoleculeNet.

Architecture
This model uses the RoBERTa architecture, a robustly optimized BERT variant that improves on masked language modeling pre-training. Its tokenizer is tailored to SMILES notation rather than natural language.

Training
The model was trained on 1 million SMILES strings sourced from PubChem, a dataset large and diverse enough for it to learn a broad range of chemical representations.

Guide: Running Locally

  1. Setup Environment: Ensure Python and PyTorch are installed on your machine. Install the Hugging Face Transformers library using pip:
    pip install transformers
    
  2. Download the Model: Clone the model repository or download it directly from Hugging Face's model hub.
  3. Run Inference: Utilize the model in your Python script to perform feature extraction on SMILES strings:
    from transformers import RobertaTokenizer, RobertaModel
    
    # Load the tokenizer and model weights from the Hugging Face Hub
    tokenizer = RobertaTokenizer.from_pretrained('DeepChem/SmilesTokenizer_PubChem_1M')
    model = RobertaModel.from_pretrained('DeepChem/SmilesTokenizer_PubChem_1M')
    
    inputs = tokenizer("CCO", return_tensors='pt')  # "CCO" is the SMILES string for ethanol
    outputs = model(**inputs)
    features = outputs.last_hidden_state  # per-token feature vectors
    
  4. Cloud GPU Recommendation: For more efficient processing, consider using cloud-based GPU services like AWS EC2, Google Cloud, or Azure.

License
The SmilesTokenizer_PubChem_1M model is available under the MIT License, which permits use, distribution, and modification, provided the copyright notice and license text are retained.
