Update README.md
README.md CHANGED
@@ -16,7 +16,7 @@ tags:

# Model Card for ChemFIE-SA (Synthesis Accessibility)

This model is a BERT-like sequence classifier for predicting the synthesis accessibility of a compound given its SELFIES string, fine-tuned from [gbyuvd/chemselfies-base-bertmlm](https://huggingface.co/gbyuvd/chemselfies-base-bertmlm) on the DeepSA expanded train dataset (Wang et al. 2023).

### Disclaimer: For Academic Purposes Only

@@ -67,12 +67,12 @@ def smiles_to_selfies_sentence(smiles):

```python
def smiles_to_selfies_sentence(smiles):
    ...
    return None

# Example usage:
in_smi = "C1CCC(CC1)(CC(=O)O)CN"  # Gabapentin (CID 3446)
selfies_sentence = smiles_to_selfies_sentence(in_smi)
print(selfies_sentence)

"""
[C] [C] [C] [C] [Branch1] [Branch1] [C] [C] [Ring1] [=Branch1] [Branch1] [#Branch1] [C] [C] [=Branch1] [C] [=O] [O] [C] [N]
"""
```
@@ -86,88 +86,138 @@ You can also use pipeline:

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="gbyuvd/synthaccess-chemselfies")
classifier("[C] [C] [C] [C] [Branch1] [Branch1] [C] [C] [Ring1] [=Branch1] [Branch1] [#Branch1] [C] [C] [=Branch1] [C] [=O] [O] [C] [N]")  # Gabapentin
# [{'label': 'Easy', 'score': 0.9187200665473938}]
```
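The pipeline returns only the top label and its score. As a hypothetical post-processing helper (not part of the model card's API), the prediction can be mapped back to the dataset's 0/1 labels using the 0.50 decision threshold used in evaluation:

```python
def to_dataset_label(pred, threshold=0.50):
    """Map a pipeline prediction like {'label': 'Easy', 'score': 0.92}
    to the dataset's labels: 0 = easy synthesis, 1 = hard synthesis."""
    p_easy = pred["score"] if pred["label"] == "Easy" else 1.0 - pred["score"]
    return 0 if p_easy >= threshold else 1

print(to_dataset_label({"label": "Easy", "score": 0.9187200665473938}))  # 0
```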
## Training Details

### Training Data

##### Data Sources

Training data was fetched from [DeepSA's repository](https://github.com/Shihang-Wang-58/DeepSA).

##### Data Preparation

- SMILES strings were converted into SELFIES.
- The data were chunked into three parts to accommodate Paperspace Gradient's 6-hour runtime limit.
- Each chunk was then split into train and validation sets at a 90:10 ratio.
  - 1st chunk size: 1,197,683 (1,077,915 train : 119,768 validation)
- The data contain labels for:
  - 0: Easy synthesis (requires fewer than 10 steps)
  - 1: Hard synthesis (requires more than 10 steps)
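A quick sanity check of the chunk numbers above (assuming the validation count is rounded, and that the per-chunk step count equals ceil(train size / batch size) for the batch size of 128 used in training):

```python
import math

n_total = 1_197_683            # 1st chunk size
n_val = round(n_total * 0.10)  # validation share of the 90:10 split
n_train = n_total - n_val

print(n_train, n_val)          # 1077915 119768

# With batch size 128 and 1 epoch, the optimizer steps per chunk:
print(math.ceil(n_train / 128))  # 8422
```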
### Training Procedure

#### Training Hyperparameters

- Epochs = 1 for each chunk
- Batch size = 128
- Number of steps for each chunk: 8,422

I am using Ranger21 with this configuration:

```
Ranger21 optimizer ready with following settings:

Core optimizer = [madgrad](https://arxiv.org/abs/2101.11075)
Learning rate of 5e-06

Important - num_epochs of training = ** 1 epochs **

using AdaBelief for variance computation
Warm-up: linear warmup, over 2000 iterations

Lookahead active, merging every 5 steps, with blend factor of 0.5
Norm Loss active, factor = 0.0001
Stable weight decay of 0.01
Gradient Centralization = On

Adaptive Gradient Clipping = True
    clipping value of 0.01
    steps for clipping = 0.001
```
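As a rough illustration (not Ranger21's actual implementation, which also involves madgrad, lookahead, and other internals), the linear warm-up in the config above corresponds to a schedule like:

```python
BASE_LR = 5e-6       # learning rate from the config above
WARMUP_ITERS = 2000  # linear warmup length, in optimizer steps

def warmup_lr(step):
    """Linearly scale the LR from 0 up to BASE_LR over WARMUP_ITERS steps."""
    return BASE_LR * min(1.0, step / WARMUP_ITERS)

print(warmup_lr(500))   # 1.25e-06
print(warmup_lr(2000))  # 5e-06
```

Note that with 8,422 steps per chunk, roughly the first quarter of each chunk's training runs under a reduced learning rate.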
1st Chunk:

| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 | ROC AUC |
| :--: | :-----------: | :-------------: | :------: | :-------: | :------: | :------: | :------: |
| 8420 | 0.128700 | 0.128632 | 0.922860 | 0.975201 | 0.867836 | 0.918391 | 0.990007 |
## Model Evaluation

### Testing Data

The model (currently trained only on the 1st chunk) was evaluated using the main expanded test set and three independent test sets provided by DeepSA's authors (Wang et al. 2023), to ensure comprehensive performance assessment across various scenarios:

1. **Main Expanded Test Set**
2. **Independent Test Set 1 (TS1)**
   - Characteristics: contains ES and HS compounds with high intra-group fingerprint similarity but significant inter-group pattern differences.
3. **Independent Test Set 2 (TS2)**
   - Characteristics: contains a small portion of ES and HS molecules showing similar fingerprint patterns.
4. **Independent Test Set 3 (TS3)**
   - Characteristics: all compounds exhibit high fingerprint similarity, presenting the most challenging classification task.
### Evaluation Metrics

We employed a comprehensive set of metrics to evaluate our model's performance:

1. **Accuracy (ACC)**: overall correctness of predictions
2. **Recall**: ability to identify all relevant instances (sensitivity)
3. **Precision**: accuracy of positive predictions
4. **F1-score**: harmonic mean of precision and recall
5. **Area Under the Receiver Operating Characteristic curve (AUROC)**: the model's ability to distinguish between classes

All metrics were evaluated using a threshold of 0.50 for binary classification.
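For reference, the first four metrics reduce to simple expressions over confusion-matrix counts at the chosen threshold (AUROC, in contrast, requires the full score distribution, not just counts at one threshold). A generic sketch:

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return acc, precision, recall, f1

# Toy example: accuracy, precision, recall, f1 on a balanced case
print(binary_metrics(tp=40, fp=10, fn=10, tn=40))
```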
### Results

Below are the detailed results of our model's performance across all test sets.

#### Expanded Test Set Results

Comparison data is sourced from Wang et al. (2023), who used various models as the encoding layer:

- bert-mini (MinBert)
- bert-tiny (TinBert)
- roberta-base (RoBERTa)
- deberta-v3-base (DeBERTa)
- Chem_GraphCodeBert (GraphCodeBert)
- electra-small-discriminator (SmELECTRA)
- ChemBERTa-77M-MTR (ChemMTR)
- ChemBERTa-77M-MLM (ChemMLM)

These were trained/fine-tuned to predict based on SMILES, while ChemFIE-SA is SELFIES-based:

| **Model** | **Recall** | **Precision** | **F-score** | **AUROC** |
| -------------------- | :--------: | :-----------: | :---------: | :-------: |
| DeepSA_DeBERTa | 0.873 | 0.920 | 0.896 | 0.959 |
| DeepSA_GraphCodeBert | 0.931 | 0.944 | 0.937 | 0.987 |
| DeepSA_MinBert | 0.933 | 0.945 | 0.939 | 0.988 |
| DeepSA_RoBERTa | 0.940 | 0.940 | 0.940 | 0.988 |
| DeepSA_TinBert | 0.937 | 0.947 | 0.942 | 0.990 |
| DeepSA_SmELECTRA | 0.938 | 0.949 | 0.943 | 0.990 |
| **ChemFIE-SA** | 0.952 | 0.942 | 0.947 | 0.990 |
| DeepSA_ChemMLM | 0.955 | 0.967 | 0.961 | 0.995 |
| DeepSA_ChemMTR | 0.968 | 0.974 | 0.971 | 0.997 |
#### TS1-3 Results

Comparison with DeepSA_SmELECTRA as described in Wang et al. (2023):

| Datasets | Model | ACC | Recall | Precision | F-score | AUROC | Threshold |
| -------- | ---------- | :---: | :----: | :-------: | :-----: | :---: | :-------: |
| TS1 | DeepSA | 0.995 | 1.000 | 0.990 | 0.995 | 1.000 | 0.500 |
| | ChemFIE-SA | 0.996 | 1.000 | 0.992 | 0.996 | 1.000 | 0.500 |
| TS2 | DeepSA | 0.838 | 0.730 | 0.871 | 0.795 | 0.913 | 0.500 |
| | ChemFIE-SA | 0.805 | 0.775 | 0.770 | 0.773 | 0.886 | 0.500 |
| TS3 | DeepSA | 0.817 | 0.753 | 0.864 | 0.805 | 0.896 | 0.500 |
| | ChemFIE-SA | 0.731 | 0.642 | 0.781 | 0.705 | 0.797 | 0.500 |
## Model Examination

You can visualize its attention heads using [BertViz](https://github.com/jessevig/bertviz) and attribution weights using [Captum](https://captum.ai/), as [done in the base model](https://huggingface.co/gbyuvd/chemselfies-base-bertmlm) in its Interpretability section.

### Compute Infrastructure

#### Hardware