gbyuvd committed (verified) · Commit b170da8 · Parent(s): 6297d4a

Update README.md

Files changed (1):
  1. README.md +98 -48
README.md CHANGED
@@ -16,7 +16,7 @@ tags:
 
 # Model Card for ChemFIE-SA (Synthesis Accessibility)
 
- This model is a BERT-like sequence classifier for 221 human protein drug targets, fine-tuned from [gbyuvd/chemselfies-base-bertmlm](https://huggingface.co/gbyuvd/chemselfies-base-bertmlm) on a dataset derived from ChemBL34 (Zdrazil et al. 2023). It predicts using chemical structures represented as SELFIES (Self-Referencing Embedded Strings).
+ This model is a BERT-like sequence classifier that predicts the synthesis accessibility of a compound from its SELFIES (Self-Referencing Embedded Strings) representation, fine-tuned from [gbyuvd/chemselfies-base-bertmlm](https://huggingface.co/gbyuvd/chemselfies-base-bertmlm) on the expanded DeepSA training dataset (Wang et al. 2023).
 
 
 ### Disclaimer: For Academic Purposes Only
@@ -67,12 +67,12 @@ def smiles_to_selfies_sentence(smiles):
 return None
 
 # Example usage:
- in_smi = "C1CCC2=CN3C=CC4=C5C=CC=CC5=NC4=C3C=C2C1" # Sempervirine (CID168919)
+ in_smi = "C1CCC(CC1)(CC(=O)O)CN" # Gabapentin (CID3446)
 selfies_sentence = smiles_to_selfies_sentence(in_smi)
 print(selfies_sentence)
 
 """
- [C] [C] [C] [C] [=C] [N] [C] [=C] [C] [=C] [C] [=C] [C] [=C] [C] [Ring1] [=Branch1] [=N] [C] [Ring1] [=Branch2] [=C] [Ring1] [=N] [C] [=C] [Ring1] [P] [C] [Ring2] [Ring1] [Branch1]
+ [C] [C] [C] [C] [Branch1] [Branch1] [C] [C] [Ring1] [=Branch1] [Branch1] [#Branch1] [C] [C] [=Branch1] [C] [=O] [O] [C] [N]
 
 """
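The hunk above shows only the tail of `smiles_to_selfies_sentence`. For reference, a minimal sketch of such a helper, assuming the `selfies` and `rdkit` packages; the model card's actual implementation is not shown in this diff:

```python
# Hypothetical reconstruction of smiles_to_selfies_sentence; not the exact
# function from the model card, which this diff truncates.
import selfies as sf
from rdkit import Chem

def smiles_to_selfies_sentence(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparseable SMILES
        return None
    try:
        selfies_str = sf.encoder(Chem.MolToSmiles(mol))  # "[C][C]...[N]"
        tokens = list(sf.split_selfies(selfies_str))     # ["[C]", "[C]", ...]
        return " ".join(tokens)                          # "[C] [C] ... [N]"
    except sf.EncoderError:
        return None
```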
 
@@ -86,88 +86,138 @@ You can also use pipeline:
 from transformers import pipeline
 
 classifier = pipeline("text-classification", model="gbyuvd/synthaccess-chemselfies")
- classifier("[C] [C] [C] [C] [=C] [N] [C] [=C] [C] [=C] [C] [=C] [C] [=C] [C] [Ring1] [=Branch1] [=N] [C] [Ring1] [=Branch2] [=C] [Ring1] [=N] [C] [=C] [Ring1] [P] [C] [Ring2] [Ring1] [Branch1]") #Sempervirine (CID168919)
- #
+ classifier("[C] [C] [C] [C] [Branch1] [Branch1] [C] [C] [Ring1] [=Branch1] [Branch1] [#Branch1] [C] [C] [=Branch1] [C] [=O] [O] [C] [N]") # Gabapentin
+ # [{'label': 'Easy', 'score': 0.9187200665473938}]
 
 ```
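The pipeline returns only the top label; to see both class probabilities, the same prediction can be made with a plain forward pass. A sketch assuming the standard `transformers` auto-classes work for this checkpoint:

```python
# Sketch: pipeline-free inference; assumes the checkpoint loads via the
# standard auto-classes and exposes id2label ('Easy'/'Hard').
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "gbyuvd/synthaccess-chemselfies"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

sentence = "[C] [C] [C] [C] [Branch1] [Branch1] [C] [C] [Ring1] [=Branch1] [Branch1] [#Branch1] [C] [C] [=Branch1] [C] [=O] [O] [C] [N]"  # Gabapentin
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
for idx, p in enumerate(probs):
    print(model.config.id2label[idx], round(float(p), 4))
```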
 
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
-
 ## Training Details
 
 ### Training Data
+ 
 ##### Data Sources
- ##### Data Preparation
- 
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
- 
- [More Information Needed]
+ 
+ Training data was fetched from [DeepSA's repository](https://github.com/Shihang-Wang-58/DeepSA).
+ 
+ ##### Data Preparation
+ 
+ - SMILES strings were converted into SELFIES.
+ - The dataset was chunked into three parts to accommodate Paperspace Gradient's 6-hour session limit.
+ - Each chunk was then split 90:10 into train and validation sets (a sketch of this step follows the list).
+ - 1st chunk size: 1,197,683 (1,077,915 train : 119,768 validation)
+ - The data carries two labels:
+   - 0: Easy synthesis (requires fewer than 10 steps)
+   - 1: Hard synthesis (requires more than 10 steps)
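A sketch of the chunk-and-split step described above, assuming a pandas DataFrame with `selfies` and `label` columns; the file names, shuffling, seed, and stratification are illustrative assumptions, not the author's actual script:

```python
# Illustrative chunk/split; column names, file names, seed, and stratification
# are assumptions -- only the 3-way chunking and 90:10 split come from the card.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("deepsa_train_selfies.csv")  # hypothetical file name
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)  # shuffle first

for i, chunk in enumerate(np.array_split(df, 3), start=1):  # three chunks
    train, val = train_test_split(
        chunk, test_size=0.10, random_state=42, stratify=chunk["label"]
    )
    train.to_csv(f"chunk{i}_train.csv", index=False)  # e.g. 1,077,915 rows for chunk 1
    val.to_csv(f"chunk{i}_val.csv", index=False)      # e.g. 119,768 rows for chunk 1
```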
 
 ### Training Procedure
 
 #### Training Hyperparameters
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+ - Epochs = 1 per chunk
+ - Batch size = 128
+ - Number of steps per chunk: 8,422
+ 
+ I am using Ranger21 with this configuration:
 
+ ```
+ Ranger21 optimizer ready with following settings:
+ 
+ Core optimizer = [madgrad](https://arxiv.org/abs/2101.11075)
+ Learning rate of 5e-06
+ 
+ Important - num_epochs of training = ** 1 epochs **
+ using AdaBelief for variance computation
+ Warm-up: linear warmup, over 2000 iterations
+ 
+ Lookahead active, merging every 5 steps, with blend factor of 0.5
+ Norm Loss active, factor = 0.0001
+ Stable weight decay of 0.01
+ Gradient Centralization = On
+ 
+ Adaptive Gradient Clipping = True
+ clipping value of 0.01
+ steps for clipping = 0.001
+ ```
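A sketch of how the logged settings might map onto the `ranger21` package's constructor; the keyword names below are assumptions based on that package (github.com/lessw2020/Ranger21) and should be checked against the installed version:

```python
# Sketch only: keyword names are assumed, not confirmed by the model card;
# the log above is what Ranger21 prints at construction time.
from ranger21 import Ranger21

optimizer = Ranger21(
    model.parameters(),           # `model` = the classifier being fine-tuned
    lr=5e-6,
    num_epochs=1,                 # one epoch per chunk
    num_batches_per_epoch=8422,   # ~1,077,915 samples / batch size 128
    num_warmup_iterations=2000,   # linear warm-up, per the log
    weight_decay=0.01,            # stable weight decay
    use_madgrad=True,             # MADGRAD core optimizer
    use_adabelief=True,           # AdaBelief variance computation
)
```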
 
+ 1st Chunk:
+ 
+ | Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 | ROC AUC |
+ | :--: | :-----------: | :-------------: | :------: | :-------: | :------: | :------: | :------: |
+ | 8420 | 0.128700 | 0.128632 | 0.922860 | 0.975201 | 0.867836 | 0.918391 | 0.990007 |
 
- ## Evaluation
- 
- <!-- This section describes the evaluation protocols and provides the results. -->
- 
- ### Testing Data, Factors & Metrics
- 
- #### Testing Data
- 
- <!-- This should link to a Dataset Card if possible. -->
- 
- [More Information Needed]
- 
- #### Factors
- 
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
- 
- [More Information Needed]
- 
- #### Metrics
- 
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
- 
- [More Information Needed]
- 
- ### Results
- 
- [More Information Needed]
- 
- #### Summary
+ ## Model Evaluation
+ 
+ ### Testing Data
+ 
+ The model (currently trained only on the 1st chunk) was evaluated on four test sets provided by DeepSA's authors (Wang et al. 2023) to ensure comprehensive performance assessment across various scenarios:
+ 1. **Main Expanded Test Set**
+ 
+ 2. **Independent Test Set 1 (TS1)**
+ - Characteristics: Contains easy-synthesis (ES) and hard-synthesis (HS) compounds with high intra-group fingerprint similarity but significant inter-group pattern differences.
+ 
+ 3. **Independent Test Set 2 (TS2)**
+ - Characteristics: Contains a small portion of ES and HS molecules showing similar fingerprint patterns.
+ 
+ 4. **Independent Test Set 3 (TS3)**
+ - Characteristics: All compounds exhibit high fingerprint similarity, presenting the most challenging classification task.
+ 
+ ### Evaluation Metrics
+ 
+ We employed a comprehensive set of metrics to evaluate our model's performance:
+ 
+ 1. **Accuracy (ACC)**: Overall correctness of predictions
+ 2. **Recall**: Ability to identify all relevant instances (sensitivity)
+ 3. **Precision**: Accuracy of positive predictions
+ 4. **F1-score**: Harmonic mean of precision and recall
+ 5. **Area Under the Receiver Operating Characteristic curve (AUROC)**: Model's ability to distinguish between classes
+ 
+ All metrics were evaluated using a threshold of 0.50 for binary classification.
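These metrics can be reproduced with scikit-learn; a sketch assuming `y_true` holds the 0/1 labels and `y_prob` the model's predicted probability for the positive class (which class counts as positive is an assumption here, not stated in the card):

```python
# Sketch: scoring at the 0.50 threshold used above. y_true and y_prob are
# assumed NumPy arrays; the positive class is taken to be label 1 ("Hard").
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score,
)

y_pred = (y_prob >= 0.50).astype(int)  # apply the 0.50 decision threshold

print("ACC:      ", accuracy_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("AUROC:    ", roc_auc_score(y_true, y_prob))  # threshold-free, uses raw scores
```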
 
+ ### Results
+ 
+ Below are the detailed results of our model's performance across all test sets.
+ 
+ #### Expanded Test Set Results
+ 
+ Comparison data is sourced from Wang et al. (2023), who used various models as the encoding layer:
+ - bert-mini (MinBert)
+ - bert-tiny (TinBert)
+ - roberta-base (RoBERTa)
+ - deberta-v3-base (DeBERTa)
+ - Chem_GraphCodeBert (GraphCodeBert)
+ - electra-small-discriminator (SmELECTRA)
+ - ChemBERTa-77M-MTR (ChemMTR)
+ - ChemBERTa-77M-MLM (ChemMLM)
+ 
+ These were trained/fine-tuned to predict from SMILES, while ChemFIE-SA is SELFIES-based:
+ 
+ | **Model** | **Recall** | **Precision** | **F-score** | **AUROC** |
+ | -------------------- | :--------: | :-----------: | :---------: | :-------: |
+ | DeepSA_DeBERTa | 0.873 | 0.920 | 0.896 | 0.959 |
+ | DeepSA_GraphCodeBert | 0.931 | 0.944 | 0.937 | 0.987 |
+ | DeepSA_MinBert | 0.933 | 0.945 | 0.939 | 0.988 |
+ | DeepSA_RoBERTa | 0.940 | 0.940 | 0.940 | 0.988 |
+ | DeepSA_TinBert | 0.937 | 0.947 | 0.942 | 0.990 |
+ | DeepSA_SmELECTRA | 0.938 | 0.949 | 0.943 | 0.990 |
+ | **ChemFIE-SA** | 0.952 | 0.942 | 0.947 | 0.990 |
+ | DeepSA_ChemMLM | 0.955 | 0.967 | 0.961 | 0.995 |
+ | DeepSA_ChemMTR | 0.968 | 0.974 | 0.971 | 0.997 |
+ 
+ #### TS1-3 Results
+ 
+ Comparison with DeepSA_SmELECTRA as described in Wang et al. (2023):
+ 
+ | Datasets | Model | ACC | Recall | Precision | F-score | AUROC | Threshold |
+ | -------- | ---------- | :---: | :----: | :-------: | :-----: | :---: | :-------: |
+ | TS1 | DeepSA | 0.995 | 1.000 | 0.990 | 0.995 | 1.000 | 0.500 |
+ | | ChemFIE-SA | 0.996 | 1.000 | 0.992 | 0.996 | 1.000 | 0.500 |
+ | TS2 | DeepSA | 0.838 | 0.730 | 0.871 | 0.795 | 0.913 | 0.500 |
+ | | ChemFIE-SA | 0.805 | 0.775 | 0.770 | 0.773 | 0.886 | 0.500 |
+ | TS3 | DeepSA | 0.817 | 0.753 | 0.864 | 0.805 | 0.896 | 0.500 |
+ | | ChemFIE-SA | 0.731 | 0.642 | 0.781 | 0.705 | 0.797 | 0.500 |
 
 
 ## Model Examination
 
 You can visualize its attention heads using [BertViz](https://github.com/jessevig/bertviz) and attribution weights using [Captum](https://captum.ai/), as done in the Interpretability section of [the base model](https://huggingface.co/gbyuvd/chemselfies-base-bertmlm).
 
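A minimal sketch of the BertViz attention-head view for this classifier, assuming a notebook environment and standard auto-class loading:

```python
# Sketch: BertViz head view in a notebook; assumes the checkpoint loads with
# the standard auto-classes and returns attentions when requested.
from bertviz import head_view
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "gbyuvd/synthaccess-chemselfies"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, output_attentions=True)

sentence = "[C] [C] [C] [C] [Branch1] [Branch1] [C] [C] [Ring1] [=Branch1] [Branch1] [#Branch1] [C] [C] [=Branch1] [C] [=O] [O] [C] [N]"
inputs = tokenizer(sentence, return_tensors="pt")
attentions = model(**inputs).attentions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(attentions, tokens)  # renders an interactive view in Jupyter/Colab
```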
 
 
 
 
 
 
- 
- ### Model Architecture and Objective
- 
- [More Information Needed]
- 
 ### Compute Infrastructure
 
 #### Hardware