lhallee committed
Commit fbaff3f · verified · 1 Parent(s): a4c7998

Upload README.md with huggingface_hub

Files changed (1): README.md (+64 -2)
README.md CHANGED
@@ -15,6 +15,7 @@ tags: []
  - [Training](#training)
  - [Evaluation](#evaluation)
  - [Results](#results)
+ - [Experimental validation](#experimental-validation)
  - [Cite](#cite)

  ## Introduction
@@ -238,8 +239,8 @@ Difference is statistically significant (p < 0.05)

  1. **Clone the repository:**
  ```bash
- git clone <repository-url>
- cd <repository-name>
+ git clone https://github.com/Gleghorn-Lab/DSM.git
+ cd DSM
  ```

  2. **Initialize the submodules:**
@@ -275,6 +276,16 @@ Difference is statistically significant (p < 0.05)
  deactivate
  ```

+ All together:
+ ```bash
+ git clone https://github.com/Gleghorn-Lab/DSM.git
+ cd DSM
+ git submodule update --init --remote --recursive
+ chmod +x setup_bioenv.sh
+ ./setup_bioenv.sh
+ source ~/bioenv/bin/activate
+ ```
+
  ## Training

  The primary script for training models is `training/train_dsm.py`. This script further pretrains an ESM2 checkpoint using the DSM objective (masked diffusion based on LLaDA) on a large protein sequence dataset like [OMG-prot50](https://huggingface.co/datasets/Synthyra/omg_prot50).
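For intuition, a LLaDA-style masked-diffusion objective samples a masking rate per sequence, corrupts that fraction of residues to a mask token, and scores cross-entropy only on the corrupted positions, reweighted by the masking rate. The sketch below is a minimal PyTorch illustration under assumed names (`model`, `mask_token_id`, and `pad_token_id` are placeholders, not the repo's actual API), not the exact code in `training/train_dsm.py`:

```python
import torch
import torch.nn.functional as F

def dsm_training_step(model, tokens, mask_token_id, pad_token_id):
    """One illustrative LLaDA-style masked-diffusion step.

    tokens: (batch, seq_len) integer tensor of residue ids.
    `model` is assumed to map token ids to (batch, seq_len, vocab) logits.
    """
    batch, seq_len = tokens.shape
    # Sample a per-sequence masking rate t ~ U(0, 1), clamped away from 0
    # so the 1/t weighting below stays finite (the diffusion "time").
    t = torch.rand(batch, 1, device=tokens.device).clamp(min=1e-3)
    # Corrupt each non-pad position independently with probability t.
    mask = (torch.rand(batch, seq_len, device=tokens.device) < t) & (tokens != pad_token_id)
    noised = torch.where(mask, torch.full_like(tokens, mask_token_id), tokens)
    logits = model(noised)
    # Cross-entropy on masked positions only, weighted by 1/t as in LLaDA;
    # normalization by the masked-token count is a simplification here.
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    loss = (ce * mask.float() / t).sum() / mask.float().sum().clamp(min=1.0)
    return loss
```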
@@ -430,6 +441,57 @@ DSM demonstrates strong performance in both protein sequence generation and representation learning

  These results highlight DSM's capability to unify high-quality protein representation learning and biologically coherent generative modeling within a single framework.

 
+ ## Experimental validation
+
+ We validated various DSM-generated binders using biolayer interferometry through [Adaptyv Bio](https://www.adaptyvbio.com/), sending in 20 designs each for EGFR and PD-L1. Sequences produced by unconditional generation were sorted hierarchically by predicted binding affinity (Synteract2), then ESMFold pLDDT, then ESM2 PLL, and finally Chai-1 ipTM.
+
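One way to read "sorted hierarchically" is tiered ranking: order candidates by the first score, keep the best slice, then re-rank the survivors by the next score, and so on. A minimal sketch under assumed names (`pred_pkd`, `plddt`, `pll`, `iptm`, and the 50% cut are placeholders, not the pipeline's actual fields or thresholds):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    sequence: str
    pred_pkd: float  # Synteract2 predicted binding affinity
    plddt: float     # ESMFold pLDDT
    pll: float       # ESM2 pseudo log-likelihood
    iptm: float      # Chai-1 ipTM

def hierarchical_rank(cands: list[Candidate], keep_frac: float = 0.5) -> list[Candidate]:
    # Apply each criterion in priority order: sort descending (higher is
    # better for all four scores), then keep the top fraction before the
    # next criterion is applied.
    pool = list(cands)
    for key in ("pred_pkd", "plddt", "pll", "iptm"):
        pool.sort(key=lambda c: getattr(c, key), reverse=True)
        pool = pool[: max(1, int(len(pool) * keep_frac))]
    return pool
```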
+ ### EGFR
+
+ Of the 13 designs that expressed, 12 bound EGFR, 11 of them strongly. Notably, the top design, `dsm_egfr_10`, presented a mean KD in the picomolar range (861 pM), a ~30% gain in binding affinity over the winner of the Adaptyv EGFR competition (and our starting template) at 1.21 nM, and a ~90% gain over the original starting scFv, Cetuximab, at 664 nM.
+
+ <img width="1593" height="527" alt="image" src="https://github.com/user-attachments/assets/f6f6a614-d5f4-4e9e-b7c9-fc0fdf9dc1e2" />
+
+ **dsm_egfr_10**
+ ```
+ QVQLQQSGPGLVQPSQSLSITCTVSGFSLTNYGVHWVRQSPGKGLEWLGVIWSGGNTDYNTPFTSRLSISRDTSKSQVFFKMNSLQTDDTAVYYCARALTYYDYEFAYWGQGTLVTVSAGGGGSGGGGSGGGGSDILLTQSPVILSVSPGERVSFSCRASQSIGSNIHWYQQRTNGSPKLLIRYASESISGIPSRFSGSGSGTDFTLSINSVDPEDIADYYCQQNNNWPTTFGAGTKLEIK
+ ```
+ * KD - 861 pM
+ * pKD - 9.06
+ * PPI probability (Synteract2) - 0.9991
+ * predicted pKD (Synteract2) - 8.94
+ * mask rate for DSM - 2%
+ * mutations from template - 3
+ * mutations - I92V, T165S, L240I
+ * average ESM2 PLL - 1.23
+ * ESMFold pLDDT - 0.75
+ * pTM (AlphaFold3) - 0.81
+ * ipTM (AlphaFold3) - 0.91
+
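The KD and pKD entries above are two views of the same measurement, related by the standard definition of pKD as the negative log of the molar KD:

```latex
\mathrm{p}K_D = -\log_{10}\!\big(K_D\,[\mathrm{M}]\big),
\qquad -\log_{10}\!\left(861 \times 10^{-12}\right) \approx 9.06
```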
+ <table>
+ <tr>
+ <td>
+ <img width="596" height="355" alt="image" src="https://github.com/user-attachments/assets/0c6f690d-3134-44d2-9540-12c277e187b3" />
+ </td>
+ <td>
+ <img src="https://github.com/Gleghorn-Lab/DSM/blob/main/wetlab_result_analysis/egfr/kinetics/dsm_egfr_10_2.png" width="400">
+ </td>
+ </tr>
+ </table>
+
+ This sequence would have won the EGFR competition by a wide margin! Of course, we piggybacked on the winning entry as our template. We have since been using DSM-PPI-full to attempt to replicate competition-winning binders from the Cetuximab starting point instead. Stay tuned!
+
+ <img width="1865" height="548" alt="image" src="https://github.com/user-attachments/assets/ce486326-f4ba-4604-af47-259f7bbe496f" />
+
+ ### PD-L1
+
+ All 20 PD-L1 designs had high expression rates, and 15/20 bound: 1 weakly, 10 moderately, and 3 strongly. The strongest presented an average KD of 8.06 nM (pKD 8.09), markedly weaker than the original template at 0.8 pM. We attribute the consistent binding but weaker overall performance to the higher error between Synteract2's predicted pKD and the template's true pKD, implying this target is not modeled well by our affinity system.
+
+ <img width="1592" height="516" alt="image" src="https://github.com/user-attachments/assets/6d2cde0e-75a4-4f29-999f-a8c601286845" />
+
+ <img src="https://github.com/Gleghorn-Lab/DSM/blob/main/wetlab_result_analysis/pdl1/kinetics/dsm_pdl1_7_1.png" width="400">
+
  ## Cite
  ```
  @misc{hallee2025diffusionsequencemodelsenhanced,