Upload README.md with huggingface_hub

README.md (CHANGED)
@@ -15,6 +15,7 @@ tags: []
- [Training](#training)
- [Evaluation](#evaluation)
- [Results](#results)
- [Experimental validation](#experimental-validation)
- [Cite](#cite)

## Introduction
@@ -238,8 +239,8 @@
1. **Clone the repository:**
   ```bash
   git clone https://github.com/Gleghorn-Lab/DSM.git
   cd DSM
   ```

2. **Initialize the submodules:**
@@ -275,6 +276,16 @@
deactivate
```

All together:

```bash
git clone https://github.com/Gleghorn-Lab/DSM.git
cd DSM
git submodule update --init --remote --recursive
chmod +x setup_bioenv.sh
./setup_bioenv.sh
source ~/bioenv/bin/activate
```

## Training

The primary script for training models is `training/train_dsm.py`. This script further pretrains an ESM2 checkpoint using the DSM objective (masked diffusion based on LLaDA) on a large protein sequence dataset like [OMG-prot50](https://huggingface.co/datasets/Synthyra/omg_prot50).
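The training script's options aren't listed in this excerpt, but the LLaDA-style masked-diffusion objective it optimizes can be sketched in a few lines. This is a toy NumPy illustration under stated assumptions, not the repo's implementation: `logits_fn` stands in for the model and `mask_id` for its mask token.

```python
import numpy as np

def dsm_loss(token_ids, logits_fn, mask_id, rng):
    """Sketch of one LLaDA-style masked-diffusion training step.

    Sample a mask rate t ~ U(0, 1], replace each token with the mask id
    independently with probability t, then score the model's reconstruction
    of the masked positions, reweighted by 1/t as in LLaDA.
    """
    n = len(token_ids)
    t = rng.uniform(1e-3, 1.0)                 # diffusion "time" == mask rate
    mask = rng.random(n) < t                   # positions to corrupt
    corrupted = np.where(mask, mask_id, token_ids)
    logits = logits_fn(corrupted)              # (n, vocab) scores from the model
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(n), token_ids]       # per-token cross-entropy
    return (nll * mask).sum() / (t * n)        # only masked positions contribute

# Toy usage: a "model" that always outputs uniform logits over a 33-token vocab.
rng = np.random.default_rng(0)
ids = rng.integers(0, 33, size=64)
loss = dsm_loss(ids, lambda x: np.zeros((len(x), 33)), mask_id=32, rng=rng)
```

A real run batches sequences and backpropagates through the model; the 1/t reweighting keeps lightly-masked and heavily-masked samples on the same scale.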
@@ -430,6 +441,57 @@
These results highlight DSM's capability to unify high-quality protein representation learning and biologically coherent generative modeling within a single framework.

## Experimental validation

We validated various DSM-generated binders using biolayer interferometry through [Adaptyv Bio](https://www.adaptyvbio.com/), sending in 20 designs for EGFR and PD-L1. Sequences generated via unconditional generation were sorted hierarchically by predicted binding affinity (Synteract2), then ESMFold pLDDT, then ESM2 PLL, and finally Chai-1 ipTM.
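One way to read "sorted hierarchically" is a lexicographic sort on the four metrics, where later keys only break ties in earlier ones. A sketch under that assumption (the field names are made up here, not from the repo, and all four metrics are treated as higher-is-better):

```python
from dataclasses import dataclass

@dataclass
class Design:
    seq: str
    ppi_prob: float   # Synteract2 predicted binding (primary key)
    plddt: float      # ESMFold pLDDT
    pll: float        # ESM2 pseudo log-likelihood
    iptm: float       # Chai-1 ipTM

def rank_designs(designs):
    # Tuple keys sort lexicographically: pLDDT only matters when
    # ppi_prob ties, PLL only when pLDDT also ties, and so on.
    return sorted(designs,
                  key=lambda d: (d.ppi_prob, d.plddt, d.pll, d.iptm),
                  reverse=True)

best = rank_designs([
    Design("AAA", 0.99, 0.70, 1.1, 0.80),
    Design("CCC", 0.99, 0.75, 1.0, 0.90),   # wins the pLDDT tie-break
    Design("DDD", 0.95, 0.90, 2.0, 0.95),
])[0]
print(best.seq)  # -> CCC
```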
### EGFR

Of the 13 designs that expressed, 12 bound to EGFR, 11 of them strongly. Notably, the top design, `dsm_egfr_10`, presented a mean KD in the picomolar range (861 pM), a ~30% improvement in binding affinity over the winner (and our starting template) of the Adaptyv EGFR competition at 1.21 nM, and ~90% over the original starting scFv, Cetuximab, at 664 nM.

<img width="1593" height="527" alt="image" src="https://github.com/user-attachments/assets/f6f6a614-d5f4-4e9e-b7c9-fc0fdf9dc1e2" />

**dsm_egfr_10**
```
QVQLQQSGPGLVQPSQSLSITCTVSGFSLTNYGVHWVRQSPGKGLEWLGVIWSGGNTDYNTPFTSRLSISRDTSKSQVFFKMNSLQTDDTAVYYCARALTYYDYEFAYWGQGTLVTVSAGGGGSGGGGSGGGGSDILLTQSPVILSVSPGERVSFSCRASQSIGSNIHWYQQRTNGSPKLLIRYASESISGIPSRFSGSGSGTDFTLSINSVDPEDIADYYCQQNNNWPTTFGAGTKLEIK
```
* KD - 861 pM
* pKD - 9.06
* PPI probability (Synteract2) - 0.9991
* predicted pKD (Synteract2) - 8.94
* mask rate for DSM - 2%
* mutations from template - 3
* mutations - I92V, T165S, L240I
* average ESM2 PLL - 1.23
* ESMFold pLDDT - 0.75
* pTM (AlphaFold3) - 0.81
* ipTM (AlphaFold3) - 0.91
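The KD and pKD entries above are two views of the same measurement: pKD is the negative base-10 logarithm of KD expressed in molar units. A quick check:

```python
import math

def pkd(kd_molar: float) -> float:
    """pKD = -log10(KD), with KD in molar units."""
    return -math.log10(kd_molar)

print(pkd(861e-12))  # ~9.065, consistent with the reported pKD of 9.06
```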
<table>
<tr>
<td>
<img width="596" height="355" alt="image" src="https://github.com/user-attachments/assets/0c6f690d-3134-44d2-9540-12c277e187b3" />
</td>
<td>
<img src="https://github.com/Gleghorn-Lab/DSM/blob/main/wetlab_result_analysis/egfr/kinetics/dsm_egfr_10_2.png" width="400">
</td>
</tr>
</table>

This sequence would have won the EGFR competition by a wide margin! Of course, we piggybacked on the winning entry as our template. We have been using DSM-PPI-full to attempt to replicate competition-winning binders from the Cetuximab starting point instead. Stay tuned!

<img width="1865" height="548" alt="image" src="https://github.com/user-attachments/assets/ce486326-f4ba-4604-af47-259f7bbe496f" />
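The mutation strings above (I92V, T165S, L240I) follow the usual convention of template residue, 1-based position, new residue. A hypothetical helper for applying them to a template sequence (for illustration only, not code from this repo):

```python
def apply_mutations(seq: str, mutations: list[str]) -> str:
    """Apply point mutations written as '<wt><1-based position><new>', e.g. 'I92V'."""
    residues = list(seq)
    for m in mutations:
        wt, pos, new = m[0], int(m[1:-1]) - 1, m[-1]
        if residues[pos] != wt:
            raise ValueError(f"expected {wt} at position {pos + 1}, found {residues[pos]}")
        residues[pos] = new
    return "".join(residues)

# Toy example on a short made-up sequence:
print(apply_mutations("ACDEFG", ["D3Y", "F5W"]))  # -> ACYEWG
```

The wild-type check guards against off-by-one errors, the most common failure mode when mixing 0- and 1-based residue numbering.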
### PD-L1

All 20 PD-L1 designs expressed well, and 15/20 bound: 1 weakly, 10 moderately, and 3 strongly. The strongest presented an average KD of 8.06 nM, markedly weaker than the original template at 0.8 pM. We attribute the consistent binding but overall worse performance to the higher error between the Synteract2 predicted pKD and the true pKD of the template, implying the template is not modeled well by our affinity system.

<img width="1592" height="516" alt="image" src="https://github.com/user-attachments/assets/6d2cde0e-75a4-4f29-999f-a8c601286845" />

<img src="https://github.com/Gleghorn-Lab/DSM/blob/main/wetlab_result_analysis/pdl1/kinetics/dsm_pdl1_7_1.png" width="400">
## Cite
```
@misc{hallee2025diffusionsequencemodelsenhanced,
|