Upload folder using huggingface_hub
- README.md +28 -8
- finetuned_chandak_len117.pt +3 -0
- finetuned_microsoft_dna_len110.pt +2 -2
- finetuned_noisy_dna_len60.pt +2 -2
- model_seq_len_110.pt +2 -2
- model_seq_len_180.pt +2 -2
- model_seq_len_60.pt +2 -2
- model_var_len_50_120.pt +3 -0
README.md
CHANGED
@@ -4,27 +4,47 @@ TReconLM is a decoder-only transformer model for trace reconstruction of noisy D
 
 ## Model Variants
 
-
+### Pretrained Models (Fixed Length)
 
-
-
-
+| Model | Sequence Length | Description |
+|-------|-----------------|-------------|
+| `model_seq_len_60.pt` | 60nt | Pretrained on synthetic IDS data |
+| `model_seq_len_110.pt` | 110nt | Pretrained on synthetic IDS data |
+| `model_seq_len_180.pt` | 180nt | Pretrained on synthetic IDS data |
+
+### Pretrained Models (Variable Length)
+
+| Model | Sequence Length | Description |
+|-------|-----------------|-------------|
+| `model_var_len_50_120.pt` | 50-120nt | Pretrained on synthetic IDS data with variable sequence lengths |
+
+### Fine-tuned Models
+
+| Model | Sequence Length | Description |
+|-------|-----------------|-------------|
+| `finetuned_noisy_dna_len60.pt` | 60nt | Fine-tuned on Noisy-DNA dataset |
+| `finetuned_microsoft_dna_len110.pt` | 110nt | Fine-tuned on Microsoft DNA dataset |
+| `finetuned_chandak_len117.pt` | 117nt | Fine-tuned on Chandak dataset |
 
 Each model supports reconstruction from cluster sizes between 2 and 10.
 
 ## How to Use
 
-
+Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:
+
+- `quick_start.ipynb`: Run inference on synthetic datasets from HuggingFace
+- `custom_data.ipynb`: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA)
 
+The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
 
 ## Training Details
 
-- Models are pretrained on synthetic data generated by sampling ground-truth sequences
+- Models are pretrained on synthetic data generated by sampling ground-truth sequences uniformly at random over the quaternary alphabet, and independently introducing insertions, deletions, and substitutions at each position.
 - Error probabilities for insertions, deletions, and substitutions are drawn uniformly from the interval [0.01, 0.1], and cluster sizes are sampled uniformly from [2, 10].
-- Models are fine-tuned on real-world sequencing data (Noisy-DNA and
+- Models are fine-tuned on real-world sequencing data (Noisy-DNA, Microsoft, and Chandak datasets).
 
 For full experimental details, see [our paper](http://arxiv.org/abs/2507.12927).
 
 ## Limitations
 
-Models
+Models trained for fixed sequence lengths may perform worse on other lengths or if the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as the fixed-length models, so it sees less data per sequence length and performs slightly worse for a specific fixed length.
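The synthetic pretraining data described under "Training Details" in the README can be illustrated with a small simulator. This is only a sketch under stated assumptions, not the authors' generator: the function names are hypothetical, one set of error rates is drawn per cluster, an insertion places a random base before the current position, and a substitution always picks a different base. The README does not specify any of these details.

```python
import random

ALPHABET = "ACGT"  # quaternary alphabet


def sample_sequence(length, rng):
    """Draw a ground-truth sequence uniformly at random over ACGT."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def corrupt(seq, p_ins, p_del, p_sub, rng):
    """Apply insertions, deletions, and substitutions independently per position."""
    out = []
    for base in seq:
        # Assumption: an insertion adds one random base before the current one.
        if rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))
        r = rng.random()
        if r < p_del:
            continue  # deletion: drop the current base
        if r < p_del + p_sub:
            # Assumption: a substitution always changes the base.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def sample_cluster(length, rng):
    """Return (ground truth, list of noisy traces) for one training example."""
    truth = sample_sequence(length, rng)
    # Error rates drawn uniformly from [0.01, 0.1], as stated in the README.
    p_ins, p_del, p_sub = (rng.uniform(0.01, 0.1) for _ in range(3))
    # Cluster size drawn uniformly from {2, ..., 10}.
    size = rng.randint(2, 10)
    traces = [corrupt(truth, p_ins, p_del, p_sub, rng) for _ in range(size)]
    return truth, traces
```

Calling `sample_cluster(110, random.Random(0))` would yield one 110nt ground-truth sequence together with between 2 and 10 noisy traces of varying length.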
finetuned_chandak_len117.pt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d613df4ec60c949c9cb4339615b2514ed4a5ddea03f5155c00d0996a76c782d9
+size 462508378
finetuned_microsoft_dna_len110.pt
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:4ad0828906d19fef925e578896a1c6fd5c0b8a5ce4b6783347a6b0c485c7915e
+size 134
finetuned_noisy_dna_len60.pt
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:902db2e9fe60b6d6d1d7e85a084f20ace4da996b04f58e4adb0906febbe2df7b
+size 134
model_seq_len_110.pt
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:42f3d744949f147d2e28209b3d3692c79dfab6617e90c85b833a6d3d4f0f357c
+size 134
model_seq_len_180.pt
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:8d1ba6a598303f19d4fced383c642a1342367481681013435ae460a862ced81f
+size 134
model_seq_len_60.pt
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:4280f0cf80e39664fe568aa92f4dc22590f317d2257a4be3064e9dc2d4dfc15c
+size 134
model_var_len_50_120.pt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e41876248325048edfa0258ad5683d4863b86883902480bec3e3675b775e1039
+size 468037786
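Each `.pt` entry in this commit is a Git LFS pointer file, not the checkpoint itself: three `key value` lines giving the spec version, the object hash, and the object size in bytes. As a minimal sketch (the helper name `parse_lfs_pointer` is hypothetical and not part of any library), such a pointer could be parsed like this to check a downloaded checkpoint's size or hash:

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # object size in bytes
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:d613df4ec60c949c9cb4339615b2514ed4a5ddea03f5155c00d0996a76c782d9
size 462508378
"""
info = parse_lfs_pointer(pointer)
```

Here `info["size"]` holds the byte count recorded for `finetuned_chandak_len117.pt`, and `info["digest"]` can be compared with a SHA-256 hash computed over the downloaded file.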