FWeindel committed on
Commit 025163e · verified · 1 Parent(s): 950d152

Upload folder using huggingface_hub

README.md CHANGED
@@ -4,27 +4,47 @@ TReconLM is a decoder-only transformer model for trace reconstruction of noisy D
 
  ## Model Variants
 
- We provide pretrained and fine-tuned model checkpoints for the following ground-truth sequence lengths:
+ ### Pretrained Models (Fixed Length)
 
- - L = 60
- - L = 110
- - L = 180
+ | Model | Sequence Length | Description |
+ |-------|-----------------|-------------|
+ | `model_seq_len_60.pt` | 60nt | Pretrained on synthetic IDS data |
+ | `model_seq_len_110.pt` | 110nt | Pretrained on synthetic IDS data |
+ | `model_seq_len_180.pt` | 180nt | Pretrained on synthetic IDS data |
+
+ ### Pretrained Models (Variable Length)
+
+ | Model | Sequence Length | Description |
+ |-------|-----------------|-------------|
+ | `model_var_len_50_120.pt` | 50-120nt | Pretrained on synthetic IDS data with variable sequence lengths |
+
+ ### Fine-tuned Models
+
+ | Model | Sequence Length | Description |
+ |-------|-----------------|-------------|
+ | `finetuned_noisy_dna_len60.pt` | 60nt | Fine-tuned on Noisy-DNA dataset |
+ | `finetuned_microsoft_dna_len110.pt` | 110nt | Fine-tuned on Microsoft DNA dataset |
+ | `finetuned_chandak_len117.pt` | 117nt | Fine-tuned on Chandak dataset |
 
  Each model supports reconstruction from cluster sizes between 2 and 10.
 
  ## How to Use
 
- A Colab notebook is available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `trace_reconstruction.ipynb`, which demonstrates how to load the model and run inference on our benchmark datasets. The test datasets used in the notebook can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
+ Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:
+
+ - `quick_start.ipynb`: Run inference on synthetic datasets from HuggingFace
+ - `custom_data.ipynb`: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA)
 
+ The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
 
  ## Training Details
 
- - Models are pretrained on synthetic data generated by sampling ground-truth sequences of length L uniformly at random over the quaternary alphabet, and independently introducing insertions, deletions, and substitutions at each position.
+ - Models are pretrained on synthetic data generated by sampling ground-truth sequences uniformly at random over the quaternary alphabet, and independently introducing insertions, deletions, and substitutions at each position.
  - Error probabilities for insertions, deletions, and substitutions are drawn uniformly from the interval [0.01, 0.1], and cluster sizes are sampled uniformly from [2, 10].
- - Models are fine-tuned on real-world sequencing data (Noisy-DNA and Microsoft datasets).
+ - Models are fine-tuned on real-world sequencing data (Noisy-DNA, Microsoft, and Chandak datasets).
 
  For full experimental details, see [our paper](http://arxiv.org/abs/2507.12927).
 
  ## Limitations
 
- Models are trained for fixed sequence lengths and may perform worse on other lengths or if the test data distribution differs significantly from the training data.
+ Models trained for fixed sequence lengths may perform worse on other lengths or if the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as the fixed-length models, so it sees less data per sequence length and performs slightly worse for a specific fixed length.
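The synthetic pretraining data described under "Training Details" in the README diff can be illustrated with a minimal sketch. It is not the authors' generator. The function names, the choice of at most one error event per position, placing an inserted base before the original base, and drawing the three error probabilities once per cluster are all assumptions made for illustration. Only the quaternary alphabet, the [0.01, 0.1] probability range, and the [2, 10] cluster-size range come from the README.

```python
import random

ALPHABET = "ACGT"  # quaternary alphabet named in the README


def sample_ground_truth(length, rng):
    """Draw a ground-truth sequence uniformly at random over ACGT."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def ids_channel(seq, p_ins, p_del, p_sub, rng):
    """Apply insertion/deletion/substitution noise independently per position.

    Assumption: each position undergoes at most one event, and an
    inserted base is emitted before the original base.
    """
    out = []
    for base in seq:
        r = rng.random()
        if r < p_ins:
            out.append(rng.choice(ALPHABET))  # inserted base
            out.append(base)                  # original base is kept
        elif r < p_ins + p_del:
            continue                          # deletion: emit nothing
        elif r < p_ins + p_del + p_sub:
            # substitution: replace with a different base
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)                  # no error at this position
    return "".join(out)


def sample_cluster(length, rng):
    """Sample one training example: a ground truth and its noisy traces."""
    # Assumption: error probabilities are drawn once per cluster.
    p_ins, p_del, p_sub = (rng.uniform(0.01, 0.1) for _ in range(3))
    cluster_size = rng.randint(2, 10)  # inclusive on both ends
    truth = sample_ground_truth(length, rng)
    traces = [ids_channel(truth, p_ins, p_del, p_sub, rng)
              for _ in range(cluster_size)]
    return truth, traces


if __name__ == "__main__":
    truth, traces = sample_cluster(60, random.Random(0))
    print(truth)
    for trace in traces:
        print(trace)
```

For a variable-length setting such as the 50-120nt checkpoint, one could additionally draw `length` at random before calling `sample_cluster`. How the authors sample lengths is not stated in this diff.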
finetuned_chandak_len117.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d613df4ec60c949c9cb4339615b2514ed4a5ddea03f5155c00d0996a76c782d9
+ size 462508378
finetuned_microsoft_dna_len110.pt CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:2e06dee508c4828308560d4c8b88ccbe81d21e71d033479eb2582ca31143666d
- size 462508570
+ oid sha256:4ad0828906d19fef925e578896a1c6fd5c0b8a5ce4b6783347a6b0c485c7915e
+ size 134
finetuned_noisy_dna_len60.pt CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:fe28e1e3aa359b775c68e10b0469e42722cfda2d6e8c82b3976806f5b54533d3
- size 458207706
+ oid sha256:902db2e9fe60b6d6d1d7e85a084f20ace4da996b04f58e4adb0906febbe2df7b
+ size 134
model_seq_len_110.pt CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:140ae26c126a74e936d6e247156ba7f3b639c710ddb5655a6c6c0b47e76f213a
- size 462507994
+ oid sha256:42f3d744949f147d2e28209b3d3692c79dfab6617e90c85b833a6d3d4f0f357c
+ size 134
model_seq_len_180.pt CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:118c5ccbc5870667195dfb330d33d5c8a65f4f08bc9709ddc9879b0723b5c584
- size 468037594
+ oid sha256:8d1ba6a598303f19d4fced383c642a1342367481681013435ae460a862ced81f
+ size 134
model_seq_len_60.pt CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a8a9d93f76963a998d47b54c3b5b9574a4e098448482c298a8f0079dbb5bde0a
- size 458207194
+ oid sha256:4280f0cf80e39664fe568aa92f4dc22590f317d2257a4be3064e9dc2d4dfc15c
+ size 134
model_var_len_50_120.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e41876248325048edfa0258ad5683d4863b86883902480bec3e3675b775e1039
+ size 468037786
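Each `.pt` entry in this commit is a Git LFS pointer with three lines: `version`, `oid`, and `size`. The sketch below parses such a pointer and rebuilds its text. The helper names `parse_lfs_pointer` and `pointer_text` are hypothetical and not part of any Git LFS tooling. The example values are copied from the `finetuned_chandak_len117.pt` entry above.

```python
LFS_VERSION = "https://git-lfs.github.com/spec/v1"


def pointer_text(digest, size):
    """Build the text of a Git LFS pointer file (hypothetical helper)."""
    return f"version {LFS_VERSION}\noid sha256:{digest}\nsize {size}\n"


def parse_lfs_pointer(text):
    """Parse the three 'key value' lines of a pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the tracked object in bytes
    }


# Values taken from the finetuned_chandak_len117.pt pointer in this commit.
CHANDAK_DIGEST = "d613df4ec60c949c9cb4339615b2514ed4a5ddea03f5155c00d0996a76c782d9"
CHANDAK_SIZE = 462508378

if __name__ == "__main__":
    text = pointer_text(CHANDAK_DIGEST, CHANDAK_SIZE)
    print(parse_lfs_pointer(text))
    print(len(text.encode("utf-8")))  # byte length of the pointer file itself
```

A pointer whose `size` field has nine digits is itself 134 bytes long, which coincides with the `size 134` recorded for the five CHANGED checkpoints. The diff does not confirm whether those entries are pointer files stored in place of weights, so this is worth checking against the repository rather than assuming.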