Upload folder using huggingface_hub

Files changed (8)

README.md +28 -8
finetuned_chandak_len117.pt +3 -0
finetuned_microsoft_dna_len110.pt +2 -2
finetuned_noisy_dna_len60.pt +2 -2
model_seq_len_110.pt +2 -2
model_seq_len_180.pt +2 -2
model_seq_len_60.pt +2 -2
model_var_len_50_120.pt +3 -0

README.md CHANGED

@@ -4,27 +4,47 @@ TReconLM is a decoder-only transformer model for trace reconstruction of noisy D
 ## Model Variants
-We provide pretrained and fine-tuned model checkpoints for the following ground-truth sequence lengths:
-- L = 60
-- L = 110
-- L = 180
 Each model supports reconstruction from cluster sizes between 2 and 10.
 ## How to Use
-A Colab notebook is available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `trace_reconstruction.ipynb`, which demonstrates how to load the model and run inference on our benchmark datasets. The test datasets used in the notebook can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
 ## Training Details
-- Models are pretrained on synthetic data generated by sampling ground-truth sequences of length L uniformly at random over the quaternary alphabet, and independently introducing insertions, deletions, and substitutions at each position.
 - Error probabilities for insertions, deletions, and substitutions are drawn uniformly from the interval [0.01, 0.1], and cluster sizes are sampled uniformly from [2, 10].
-- Models are fine-tuned on real-world sequencing data (Noisy-DNA and Microsoft datasets).
 For full experimental details, see [our paper](http://arxiv.org/abs/2507.12927).
 ## Limitations
-Models are trained for fixed sequence lengths and may perform worse on other lengths or if the test data distribution differs significantly from the training data.

 ## Model Variants
+### Pretrained Models (Fixed Length)
+| Model | Sequence Length | Description |
+|-------|-----------------|-------------|
+| `model_seq_len_60.pt` | 60nt | Pretrained on synthetic IDS data |
+| `model_seq_len_110.pt` | 110nt | Pretrained on synthetic IDS data |
+| `model_seq_len_180.pt` | 180nt | Pretrained on synthetic IDS data |
+### Pretrained Models (Variable Length)
+| Model | Sequence Length | Description |
+|-------|-----------------|-------------|
+| `model_var_len_50_120.pt` | 50-120nt | Pretrained on synthetic IDS data with variable sequence lengths |
+### Fine-tuned Models
+| Model | Sequence Length | Description |
+|-------|-----------------|-------------|
+| `finetuned_noisy_dna_len60.pt` | 60nt | Fine-tuned on Noisy-DNA dataset |
+| `finetuned_microsoft_dna_len110.pt` | 110nt | Fine-tuned on Microsoft DNA dataset |
+| `finetuned_chandak_len117.pt` | 117nt | Fine-tuned on Chandak dataset |
 Each model supports reconstruction from cluster sizes between 2 and 10.
 ## How to Use
+Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:
+- `quick_start.ipynb`: Run inference on synthetic datasets from Hugging Face
+- `custom_data.ipynb`: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA)
+The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
 ## Training Details
+- Models are pretrained on synthetic data generated by sampling ground-truth sequences uniformly at random over the quaternary alphabet, and independently introducing insertions, deletions, and substitutions at each position.
 - Error probabilities for insertions, deletions, and substitutions are drawn uniformly from the interval [0.01, 0.1], and cluster sizes are sampled uniformly from [2, 10].
+- Models are fine-tuned on real-world sequencing data (Noisy-DNA, Microsoft, and Chandak datasets).
 For full experimental details, see [our paper](http://arxiv.org/abs/2507.12927).
 ## Limitations
+Models trained for a fixed sequence length may perform worse on other lengths or when the test data distribution differs significantly from the training data. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as the fixed-length models, so it sees less data per sequence length and performs slightly worse at any single length than a fixed-length model trained for that length.
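
The synthetic pretraining data described under Training Details in the README hunk above can be sketched roughly as follows. This is a minimal illustration, not the authors' generator: the function names (`sample_sequence`, `ids_channel`, `sample_cluster`) are hypothetical, and several details are assumptions the README does not specify, namely that an insertion is placed before a base, that deletion and substitution are mutually exclusive at a position, and that one set of error probabilities is drawn per cluster.

```python
import random

ALPHABET = "ACGT"  # quaternary DNA alphabet


def sample_sequence(length, rng):
    """Draw a ground-truth sequence uniformly at random over the alphabet."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def ids_channel(seq, p_ins, p_del, p_sub, rng):
    """Pass `seq` through an insertion/deletion/substitution channel.

    Assumed ordering (not taken from the README): at each position an
    insertion may occur before the base, then the base is deleted,
    substituted, or copied unchanged.
    """
    out = []
    for base in seq:
        if rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))  # inserted random base
        r = rng.random()
        if r < p_del:
            continue  # base deleted
        if r < p_del + p_sub:
            # substitute with one of the three other bases
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)  # base transmitted correctly
    return "".join(out)


def sample_cluster(length, rng):
    """Sample one training example: a ground-truth sequence and its noisy traces."""
    seq = sample_sequence(length, rng)
    # Error probabilities drawn uniformly from [0.01, 0.1], as stated in the README.
    p_ins, p_del, p_sub = (rng.uniform(0.01, 0.1) for _ in range(3))
    # Cluster size drawn uniformly from {2, ..., 10}.
    cluster_size = rng.randint(2, 10)
    traces = [ids_channel(seq, p_ins, p_del, p_sub, rng) for _ in range(cluster_size)]
    return seq, traces
```

Calling `sample_cluster(60, random.Random(0))` would produce one length-60 ground-truth sequence together with between 2 and 10 noisy traces of it. For a variable-length setup such as the 50-120 nt model, `length` could itself be sampled per example, although the README does not say how lengths were drawn.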

finetuned_chandak_len117.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d613df4ec60c949c9cb4339615b2514ed4a5ddea03f5155c00d0996a76c782d9
+size 462508378

finetuned_microsoft_dna_len110.pt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2e06dee508c4828308560d4c8b88ccbe81d21e71d033479eb2582ca31143666d
-size 462508570

 version https://git-lfs.github.com/spec/v1
+oid sha256:4ad0828906d19fef925e578896a1c6fd5c0b8a5ce4b6783347a6b0c485c7915e
+size 134

finetuned_noisy_dna_len60.pt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fe28e1e3aa359b775c68e10b0469e42722cfda2d6e8c82b3976806f5b54533d3
-size 458207706

 version https://git-lfs.github.com/spec/v1
+oid sha256:902db2e9fe60b6d6d1d7e85a084f20ace4da996b04f58e4adb0906febbe2df7b
+size 134

model_seq_len_110.pt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:140ae26c126a74e936d6e247156ba7f3b639c710ddb5655a6c6c0b47e76f213a
-size 462507994

 version https://git-lfs.github.com/spec/v1
+oid sha256:42f3d744949f147d2e28209b3d3692c79dfab6617e90c85b833a6d3d4f0f357c
+size 134

model_seq_len_180.pt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:118c5ccbc5870667195dfb330d33d5c8a65f4f08bc9709ddc9879b0723b5c584
-size 468037594

 version https://git-lfs.github.com/spec/v1
+oid sha256:8d1ba6a598303f19d4fced383c642a1342367481681013435ae460a862ced81f
+size 134

model_seq_len_60.pt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a8a9d93f76963a998d47b54c3b5b9574a4e098448482c298a8f0079dbb5bde0a
-size 458207194

 version https://git-lfs.github.com/spec/v1
+oid sha256:4280f0cf80e39664fe568aa92f4dc22590f317d2257a4be3064e9dc2d4dfc15c
+size 134

model_var_len_50_120.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e41876248325048edfa0258ad5683d4863b86883902480bec3e3675b775e1039
+size 468037786
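
Each `.pt` entry above is a Git LFS pointer (a `version`, an `oid sha256:...`, and a `size` line) rather than the checkpoint itself. A downloaded checkpoint can be checked against its pointer with a short script like the one below. This is a sketch using only the Python standard library; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and are not part of `huggingface_hub` or Git LFS.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,          # e.g. "sha256"
        "digest": digest,      # hex digest of the real file
        "size": int(fields["size"]),  # size of the real file in bytes
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and hash the pointer records."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first, avoids hashing a wrong-sized file
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Hash in chunks so large checkpoints are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]
```

The size field alone is a useful sanity check. The two added checkpoints record roughly 460 MB each, while the five changed entries now record 134 bytes, which is the length of a three-line LFS pointer file with a nine-digit size. The commit does not say whether that is intended.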