MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
Abstract
MergeDNA uses a hierarchical architecture with Token Merging and latent Transformers to model genomic sequences, achieving superior performance on DNA benchmarks and multi-omics tasks.
Modeling genomic sequences faces two unsolved challenges: information density varies widely across regions, and there is no clearly defined minimal vocabulary unit. Relying either on the four primitive bases or on independently designed DNA tokenizers, existing approaches with naive masked language modeling pre-training often fail to adapt to the varying complexity of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. Architecturally, the tokenization module automatically chunks adjacent bases into words by stacking multiple differentiable token merging blocks with local-window constraints; a Latent Encoder then captures the global context of these merged words with full-attention blocks. Symmetrically employing a Latent Decoder and a Local Decoder, MergeDNA learns through two pre-training tasks: Merged Token Reconstruction, which jointly trains the dynamic tokenization module and adaptively filters important tokens, and Adaptive Masked Token Modeling, which learns to predict the filtered tokens so as to capture informative content. Extensive experiments show that MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks under fine-tuning or zero-shot evaluation, outperforming typical tokenization methods and large-scale DNA foundation models.
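To make the token-merging idea concrete, here is a minimal NumPy sketch of merging adjacent base embeddings under a local-window constraint. It is a hypothetical simplification: it performs hard (argmax) pair merging by averaging within each window, whereas the paper's merging blocks are differentiable and learned; the function name, window size, and merge count are illustrative assumptions, not the paper's API.

```python
import numpy as np

def local_window_merge(x, window=4, merges_per_window=1):
    """Merge the most similar adjacent token pair inside each local window.

    x: (L, d) array of base/token embeddings.
    Hypothetical hard-merging stand-in for a differentiable merging block:
    within each window, the most cosine-similar adjacent pair is averaged.
    """
    out = []
    for start in range(0, len(x), window):
        win = x[start:start + window]
        for _ in range(merges_per_window):
            if len(win) < 2:
                break
            # cosine similarity between each adjacent token pair
            a, b = win[:-1], win[1:]
            sims = (a * b).sum(-1) / (
                np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
            )
            i = int(np.argmax(sims))  # most redundant adjacent pair
            merged = (win[i] + win[i + 1]) / 2.0
            win = np.concatenate([win[:i], merged[None], win[i + 2:]])
        out.append(win)
    return np.concatenate(out)
```

Stacking several such layers shortens the sequence adaptively: low-complexity stretches (highly similar neighbors) collapse quickly into coarse "words," while information-dense regions keep finer granularity.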
Community
MergeDNA (AAAI'26 Oral Presentation)
To tackle the tokenization problem in genomics, we introduce MergeDNA, a hierarchical genome sequence modeling framework that learns dynamic tokenization through Token Merging, trained with two context-aware pre-training tasks: merged token reconstruction and adaptive masked token prediction.
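The adaptive masking idea can be sketched as follows, under an assumption: per-token reconstruction error from the merged-token-reconstruction task is used as an importance proxy, and the hardest-to-reconstruct tokens are the ones selected for masked prediction. The function name and masking ratio are illustrative, not the paper's exact formulation.

```python
import numpy as np

def adaptive_mask_indices(recon_loss, ratio=0.3):
    """Pick the indices of the tokens to mask.

    recon_loss: (L,) per-token reconstruction error (hypothetical importance
    proxy). Tokens with the highest error are treated as most informative
    and returned for masking in the masked-token-modeling task.
    """
    k = max(1, int(len(recon_loss) * ratio))
    return np.argsort(recon_loss)[-k:]
```

The intended effect is to focus the masked-prediction loss on informative content rather than on easy, low-complexity stretches.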
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Context-level Language Modeling by Learning Predictive Context Embeddings (2025)
- From Characters to Tokens: Dynamic Grouping with Hierarchical BPE (2025)
- Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer (2025)
- A Hierarchical Quantized Tokenization Framework for Task-Adaptive Graph Representation Learning (2025)
- Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement (2025)
- UniGTE: Unified Graph-Text Encoding for Zero-Shot Generalization across Graph Tasks and Domains (2025)
- UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation (2025)