DNA To Proteins Translator

A GPT-2 model fine-tuned to translate DNA into protein sequences, trained on a large cross-species GenBank dataset.


Model Architecture

  • Base model: GPT-2
  • Approach: DNA-to-protein translation
  • Size: ~0.1B parameters (F32 safetensors)

Usage

You can use this model through its own custom pipeline:

from transformers import pipeline

pipe = pipeline(
    task="gpt2-dna-translator",
    model="GustavoHCruz/DNATranslatorGPT2",
    trust_remote_code=True,
)

out = pipe({
    "sequence": "GTTTCTTTGCTTTTTAMGCTTGTATCTATTCTTCCATCGTAGACTGACCTGGTCATTTCTTTGCATCCAACGTA",
    "organism": "Homo sapiens",
})
print(out)  # LTWSFLCIQR

out = pipe({
    "sequence": "ACACCAGCCTAGTTCTATGTCAGGTTCTAAAATATTTTCTGGTTCAATAAATAAAACATCAACATCTCACATAAAAGAAGTACGGAAAAGATTTAAAGGCAGTAACATATGAACGTAGGACGTTTAGGAGAAAAATGCTAAAAAAGTAGCTATTGTTAATTGAACATTACTCAGGGATGATCGGTTGTTTTTGTATTGACTTACCAAGACCACCATTGCCGAGTGCTGCATCCATTTCACGTTCTTCTAATTCTTCAATATCTAAATTCAACTCATAAAGAGCTTAATCA",
    "organism": "Rotaria socialis",
})
print(out)  # MDAALGNGGL

This model uses the same maximum context length as standard GPT-2 (1024 tokens). Training ensured that the DNA sequence and the resulting protein always fit within this context. An additional (and highly recommended) context field is available: organism.

When using this pipeline, a few rules are applied to keep inference consistent with how the model was trained (sketched in code after the list):

  • DNA sequences will be limited to 1000 tokens (each nucleotide becomes a token).
  • The organism (raw text) is limited to a maximum of 10 characters.
  • The generated response is limited to 1024 tokens minus the size of the received input; when the input is at its maximum length, at least 11 new tokens can still be generated.
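
The pipeline enforces these limits internally. As a rough sketch of the same checks, assuming hypothetical names (validate_input and the constants below are illustrative, not part of the released code):

MAX_DNA_TOKENS = 1000     # each nucleotide becomes one token
MAX_ORGANISM_CHARS = 10   # organism text is capped at 10 characters
CONTEXT_LENGTH = 1024     # GPT-2 context window

def validate_input(sequence: str, organism: str = "") -> tuple[str, str, int]:
    if len(sequence) > MAX_DNA_TOKENS:
        raise ValueError(f"DNA sequence exceeds {MAX_DNA_TOKENS} nucleotides")
    organism = organism[:MAX_ORGANISM_CHARS]
    # Generation budget: whatever of the 1024-token context the prompt leaves free.
    # The +2 assumes <|DNA|> and <|ORGANISM|> each map to a single special token.
    budget = CONTEXT_LENGTH - (len(sequence) + len(organism) + 2)
    return sequence, organism, budget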

Custom Usage Information

Prompt format:

The model expects the following input format:

<|DNA|>[DNA_G][DNA_T][DNA_T][DNA_T]...<|ORGANISM|>Homo sapiens

The model generates a response in the following format:

<|PROTEIN|>[PROT_L][PROT_T][PROT_W]...<|END|>
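
If you prefer to bypass the custom pipeline, the sketch below builds this prompt by hand using standard transformers APIs and strips the [PROT_*] wrappers from the response. It is a minimal illustration under the format above, not the supported interface (the custom pipeline remains the recommended path), and the decoding details are assumptions:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GustavoHCruz/DNATranslatorGPT2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

dna = "GTTTCTTTGCTTT"  # any DNA string up to 1000 nucleotides
prompt = "<|DNA|>" + "".join(f"[DNA_{n}]" for n in dna) + "<|ORGANISM|>Homo sapiens"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
generated = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])

# Keep only the span between <|PROTEIN|> and <|END|>, then unwrap [PROT_X] tokens.
body = generated.split("<|PROTEIN|>")[-1].split("<|END|>")[0]
protein = body.replace("[PROT_", "").replace("]", "")
print(protein)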

Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, available at the DNA Coding Regions Dataset.


Training

  • Trained on 8× H100 GPUs.

Metrics

The model is still in the early stages of evaluation; it currently achieves an average similarity of approximately 0.75 with target sequences in the test set (computed from edit distance).
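
For reference, a similarity of this kind is commonly computed as one minus the edit distance normalized by the longer sequence's length. The snippet below is a generic sketch of that formula, not necessarily the exact evaluation code (which lives in the GitHub repository):

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(pred: str, target: str) -> float:
    if not pred and not target:
        return 1.0
    return 1 - levenshtein(pred, target) / max(len(pred), len(target))

print(similarity("LTWSFLCIQR", "LTWSFLCIQR"))  # 1.0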


GitHub Repository

The full code for data processing, model training, and inference is available on GitHub:
CodingDNATransformers

You can find scripts for:

  • Preprocessing GenBank sequences
  • Fine-tuning models
  • Evaluating and using the trained models