DNA To Proteins Translator

A GPT-2 model fine-tuned to translate DNA into protein sequences, trained on a large cross-species GenBank dataset.


Model Architecture

  • Base model: GPT-2
  • Approach: DNA-to-protein translation
  • Size: ~0.1B parameters (F32 safetensors)

Usage

You can use this model through its own custom pipeline:

from transformers import pipeline

pipe = pipeline(
    task="gpt2-dna-translator",
    model="GustavoHCruz/DNATranslatorGPT2",
    trust_remote_code=True,
)

out = pipe({
    "sequence": "GTTTCTTTGCTTTTTAMGCTTGTATCTATTCTTCCATCGTAGACTGACCTGGTCATTTCTTTGCATCCAACGTA",
    "organism": "Homo sapiens",
})
print(out)  # LTWSFLCIQR

out = pipe({
    "sequence": "ACACCAGCCTAGTTCTATGTCAGGTTCTAAAATATTTTCTGGTTCAATAAATAAAACATCAACATCTCACATAAAAGAAGTACGGAAAAGATTTAAAGGCAGTAACATATGAACGTAGGACGTTTAGGAGAAAAATGCTAAAAAAGTAGCTATTGTTAATTGAACATTACTCAGGGATGATCGGTTGTTTTTGTATTGACTTACCAAGACCACCATTGCCGAGTGCTGCATCCATTTCACGTTCTTCTAATTCTTCAATATCTAAATTCAACTCATAAAGAGCTTAATCA",
    "organism": "Rotaria socialis",
})
print(out)  # MDAALGNGGL

This model uses the same maximum context length as standard GPT-2 (1024 tokens). Training ensured that the DNA sequence and the resulting protein always fit within this context. An additional (and highly recommended) context field is available: organism.

When using this pipeline, a few rules are applied to keep inference consistent with how the model was trained (sketched in code after the list):

  • DNA sequences will be limited to 1000 tokens (each nucleotide becomes a token).
  • The organism (raw text) is limited to a maximum of 10 characters.
  • The generated response is limited to 1024 tokens minus the size of the received input; when the input is at its maximum length, at least 11 new tokens can still be generated.
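
The pipeline enforces these limits internally. As a rough sketch of the same checks, assuming hypothetical names (validate_input and the constants below are illustrative, not part of the released code):

MAX_DNA_TOKENS = 1000     # each nucleotide becomes one token
MAX_ORGANISM_CHARS = 10   # organism text is capped at 10 characters
CONTEXT_LENGTH = 1024     # GPT-2 context window

def validate_input(sequence: str, organism: str = "") -> tuple[str, str, int]:
    if len(sequence) > MAX_DNA_TOKENS:
        raise ValueError(f"DNA sequence exceeds {MAX_DNA_TOKENS} nucleotides")
    organism = organism[:MAX_ORGANISM_CHARS]
    # Generation budget: whatever of the 1024-token context the prompt leaves free.
    # The +2 assumes <|DNA|> and <|ORGANISM|> each map to a single special token.
    budget = CONTEXT_LENGTH - (len(sequence) + len(organism) + 2)
    return sequence, organism, budget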

Custom Usage Information

Prompt format:

The model expects the following input format:

<|DNA|>[DNA_G][DNA_T][DNA_T][DNA_T]...<|ORGANISM|>Homo sapiens

The model generates a response in the following format:

<|PROTEIN|>[PROT_L][PROT_T][PROT_W]...<|END|>
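
If you prefer to bypass the custom pipeline, the sketch below builds this prompt by hand using standard transformers APIs and strips the [PROT_*] wrappers from the response. It is a minimal illustration under the format above, not the supported interface (the custom pipeline remains the recommended path), and the decoding details are assumptions:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GustavoHCruz/DNATranslatorGPT2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

dna = "GTTTCTTTGCTTT"  # any DNA string up to 1000 nucleotides
prompt = "<|DNA|>" + "".join(f"[DNA_{n}]" for n in dna) + "<|ORGANISM|>Homo sapiens"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
generated = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])

# Keep only the span between <|PROTEIN|> and <|END|>, then unwrap [PROT_X] tokens.
body = generated.split("<|PROTEIN|>")[-1].split("<|END|>")[0]
protein = body.replace("[PROT_", "").replace("]", "")
print(protein)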

Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, available at the DNA Coding Regions Dataset.


Training

  • Trained on 8× H100 GPUs.

Metrics

The model is still in the early stages of evaluation; it currently achieves an average similarity of approximately 0.75 with target sequences in the test set (computed from edit distance).
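
For reference, a similarity of this kind is commonly computed as one minus the edit distance normalized by the longer sequence's length. The snippet below is a generic sketch of that formula, not necessarily the exact evaluation code (which lives in the GitHub repository):

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(pred: str, target: str) -> float:
    if not pred and not target:
        return 1.0
    return 1 - levenshtein(pred, target) / max(len(pred), len(target))

print(similarity("LTWSFLCIQR", "LTWSFLCIQR"))  # 1.0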


GitHub Repository

The full code for data processing, model training, and inference is available on GitHub:
CodingDNATransformers

You can find scripts for:

  • Preprocessing GenBank sequences
  • Fine-tuning models
  • Evaluating and using the trained models