---
license: mit
language:
- en
library_name: onnx
tags:
- biology
- protein
- onnx
- encoder-only
- embeddings
- bioinformatics
- protein-language-model
- t5
- transformer
- half-precision
- fp16
datasets:
- UniRef50
inference: true
model-index:
- name: prot-t5-xl-uniref50-enc-onnx
  results:
  - task:
      type: feature-extraction
      name: Protein Embedding Generation
    metrics:
    - type: cosine_similarity
      value: 1.000244
      name: PyTorch-ONNX Similarity
    - type: max_absolute_difference
      value: 0.001953
      name: Maximum Absolute Difference
---

# ProtT5-XL-UniRef50 Encoder (ONNX, Half-Precision)

**An optimized ONNX version of the encoder-only, half-precision ProtT5-XL-UniRef50 model for efficient protein embeddings.**

This is an ONNX-converted version of [`Rostlab/prot_t5_xl_half_uniref50-enc`](https://huggingface.co/Rostlab/prot_t5_xl_half_uniref50-enc), optimized for production inference.

## Model Description

ProtT5-XL-UniRef50 is based on the T5-3B model and was pretrained on a large corpus of protein sequences in a self-supervised fashion. This ONNX version contains only the encoder, stored in half precision (float16), which is all that is needed to generate per-residue and per-protein embeddings.

**Key Features:**

- 🚀 **Optimized Inference**: ONNX Runtime for more flexible inference deployment
- 💾 **Reduced Memory**: Half-precision with external weights (2.25GB total)
- ⚡ **Hardware Agnostic**: Supports CPU, GPU, and specialized accelerators

### Conversion Details

This model was converted from the original PyTorch model with the following optimizations:

- **ONNX Opset**: Version 14 for broad compatibility
- **Precision**: FP16 (half-precision) maintained from original
- **External Weights**: Large weight matrices stored separately for efficient loading
- **Dynamic Shapes**: Supports variable batch sizes and sequence lengths

## Performance Metrics

**Accuracy Validation** (PyTorch vs ONNX):

- **Average Cosine Similarity**: 1.000244 (near-perfect preservation; cosine similarity is bounded by 1, so the excess is floating-point rounding error)
- **Maximum Absolute Difference**: 0.001953 (minimal numerical differences)
- **All test cases**: Cosine similarity > 0.998
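
The two accuracy metrics can be reproduced with a few lines of NumPy. The sketch below is not taken from `test_onnx.py`; the helper name `compare_embeddings` and the synthetic arrays are illustrative assumptions standing in for real PyTorch and ONNX outputs. It upcasts to float64 before comparing, which keeps the cosine similarity from drifting above 1 through rounding.

```python
import numpy as np


def compare_embeddings(reference: np.ndarray, candidate: np.ndarray) -> tuple[float, float]:
    """Return (cosine similarity, max absolute difference) for two same-shape arrays."""
    # Upcast so the comparison itself adds no half-precision rounding error.
    ref = reference.astype(np.float64).ravel()
    cand = candidate.astype(np.float64).ravel()
    cosine = float(ref @ cand / (np.linalg.norm(ref) * np.linalg.norm(cand)))
    max_abs_diff = float(np.max(np.abs(ref - cand)))
    return cosine, max_abs_diff


# Synthetic stand-ins for PyTorch and ONNX outputs (NOT real model data).
rng = np.random.default_rng(0)
pytorch_out = rng.standard_normal((2, 8, 1024)).astype(np.float16)
onnx_out = (pytorch_out.astype(np.float32) + 1e-3).astype(np.float16)

cosine, max_abs_diff = compare_embeddings(pytorch_out, onnx_out)
print(f"cosine={cosine:.6f} max_abs_diff={max_abs_diff:.6f}")
```

With real data, `reference` would be the PyTorch encoder's last hidden state and `candidate` the first ONNX output for the same tokenized batch.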

**Inference Performance** (example results):

- **Short sequences** (7-20 AAs): ~750ms
- **Long sequences** (140+ AAs): ~1.1s  
- **Performance varies** by hardware and ONNX Runtime providers

**Performance Notes:**

- **Apple Silicon (M4 Macs)**: PyTorch with MPS backend typically outperforms ONNX CPU inference due to optimized GPU acceleration. Use PyTorch for local M4 development.
- **NVIDIA GPUs**: ONNX Runtime with CUDA provider is often competitive with or faster than PyTorch for inference, thanks to aggressive graph optimizations and inference-specific tuning.
- **Deployment**: ONNX provides consistent cross-platform performance and easier production deployment regardless of hardware.
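
Since performance depends on the execution provider, it can help to pick providers explicitly rather than rely on defaults. The following is a minimal sketch; `pick_providers` is a hypothetical helper, and the preference order (CUDA first, CPU as fallback) is an assumption rather than a recommendation from the model authors.

```python
def pick_providers(available: list[str]) -> list[str]:
    """Order ONNX Runtime execution providers by preference, keeping only available ones."""
    # Assumed preference order: GPU first, CPU as the universal fallback.
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [name for name in preferred if name in available]
    # Fall back to CPU if nothing in the preference list was reported.
    return chosen or ["CPUExecutionProvider"]


print(pick_providers(["CPUExecutionProvider", "CUDAExecutionProvider"]))
```

With ONNX Runtime installed, the result can be passed to the session as `ort.InferenceSession("model.onnx", providers=pick_providers(ort.get_available_providers()))`. The CUDA provider is only reported when the `onnxruntime-gpu` package is installed.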

## Getting the Model

### System Requirements

```bash
pip install onnxruntime transformers sentencepiece numpy
```

### Option 1: Hugging Face CLI (Recommended)

First, [install the Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli).

```bash
# Download the model
huggingface-cli download Rostlab/prot-t5-xl-uniref50-enc-onnx --local-dir ./prot_t5_onnx
```

### Option 2: Git LFS

```bash
# Clone the repository (requires Git LFS for large files)
git lfs install
git clone https://huggingface.co/Rostlab/prot-t5-xl-uniref50-enc-onnx
cd prot-t5-xl-uniref50-enc-onnx
```

## Usage

### ONNX Runtime (Recommended)

```python
import onnxruntime as ort
import numpy as np
from transformers import T5Tokenizer
import re

# Load tokenizer from local directory (after download)
tokenizer = T5Tokenizer.from_pretrained("./", do_lower_case=False, legacy=False)

# Load ONNX model
session = ort.InferenceSession("model.onnx")

# Example sequences
sequences = ["PRTEINO", "SEQWENCE"]

# Preprocess: replace rare amino acids and add spaces
sequences = [" ".join(list(re.sub(r"[UZOB]", "X", seq))) for seq in sequences]

# Tokenize
inputs = tokenizer(sequences, return_tensors="np", padding=True, truncation=False)

# Run inference
outputs = session.run(
    None, 
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64)
    }
)

embeddings = outputs[0]  # Shape: [batch_size, seq_len, 1024]

# Extract per-residue embeddings (drop padding and the trailing </s> token;
# the T5 tokenizer adds no leading special token, so residues start at index 0)
emb_0 = embeddings[0, :len(sequences[0].split())]  # First sequence
emb_1 = embeddings[1, :len(sequences[1].split())]  # Second sequence

# Per-protein embeddings (mean pooling)
protein_emb_0 = np.mean(emb_0, axis=0)  # Shape: [1024]
protein_emb_1 = np.mean(emb_1, axis=0)  # Shape: [1024]

print(f"Protein 1 embedding shape: {protein_emb_0.shape}")
print(f"Protein 2 embedding shape: {protein_emb_1.shape}")
```
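
For larger batches, slicing each sequence by hand becomes tedious. A vectorized alternative is to mean-pool with the attention mask, as sketched below. The helper `masked_mean_pool` is an illustrative assumption, not part of this repository; it assumes right-padded inputs where the last attended position of each row is the `</s>` token, and it runs here on toy arrays rather than real model output.

```python
import numpy as np


def masked_mean_pool(embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average residue embeddings per sequence, ignoring padding and the final </s> token."""
    mask = attention_mask.astype(bool).copy()
    # With right padding, the last attended position in each row is </s>; exclude it.
    last = mask.sum(axis=1) - 1
    mask[np.arange(mask.shape[0]), last] = False
    weights = mask[..., None].astype(np.float32)
    # Accumulate in float32 to avoid half-precision overflow on long sequences.
    summed = (embeddings.astype(np.float32) * weights).sum(axis=1)
    return summed / weights.sum(axis=1)


# Toy batch: 2 sequences, 4 positions, 3 features (stand-in for [batch, seq_len, 1024]).
toy_embeddings = np.arange(24, dtype=np.float16).reshape(2, 4, 3)
toy_mask = np.array([[1, 1, 1, 0], [1, 1, 1, 1]], dtype=np.int64)

pooled = masked_mean_pool(toy_embeddings, toy_mask)
print(pooled.shape)
```

In the usage example above, the equivalent call would be `masked_mean_pool(outputs[0], inputs["attention_mask"])`.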



## Model Architecture

- **Base Model**: T5-3B Encoder
- **Hidden Size**: 1024
- **Layers**: 24
- **Attention Heads**: 32
- **Parameters**: ~1.2B (encoder only)
- **Precision**: FP16 (half-precision)
- **Input**: Tokenized amino acid sequences
- **Output**: Dense embeddings (1024-dimensional)

## Training Data

Pretrained on protein sequences from UniRef50 with a BART-like masked language modeling (MLM) denoising objective:

- **Masking**: 15% of amino acids randomly masked
- **Vocabulary**: 20 standard amino acids + special tokens
- **Preprocessing**: Rare amino acids (U/Z/O/B) replaced with X
- **Format**: Space-separated amino acids (required for T5 tokenizer)
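
The same preprocessing has to be applied at inference time. A small helper along these lines keeps it in one place; the name `prepare_sequence` is hypothetical, and the uppercasing step is an added convenience rather than something the original pipeline is documented to do.

```python
import re


def prepare_sequence(sequence: str) -> str:
    """Uppercase, map rare residues U/Z/O/B to X, and space-separate the residues."""
    cleaned = re.sub(r"[UZOB]", "X", sequence.strip().upper())
    # Joining a string with spaces puts one space between every character.
    return " ".join(cleaned)


print(prepare_sequence("PRTEINO"))  # P R T E I N X
```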

## Intended Use

### Primary Use Cases

- **Protein Embeddings**: Generate dense vector representations of proteins
- **Feature Extraction**: Create features for downstream ML models
- **Similarity Analysis**: Compute protein sequence similarities
- **Protein Classification**: As feature extractor for classification tasks
- **Production Inference**: High-throughput protein processing
- **Model Deployment**: Optimized inference for serving applications

### Limitations

- **Uppercase Only**: Requires uppercase amino acid sequences
- **Memory Requirements**: ~3GB total (model + weights)
- **Sequence Length**: Optimal for sequences up to 1024 amino acids
- **Domain**: Limited to natural protein sequences
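
Because padded positions still cost memory and compute, one way to work within these limits is to group sequences of similar length. The sketch below is an assumption-laden illustration: `length_sorted_batches` is a hypothetical helper, and the `max_residues` budget (batch size times longest sequence) must be tuned to the available memory.

```python
def length_sorted_batches(sequences: list[str], max_residues: int = 4000):
    """Yield batches of (original_index, sequence) whose padded size stays within a budget."""
    indexed = sorted(enumerate(sequences), key=lambda item: len(item[1]))
    batch, longest = [], 0
    for idx, seq in indexed:
        longest_if_added = max(longest, len(seq))
        # Padded size = number of sequences x longest sequence in the batch.
        if batch and longest_if_added * (len(batch) + 1) > max_residues:
            yield batch
            batch, longest_if_added = [], len(seq)
        batch.append((idx, seq))
        longest = longest_if_added
    if batch:
        yield batch


demo = ["A" * 10, "A" * 3, "A" * 5, "A" * 4]
batch_indices = [[idx for idx, _ in b] for b in length_sorted_batches(demo, max_residues=12)]
print(batch_indices)
```

The original indices are kept so that embeddings can be restored to input order after inference. A single sequence longer than the budget still gets its own batch.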

## Repository Contents

### Model Files

```bash
├── model.onnx                    # Main ONNX model (573KB)
├── shared.weight                 # Shared parameters (256KB)
├── onnx__MatMul_[0-143]          # External weight matrices (2.25GB total)
├── spiece.model                  # SentencePiece tokenizer (238KB)
├── tokenizer_config.json         # Tokenizer configuration
├── special_tokens_map.json       # Special tokens mapping
└── added_tokens.json             # Additional tokens
```
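
ONNX resolves external weight files relative to `model.onnx`, so an incomplete download typically fails at session creation. A quick pre-flight check might look like the sketch below. The helper `missing_model_files` is hypothetical, and the `onnx__MatMul_*` glob assumes the external weight files share that prefix, as the listing suggests.

```python
import tempfile
from pathlib import Path

# File names taken from the listing above.
REQUIRED_FILES = [
    "model.onnx",
    "shared.weight",
    "spiece.model",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "added_tokens.json",
]


def missing_model_files(model_dir: str) -> list[str]:
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Assumption: external weight matrices are stored as files prefixed onnx__MatMul_.
    if not any(root.glob("onnx__MatMul_*")):
        missing.append("onnx__MatMul_*")
    return missing


# Demo on a throwaway directory that contains only the graph file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model.onnx").touch()
    demo_missing = missing_model_files(tmp)
print(demo_missing)
```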

### Scripts

```bash
├── convert.py                    # ONNX conversion script
└── test_onnx.py                  # Model validation and testing
```

## Conversion & Testing Scripts

### convert.py

Script to convert ProtT5 encoder models to ONNX format:

```bash
# Convert the original model to ONNX
python convert.py --model_name Rostlab/prot_t5_xl_half_uniref50-enc --output_dir ./prot_t5_onnx

# Options:
# --model_name: Hugging Face model identifier
# --output_dir: Directory to save converted model
# --max_sequence_length: Maximum sequence length (default: 1024)
# --fp16: Use half precision (default: True)
# --no_fp16: Disable half precision
```

**Features:**

- Converts to ONNX opset 14 with dynamic shapes
- Preserves half-precision (FP16) from original model
- Exports external weights for large matrices
- Includes tokenizer files for complete package

### test_onnx.py

Script to validate ONNX model accuracy and performance:

```bash
# Test the converted model
python test_onnx.py --onnx_dir ./prot_t5_onnx

# Test with custom sequences
python test_onnx.py --onnx_dir ./prot_t5_onnx --sequences "MKFVPKX" "ACDEFG"

# Options:
# --onnx_dir: Directory containing ONNX model and tokenizer
# --original_model: Original PyTorch model for comparison
# --sequences: Custom protein sequences to test
```

**Validation Features:**

- **Accuracy Testing**: Compares PyTorch vs ONNX outputs
- **Performance Benchmarking**: Measures inference speed
- **Cosine Similarity**: Validates embedding preservation (>99.8%)
- **Rare Amino Acids**: Tests U/Z/O/B → X replacement
- **Variable Lengths**: Tests sequences from short to long

**Example Output:**

```txt
PROTT5 ONNX TEST RESULTS
============================================================
Accuracy Tests: ✅ PASS
  Average Cosine Similarity: 1.000244
  Maximum Absolute Difference: 0.001953

Performance Results:
  PyTorch Average Time: 0.2140s
  ONNX Average Time: 0.1580s
  Speedup: 1.35x
============================================================
```

## Conversion Process

This ONNX model was converted using the following process:

1. **Base Model**: `Rostlab/prot_t5_xl_half_uniref50-enc`
2. **Conversion Tool**: PyTorch → ONNX with optimizations
3. **Validation**: Comprehensive accuracy testing vs original PyTorch model
4. **Optimization**: External weights, dynamic shapes, FP16 precision

## Citation

```bibtex
@article{Elnaggar2020.07.12.199554,
    author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and BHOWMIK, DEBSINDHU and Rost, Burkhard},
    title = {ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing},
    elocation-id = {2020.07.12.199554},
    year = {2020},
    doi = {10.1101/2020.07.12.199554},
    publisher = {Cold Spring Harbor Laboratory},
    journal = {bioRxiv}
}
```

## License

MIT License 

## Related Models

- **Original PyTorch**: [`Rostlab/prot_t5_xl_half_uniref50-enc`](https://huggingface.co/Rostlab/prot_t5_xl_half_uniref50-enc)
- **Full Precision**: [`Rostlab/prot_t5_xl_uniref50`](https://huggingface.co/Rostlab/prot_t5_xl_uniref50)
- **Other ProtTrans Models**: [ProtTrans Collection](https://huggingface.co/models?search=prot_t5)

## Contact

For questions about this ONNX conversion or deployment, please open an issue on the model repository.