---
library_name: transformers
tags:
- pix2struct
- ocr
- receipts
- turkish
- image-to-text
- document-understanding
license: cc-by-nc-4.0
language:
- tr
base_model:
- google/pix2struct-base
metrics:
- val_edit_distance
---
## Model Details
This model is a fine-tuned version of [google/pix2struct-base](https://huggingface.co/google/pix2struct-base), trained on a private Turkish receipt dataset. It extracts structured text from receipt images, such as:
- Mağaza adı (store name)
- Toplam tutar (total amount)
- Tarih (date)
- Ürünler (line items)

Intended uses:
- Document understanding for Turkish receipts
- Key information extraction from scanned or photographed receipts
- Internal automation, ERP pre-filling, or financial apps
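
The model returns these fields as one flat string (see the sample output in the usage example), so downstream code has to split it back into fields. The sketch below is one rough way to do that with regular expressions. It assumes the layout of that single sample: store name, a 10 to 11 digit tax number, a `dd/mm/yyyy` date, items written as `name qty x price %vat`, and the total as the last amount. The function name and patterns are illustrative, not part of this model, and real outputs may not follow this layout.

```python
import re
from decimal import Decimal

# Patterns are assumptions derived from the single sample output in this card.
DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")
AMOUNT_RE = re.compile(r"\b\d+,\d{2}\b")
TAX_ID_RE = re.compile(r"\s+\d{10,11}\b")
ITEM_RE = re.compile(
    r"(?P<name>\S.*?)\s+(?P<qty>\d+)\s*x\s*(?P<price>\d+,\d{2})\s+%(?P<vat>\d{2})"
)


def to_decimal(amount: str) -> Decimal:
    """Convert a Turkish-style amount such as '39,75' to Decimal('39.75')."""
    return Decimal(amount.replace(",", "."))


def parse_receipt_text(text: str) -> dict:
    """Split a flat prediction string into store, date, items, and total."""
    date_match = DATE_RE.search(text)
    head = text[: date_match.start()] if date_match else text
    body = text[date_match.end():] if date_match else text

    # Store name: everything before the first tax-number-like digit run.
    store = TAX_ID_RE.split(head, maxsplit=1)[0].strip()

    items = [
        {
            "name": m.group("name").strip(),
            "quantity": int(m.group("qty")),
            "unit_price": to_decimal(m.group("price")),
            "vat_percent": int(m.group("vat")),
        }
        for m in ITEM_RE.finditer(body)
    ]

    # Total: assumed to be the last amount printed after the date.
    amounts = AMOUNT_RE.findall(body)
    return {
        "store": store or None,
        "date": date_match.group(0) if date_match else None,
        "items": items,
        "total": to_decimal(amounts[-1]) if amounts else None,
    }


SAMPLE = (
    "SOK MARKETLER T.A.S. 81301318199 21/02/2025 MIS UHT SUT LAKTOZSU 1 x 39,75 %01"
    "  LEZZCAFE SALEP 17 GR 4 x 18,00 %01 0,57 57,75"
)
parsed = parse_receipt_text(SAMPLE)
print(parsed["store"], parsed["date"], parsed["total"])
```

Because the model can hallucinate or drop fields, any parser like this should treat missing matches as normal and return `None` rather than raise.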
## ⚠️ Limitations
This model is an **experimental prototype** fine-tuned for Turkish receipt extraction using Pix2Struct on a custom dataset of approximately 1,100 receipts.
While it performs reasonably well on **short and clean receipts**, it has notable limitations:
- ❌ Performance degrades significantly on **long or complex receipts**
- ⚠️ **GPU usage is relatively high** due to the nature of the Pix2Struct architecture
- ❌ Not optimized for real-time or production-level use cases
- There are more robust and efficient alternatives.
During training, **various learning rates were tested**, but it’s possible that suboptimal hyperparameter tuning (especially learning rate and generation strategy) affected generalization.  
This may partially explain its weaker performance on longer receipts.
Additionally:
- ⚠️ The training data included **high-frequency entities**, such as the same store name (e.g. `"A101"`) appearing repeatedly. This may have caused the model to **hallucinate common store names** even when they're not present in the image.
- ⚠️ **Overfitting to dominant patterns** in the dataset (e.g. fixed receipt templates) may also reduce generalization
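
One way to check for the kind of entity skew described above is to count how often each store name appears in the training labels. The sketch below assumes the labels are available as a plain list of store-name strings. The function name, the 20% threshold, and the example values are hypothetical and do not reflect the actual dataset.

```python
from collections import Counter


def dominant_entities(store_names, threshold=0.2):
    """Return store names whose share of the labels is at least `threshold`.

    Names are stripped and upper-cased before counting so that trivial
    variants such as 'a101 ' and 'A101' are grouped together.
    """
    counts = Counter(name.strip().upper() for name in store_names)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        name: count / total
        for name, count in counts.most_common()
        if count / total >= threshold
    }


# Hypothetical labels, for illustration only.
example_labels = ["A101", "A101", "a101 ", "BIM", "SOK"]
print(dominant_entities(example_labels, threshold=0.5))
```

Names that dominate the labels are the ones most likely to be hallucinated, so they are good candidates for down-sampling or for extra manual review of predictions.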
We recommend using this model primarily for research, benchmarking, or experimentation purposes.
## 📊 Metrics
The model was evaluated on a private Turkish receipt test set.
| Metric            | Score   |
|-------------------|---------|
| val_edit_distance | 0.112   |
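
This card does not state how `val_edit_distance` is computed. A common definition is the Levenshtein distance between the predicted and reference text, divided by the length of the longer string, so that 0 means an exact match and 1 means nothing in common. The sketch below implements that definition as an assumption. It is not the evaluation code used for the score above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Levenshtein distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(prediction), len(reference))
    return levenshtein(prediction, reference) / longest if longest else 0.0


print(normalized_edit_distance("57,75", "57,15"))
```

Under this definition, averaging `normalized_edit_distance` over all validation pairs would give a single score comparable in scale to the one reported in the table.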
## 📥 How to Use
```python
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
from PIL import Image
import torch
model_id = "turgutguvercin/pix2struct-turkish-receipts"
# Load model and processor
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
processor = Pix2StructProcessor.from_pretrained(model_id)
# Put model in eval mode and move to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
# Load your image (make sure it's in RGB)
image_path = "test_1.jpg"
image = Image.open(image_path).convert("RGB")
# Preprocess image
inputs = processor(images=image, return_tensors="pt").to(device)
# Generate prediction with greedy decoding
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=768,
        num_beams=1,  # greedy decoding (no beam search) for faster processing
        do_sample=False,
        use_cache=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )
# Decode tokens
predicted_text = processor.decode(outputs[0], skip_special_tokens=True)
print(predicted_text)
# Example output:
# SOK MARKETLER T.A.S. 81301318199 21/02/2025 MIS UHT SUT LAKTOZSU 1 x 39,75 %01  LEZZCAFE SALEP 17 GR 4 x 18,00 %01 0,57 57,75
```