---
library_name: transformers
tags:
- pix2struct
- ocr
- receipts
- turkish
- image-to-text
- document-understanding
license: cc-by-nc-4.0
language:
- tr
base_model:
- google/pix2struct-base
metrics:
- val_edit_distance
---


## Model Details

This model is a fine-tuned version of [google/pix2struct-base](https://huggingface.co/google/pix2struct-base), trained on a private Turkish receipt dataset. It extracts structured fields such as:

- Mağaza adı (store name)
- Toplam tutar (total amount)
- Tarih (date)
- Ürünler (line items)
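
The model emits these fields as a tag-delimited string (the `<s_...>` markup visible in the example output under How to Use). The following is a minimal sketch of turning such a string into a nested dictionary. It assumes the tags are well formed and that `<sep/>` lists nest only one level deep. `parse_fields` and `parse_value` are hypothetical helpers written for illustration, not part of `transformers`, and the sample string is a shortened version of the example output.

```python
import re

# Matches <s_key>value</s_key>; the backreference pairs each opening tag
# with the closing tag of the same name.
TAG = re.compile(r"<s_(\w+)>(.*?)</s_\1>", re.DOTALL)


def parse_value(value):
    """Parse one field value; a value containing <sep/> becomes a list."""
    if "<sep/>" in value:
        return [parse_fields(part) for part in value.split("<sep/>") if part.strip()]
    return parse_fields(value)


def parse_fields(text):
    """Convert tagged text to a dict, or return the stripped text for a leaf."""
    fields = TAG.findall(text)
    if not fields:
        return text.strip()
    return {key: parse_value(value) for key, value in fields}


# Shortened sample in the same markup as the model's output.
sample = (
    "<s_store_name> SOK MARKETLER T.A.S.</s_store_name>"
    "<s_date> 21/02/2025</s_date>"
    "<s_menu><s_nm> MIS UHT SUT LAKTOZSU</s_nm><s_price> 39,75</s_price> <sep/>"
    "<s_nm> LEZZCAFE SALEP 17 GR</s_nm><s_price> 18,00</s_price></s_menu>"
    "<s_total><s_total_price> 57,75</s_total_price></s_total>"
)

receipt = parse_fields(sample)
print(receipt["store_name"])
print(receipt["menu"])
```

Note that a receipt with a single line item contains no `<sep/>`, so `menu` comes back as a dict rather than a list in this sketch; callers may want to normalize that case. Malformed or truncated generations will simply yield fewer fields.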

Intended uses:

- Document understanding for Turkish receipts
- Key information extraction from scanned or photographed receipts
- Internal automation, ERP pre-filling, or financial apps

## ⚠️ Limitations

This model is an **experimental prototype** fine-tuned for Turkish receipt extraction using Pix2Struct on a custom dataset. Approximately 1,100 receipts were used to fine-tune it.

While it performs reasonably well on **short and clean receipts**, it has notable limitations:

- ❌ Performance degrades significantly on **long or complex receipts**
- ⚠️ **GPU usage is relatively high** due to the nature of the Pix2Struct architecture
- ❌ Not optimized for real-time or production-level use cases
- ⚠️ More robust and efficient alternatives exist for production workloads

During training, **various learning rates were tested**, but it’s possible that suboptimal hyperparameter tuning (especially learning rate and generation strategy) affected generalization.  
This may partially explain its weaker performance on longer receipts.

Additionally:
- ⚠️ The training data included **high-frequency entities**, such as the same store name (e.g. `"A101"`) appearing repeatedly. This may have caused the model to **hallucinate common store names** even when they're not present in the image.
- ⚠️ **Overfitting to dominant patterns** in the dataset (e.g. fixed receipt templates) may also reduce generalization.

We recommend using this model primarily for research, benchmarking, or experimentation purposes.

## 📊 Metrics

The model was evaluated on a private Turkish receipt test set.

| Metric            | Score   |
|-------------------|---------|
| val_edit_distance | 0.112   |
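
This card does not state how `val_edit_distance` is computed. One common definition for this kind of task is the Levenshtein distance between the predicted and reference strings, divided by the length of the longer string, so that 0 means an exact match and 1 a complete mismatch. Below is a minimal sketch under that assumption; the function names are illustrative and not taken from the training code.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and substitutions
    needed to turn `a` into `b` (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Levenshtein distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(prediction), len(reference))
    return levenshtein(prediction, reference) / longest if longest else 0.0


# One wrong character out of five.
print(normalized_edit_distance("57,75", "57,15"))
```

If the metric was computed this way, a score of 0.112 would correspond to roughly 11 character edits per 100 characters of the longer string, averaged over the evaluation set.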


## 📥 How to Use

```python
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
from PIL import Image
import torch

model_id = "turgutguvercin/pix2struct-turkish-receipts"

# Load model and processor
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
processor = Pix2StructProcessor.from_pretrained(model_id)

# Put model in eval mode and move to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()


# Load your image (make sure it's in RGB)
image_path = "test_1.jpg"
image = Image.open(image_path).convert("RGB")

# Preprocess image
inputs = processor(images=image, return_tensors="pt").to(device)

# Generate prediction with greedy decoding (num_beams=1 disables beam search).
# early_stopping is omitted because it only affects beam search.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=768,
        num_beams=1,
        do_sample=False,
        use_cache=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Decode tokens
predicted_text = processor.decode(outputs[0], skip_special_tokens=True)
print(predicted_text)
# Example output:
# <s_store_name> SOK MARKETLER T.A.S.</s_store_name><s_tax_id> 81301318199</s_tax_id><s_date> 21/02/2025</s_date><s_menu><s_nm> MIS UHT SUT LAKTOZSU</s_nm><s_cnt> 1 x</s_cnt><s_price> 39,75</s_price><s_tax_rate> %01</s_tax_rate> <sep/><s_nm> LEZZCAFE SALEP 17 GR</s_nm><s_cnt> 4 x</s_cnt><s_price> 18,00</s_price><s_tax_rate> %01</s_tax_rate></s_menu><s_sub_total><s_tax_price> 0,57</s_tax_price></s_sub_total><s_total><s_total_price> 57,75</s_total_price></s_total>