---
library_name: transformers
tags:
- pix2struct
- ocr
- receipts
- turkish
- image-to-text
- document-understanding
license: cc-by-nc-4.0
language:
- tr
base_model:
- google/pix2struct-base
metrics:
- val_edit_distance
---

## Model Details

This model is a fine-tuned version of [google/pix2struct-base](https://huggingface.co/google/pix2struct-base), trained on a private Turkish receipt dataset. It can extract structured text such as:

- Mağaza adı (store name)
- Toplam tutar (total amount)
- Tarih (date)
- Ürünler (line items)

Intended uses:

- Document understanding for Turkish receipts
- Key information extraction from scanned or photographed receipts
- Internal automation, ERP pre-filling, or financial apps

## ⚠️ Limitations

This model is an **experimental prototype** fine-tuned for Turkish receipt extraction using Pix2Struct on a custom dataset. Approximately 1,100 receipts were used for fine-tuning. While it performs reasonably well on **short and clean receipts**, it has notable limitations:

- ❌ Performance degrades significantly on **long or complex receipts**
- ⚠️ **GPU usage is relatively high** due to the nature of the Pix2Struct architecture
- ❌ Not optimized for real-time or production-level use cases
- There are more robust and efficient alternatives.

During training, **various learning rates were tested**, but it’s possible that suboptimal hyperparameter tuning (especially learning rate and generation strategy) affected generalization. This may partially explain the model's weaker performance on longer receipts.

Additionally:

- ⚠️ The training data included **high-frequency entities**, such as the same store name (e.g. `"A101"`) appearing repeatedly. This may have caused the model to **hallucinate common store names** even when they're not present in the image.
- ⚠️ **Overfitting to dominant patterns** in the dataset (e.g.
fixed receipt templates) may also reduce generalization.

We recommend using this model primarily for research, benchmarking, or experimentation purposes.

## 📊 Metrics

The model was evaluated on a private Turkish receipt test set.

| Metric            | Score |
|-------------------|-------|
| val_edit_distance | 0.112 |

## 📥 How to Use

```python
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
from PIL import Image
import torch

model_id = "turgutguvercin/pix2struct-turkish-receipts"

# Load model and processor
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
processor = Pix2StructProcessor.from_pretrained(model_id)

# Put model in eval mode and move to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Load your image (make sure it's in RGB)
image_path = "test_1.jpg"
image = Image.open(image_path).convert("RGB")

# Preprocess image
inputs = processor(images=image, return_tensors="pt").to(device)

# Generate prediction
# Note: early_stopping is omitted because it only affects beam search,
# and num_beams=1 selects greedy decoding.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=768,
        num_beams=1,  # Greedy decoding (no beam search) for faster processing
        do_sample=False,
        use_cache=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Decode tokens
predicted_text = processor.decode(outputs[0], skip_special_tokens=True)
print(predicted_text)
# Example output:
# SOK MARKETLER T.A.S. 81301318199 21/02/2025 MIS UHT SUT LAKTOZSU 1 x 39,75 %01 LEZZCAFE SALEP 17 GR 4 x 18,00 %01 0,57 57,75
```
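
This card does not define the `val_edit_distance` score reported under Metrics. The sketch below shows one common way to compute such a score for comparing your own predictions against ground-truth text. It assumes a character-level Levenshtein distance divided by the length of the longer string. That normalization and the helper names `levenshtein` and `normalized_edit_distance` are illustrative assumptions, not the evaluation code used for this model.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Assumed normalization: distance / longer length; 0.0 means identical."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest
```

Under this assumed definition, a score of 0.112 would mean that about 11% of characters need to be edited to turn a prediction into its reference, on average. That reading only holds if the actual metric was normalized the same way.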