Japanese-Receipt-VL-3B-JSON

Model Description

Japanese-Receipt-VL-3B-JSON is a fine-tuned vision-language model based on Qwen2.5-VL-3B, specifically optimized for Japanese receipt OCR and structured data extraction. The model processes mobile phone-captured receipt images and outputs structured JSON containing store information, itemized purchases, tax calculations, and payment details.

Dataset: Trained on the Japanese-Mobile-Receipt-OCR-1K dataset.

Model Details

Model Name: Japanese-Receipt-VL-3B-JSON
Dataset Name: Japanese-Mobile-Receipt-OCR-1K
Base Model: Qwen/Qwen2.5-VL-3B-Instruct
Model Type: Vision-Language Model (Multimodal)
Language: Japanese (preserves original text exactly as printed)
License: Same as base model
Fine-tuning Method: LoRA (Low-Rank Adaptation)
Training Images: 1,000 real-world Japanese receipt images
Total Dataset: 1,147 collected images
Output Format: Structured JSON with Japanese keys
Extraction Approach: Comprehensive, exact preservation without translation or interpretation

Intended Use

Primary Use Cases

Japanese retail receipt digitization
Expense tracking and management systems
Business accounting automation
Mobile receipt scanning applications
E-commerce and point-of-sale integrations

Input

Mobile phone-captured images of Japanese receipts
Supported formats: JPG, PNG
Optimal resolution: 640x896 px (portrait) or 896x640 px (landscape), given as width x height

Output

Structured JSON containing:

{
    "店舗名": "セブンイレブン渋谷店",
    "日付": "2024年01月15日",
    "時刻": "14:30",
    "レシートNo": "0001234",
    "商品リスト": [
        {
            "商品名": "おにぎり鮭",
            "数量": 1,
            "単価": 128,
            "金額": 128
        }
    ],
    "小計": 840,
    "消費税": 84,
    "合計": 924,
    "支払方法": "現金",
    "お預り": 1000,
    "お釣り": 76
}
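Because the output carries both line items and totals, a consumer can run a quick arithmetic sanity check before trusting an extraction. The sketch below is illustrative only: the function name check_receipt_totals is hypothetical, it assumes the numeric values and key names of the example above, and it assumes tax-exclusive pricing (小計 + 消費税 = 合計). Real outputs may use different keys or string values, in which case the check simply skips what it cannot find.

```python
def check_receipt_totals(receipt):
    """Return a list of human-readable inconsistencies (empty if none found).

    Assumes the keys and numeric values of the example output; any field
    that is missing is skipped rather than reported.
    """
    problems = []

    subtotal = receipt.get("小計")
    tax = receipt.get("消費税")
    total = receipt.get("合計")
    if None not in (subtotal, tax, total) and subtotal + tax != total:
        problems.append(f"小計 + 消費税 = {subtotal + tax}, but 合計 = {total}")

    paid = receipt.get("お預り")
    change = receipt.get("お釣り")
    if None not in (paid, total, change) and paid - total != change:
        problems.append(f"お預り - 合計 = {paid - total}, but お釣り = {change}")

    for index, item in enumerate(receipt.get("商品リスト", [])):
        qty, unit, amount = item.get("数量"), item.get("単価"), item.get("金額")
        if None not in (qty, unit, amount) and qty * unit != amount:
            problems.append(f"item {index}: 数量 x 単価 = {qty * unit}, but 金額 = {amount}")

    return problems


example = {
    "商品リスト": [{"商品名": "おにぎり鮭", "数量": 1, "単価": 128, "金額": 128}],
    "小計": 840,
    "消費税": 84,
    "合計": 924,
    "お預り": 1000,
    "お釣り": 76,
}
print(check_receipt_totals(example))
```

An empty list means the totals agree with each other. It does not prove the OCR is correct, only that the extracted numbers are internally consistent.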

Prompting Guidelines

Optimized Instruction Prompt

This model was trained with a specific instruction prompt that ensures comprehensive and accurate extraction. For best results, use the exact prompt shown in the Basic Usage example in the Usage section.

Key Prompt Features

Comprehensive extraction: Captures all visible information including store details, timestamps, itemized products, tax calculations, and payment information
Exact preservation: Maintains original Japanese text, formatting, and symbols without translation or modification
JSON structure: Uses Japanese field names as keys for cultural and linguistic consistency
No inference: Only extracts explicitly visible information, avoiding assumptions or corrections
Complete coverage: Includes often-missed elements like receipt numbers, cashier names, footer messages, and promotional text

Sample Results

Example Input

Example Output

{
  "領収書": {
    "店舗情報": [
      {
        "店名": "業務スーパー河内屋",
        "場所": "青葉台店",
        "電話": "045-985-9603"
      }
    ]
  },
  "登録番号": "T7011701002269",
  "メッセージ": {
    "内容": "プロの品質とプロの価格 業務スーパーでは毎日のお買い得となっています。"
  },
  "発行日": {
    "月日年分類": "72025年 7月12日",
    "担当者": "土田"
  },
  "商品情報": [
    {
      "名前": "鶏卵赤玉MSP",
      "個数": "10個入り",
      "価格": "￥237"
    },
    {
      "名前": "マカロニ(セダーニーニー)500G。",
      "個数": "￥138",
      "商品詳細": "021 4901995400119"
    },
    {
      "名前": "JUCO VIA(業)チュータースライサイズ",
      "価格": "￥2418"
    },
    {
      "名前": "協同牛乳 酪農牛乳 1L",
      "価格": "￥199",
      "商品詳細": "021 496168110063"
    },
    {
      "名前": "おかも納豆極小粒ミニ",
      "価格": "￥76"
    },
    {
      "名前": "★協同牛乳",
      "個数": "2418",
      "商品詳細": "021 4901160010143"
    }
  ],
  "税情報": {
    "轻減税率対象": {
      "金額": "1,068",
      "割合": "％",
      "外税買上額": "￥1,068",
      "外税額": "￥85",
      "税額合計": "￥85"
    }
  },
  "合計情報": {
    "金額": {
      "合計": "¥1,153",
      "外税合計": {
        "金額": "1,068",
        "割合": "％",
        "外税合計": "¥1,068",
        "外税合計割合": "％",
        "税額合計": "￥85"
      }
    },
    "お預り": "¥5,000",
    "お釣り": "￥3,847"
  },
  "備考": {
    "内容": "本問保管上のお願い財布・手帳等にはさんで保管頂く場合は、印刷面を内側に折り保管をお願いします"
  },
  "バーコード": {
    "6829": "6821",
    "日付": "11:20"
  }
}

This example demonstrates the model's ability to:

Extract complete store information including registration numbers
Capture all product details with prices and barcodes
Parse complex tax calculations and reduced tax rates
Preserve promotional messages and handling instructions
Maintain exact Japanese formatting and terminology
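Since amounts are preserved exactly as printed (for example "￥1,153" with a full-width yen sign), downstream code usually has to normalize them before doing arithmetic. The helper below is a minimal sketch of one way to do that; the name yen_to_int is hypothetical and it only handles plain yen amounts, returning None for anything else.

```python
import re
import unicodedata


def yen_to_int(value):
    """Convert a printed yen amount such as '￥1,153' to an int.

    NFKC normalization folds full-width symbols and digits to their
    ASCII/half-width forms. Returns None when the string is not a plain amount.
    """
    text = unicodedata.normalize("NFKC", value)
    text = re.sub(r"[¥,\s円]", "", text)
    if not re.fullmatch(r"-?\d+", text):
        return None
    return int(text)


total = yen_to_int("¥1,153")
paid = yen_to_int("¥5,000")
change = yen_to_int("￥3,847")
print(paid - total == change)
```

With the values from the example output, the amount tendered minus the total equals the change, which gives a cheap consistency check on an extraction.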

Training Details

Dataset

Source: Custom collected real-world Japanese receipts
Collection Method: Mobile phone photography in various lighting conditions
Training Split: 1,000 images
Validation Split: 147 images
Receipt Types: Convenience stores, restaurants, supermarkets, department stores
Image Preprocessing: Balanced resize (640x896 target), aspect ratio preservation

Training Configuration

Sequence Length: 1,536 tokens (optimized for Japanese text)
Batch Size: 1 (with gradient accumulation)
Learning Rate: 5e-6 (conservative for stability)
LoRA Configuration:
- Rank (r): 8
- Alpha: 16
- Dropout: 0.1
- Target modules: Attention layers (q_proj, k_proj, v_proj, o_proj)
Training Hardware: Tesla T4 GPU
Training Framework: Unsloth + HuggingFace Transformers
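For readers who want to reproduce the adapter settings listed above, they could be written as a PEFT LoraConfig along the following lines. This is a sketch under assumptions: training was done with Unsloth, whose own wrapper call is not shown in this card, and the constant names below are illustrative. The gradient accumulation step count is not stated in the card, so it is left out.

```python
from peft import LoraConfig

# Values copied from the Training Configuration list in this card.
lora_config = LoraConfig(
    r=8,                # LoRA rank
    lora_alpha=16,      # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)

# Optimizer-side settings from the same list (names are illustrative).
LEARNING_RATE = 5e-6
PER_DEVICE_BATCH_SIZE = 1
MAX_SEQ_LENGTH = 1536
```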

Training Optimizations

Custom image preprocessing for mobile receipt photos
Balanced aspect ratio handling (portrait/landscape)
Even dimension sizing to prevent tokenization issues
Conservative training parameters to avoid NaN issues
Japanese text tokenization optimizations
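The aspect-ratio and even-dimension handling described above can be illustrated with a small size calculation. This is a sketch of one plausible interpretation, not the preprocessing code used for training: the function name balanced_target_size is hypothetical, and it assumes the image is fitted inside a 640x896 box (portrait) or 896x640 box (landscape) and then rounded down to even pixel counts.

```python
def balanced_target_size(width, height, short_side=640, long_side=896):
    """Return (new_width, new_height) that fits the target box, keeps the
    aspect ratio (up to rounding), and has even dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    # Portrait (or square) images use a 640x896 box, landscape an 896x640 box.
    if height >= width:
        box_w, box_h = short_side, long_side
    else:
        box_w, box_h = long_side, short_side

    # Integer arithmetic avoids floating-point rounding surprises.
    if width * box_h >= height * box_w:
        # Width is the limiting side.
        new_w = box_w
        new_h = height * box_w // width
    else:
        # Height is the limiting side.
        new_h = box_h
        new_w = width * box_h // height

    # Round down to even numbers, never below 2 pixels.
    new_w = max(2, new_w - new_w % 2)
    new_h = max(2, new_h - new_h % 2)
    return new_w, new_h


print(balanced_target_size(3000, 4000))
print(balanced_target_size(4000, 3000))
```

The returned size can be passed to an image library's resize call. Note that this sketch also scales small images up to the target box, which may or may not match the original pipeline.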

Performance

Receipt Types Supported

✅ Convenience store receipts (セブンイレブン、ローソン、ファミマ)
✅ Restaurant bills and café receipts
✅ Supermarket and grocery receipts
✅ Department store purchases
✅ Pharmacy and drugstore receipts
✅ Mixed Japanese/English text receipts

Text Recognition Capabilities

Japanese Scripts: Kanji, Hiragana, Katakana
Numerical Data: Prices, quantities, tax calculations
Dates & Times: Various Japanese date formats
Special Characters: Currency symbols (￥), percentages (%)

Usage

Basic Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "your-username/Japanese-Receipt-VL-3B-JSON"
)
processor = AutoProcessor.from_pretrained("your-username/Japanese-Receipt-VL-3B-JSON")

# Load receipt image
image = Image.open("japanese_receipt.jpg")

# Optimized instruction prompt for Japanese receipt extraction
instruct_prompt = """You are an intelligent document parser. Read the following Japanese receipt and extract every piece of information exactly as it appears, and present it in a well-structured JSON format using Japanese keys and values. Please strictly follow these rules: Only extract information that is actually present on the receipt. Do not include any missing, blank, or inferred fields. Do not summarize, omit, translate, or modify any part of the receipt. Every character, number, symbol, and line must be retained exactly as printed. Extract all available content including but not limited to: store details, receipt number, date, time, cashier name, product list, prices, tax breakdowns, payment details, receipt bags, barcodes, notices, and any footer messages. Preserve original formatting such as line breaks, symbols, and full-width characters (hiragana, katakana, kanji, numbers, etc.). Do not perform any translation, correction, interpretation, or reformatting of content. Use only what is present. Output the result in JSON format, using Japanese field names as keys."""

# Prepare input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": instruct_prompt}
        ]
    }
]

# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# generate() returns prompt + completion; keep only the newly generated tokens
generated = outputs[:, inputs["input_ids"].shape[1]:]
result = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(result)
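The decoded result is a string, so it still has to be parsed before use. The helper below is a minimal sketch of a tolerant parser; the name parse_receipt_json is hypothetical. It assumes the model may wrap the JSON in a Markdown code fence or surround it with extra text, and it returns None when no valid JSON object can be recovered, for instance when generation was cut off by max_new_tokens.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker


def parse_receipt_json(raw):
    """Extract and parse the first JSON object found in the model output.

    Returns the parsed dict, or None if the text holds no valid JSON object.
    """
    text = raw.strip()

    # If the output is wrapped in a fenced block, keep only its contents.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()

    # Fall back to the outermost pair of braces.
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None

    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None


print(parse_receipt_json('{"店舗名": "セブンイレブン渋谷店", "合計": 924}'))
```

A None result is a signal to retry with a larger max_new_tokens value or to flag the receipt for manual review.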

Recommended Preprocessing

# `your_preprocessing` is a placeholder for your own module that provides process_receipt_image
from your_preprocessing import process_receipt_image

# Preprocess image for optimal results
processed_image = process_receipt_image(
    "receipt.jpg",
    target_width=640,
    target_height=896
)

Limitations

Current Limitations

Language: Specifically trained for Japanese receipts; maintains original Japanese text in output
Image Quality: Best results with clear, well-lit mobile photos
Receipt Layout: Optimized for standard Japanese retail receipt formats
Handwritten Text: Limited support for handwritten items or annotations
Damaged Receipts: May struggle with torn, folded, or severely faded receipts
Text Preservation: Designed to preserve exact text as printed; does not perform translation or correction

Technical Limitations

Input Resolution: Optimized for mobile-captured images (not high-DPI scans)
Sequence Length: 1,536 token limit may truncate very long receipts
JSON Structure: Fixed schema; doesn't adapt to unusual receipt formats

Ethical Considerations

Privacy & Data Protection

Personal Information: May extract personal data from receipts
Data Handling: Users responsible for complying with privacy regulations
Business Data: Consider data sensitivity in commercial applications

Bias & Fairness

Regional Bias: Trained primarily on specific Japanese retail formats
Store Type Bias: Performance may vary across different business types
Demographic Bias: Dataset reflects specific geographic collection area

Citation

@misc{japanese-receipt-vl-3b-json,
  title={Japanese-Receipt-VL-3B-JSON: Fine-tuned Vision-Language Model for Japanese Receipt OCR},
  author={[Your Name]},
  year={2024},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/your-username/Japanese-Receipt-VL-3B-JSON}},
  note={Trained on Japanese-Mobile-Receipt-OCR-1K dataset}
}

Acknowledgments

Base Model: Qwen2.5-VL-3B by Alibaba Cloud
Training Framework: Unsloth for efficient fine-tuning
Dataset: Japanese-Mobile-Receipt-OCR-1K - Custom collected Japanese receipt images (1,147 samples)
Community: Hugging Face transformers library

Model Card Authors

Sabari Nathan / Couger Inc., Japan

Model Card Contact

[email protected]

Tags: vision-language, japanese, ocr, receipt-processing, json-extraction, qwen2.5-vl, multimodal, fine-tuned
