Japanese-Receipt-VL-3B-JSON
Model Description
Japanese-Receipt-VL-3B-JSON is a vision-language model fine-tuned from Qwen2.5-VL-3B-Instruct for Japanese receipt OCR and structured data extraction. It takes receipt photos captured on a mobile phone and outputs structured JSON containing store information, itemized purchases, tax calculations, and payment details.
Dataset: Trained on the Japanese-Mobile-Receipt-OCR-1K dataset.
Model Details
- Model Name: Japanese-Receipt-VL-3B-JSON
- Dataset Name: Japanese-Mobile-Receipt-OCR-1K
- Base Model: Qwen/Qwen2.5-VL-3B-Instruct
- Model Type: Vision-Language Model (Multimodal)
- Language: Japanese (preserves original text exactly as printed)
- License: Same as base model
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- Training Images: 1,000 real-world Japanese receipt images
- Total Dataset: 1,147 collected images
- Output Format: Structured JSON with Japanese keys
- Extraction Approach: Comprehensive, exact preservation without translation or interpretation
Intended Use
Primary Use Cases
- Japanese retail receipt digitization
- Expense tracking and management systems
- Business accounting automation
- Mobile receipt scanning applications
- E-commerce and point-of-sale integrations
Input
- Mobile phone-captured images of Japanese receipts
- Supported formats: JPG, PNG
- Optimal resolution: 640x896 px (portrait) or 896x640 px (landscape)
Output
Structured JSON containing:
{
  "店舗名": "セブンイレブン渋谷店",
  "日付": "2024年01月15日",
  "時刻": "14:30",
  "レシートNo": "0001234",
  "商品リスト": [
    {
      "商品名": "おにぎり鮭",
      "数量": 1,
      "単価": 128,
      "金額": 128
    }
  ],
  "小計": 840,
  "消費税": 84,
  "合計": 924,
  "支払方法": "現金",
  "お預り": 1000,
  "お釣り": 76
}
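Because the output carries subtotal, tax, total, and payment fields, a downstream system can cross-check the arithmetic before trusting an extraction. The sketch below is a minimal example that assumes the flat schema and integer values shown above; `check_receipt_totals` is a hypothetical helper, not something shipped with the model. It does not compare the item list against the subtotal, since the example lists only one of the purchased items.

```python
import json


def check_receipt_totals(receipt: dict) -> list:
    """Return a list of human-readable inconsistencies (empty if none found)."""
    problems = []

    # Line items: 数量 (quantity) x 単価 (unit price) should equal 金額 (amount)
    for item in receipt.get("商品リスト", []):
        if {"数量", "単価", "金額"} <= item.keys():
            if item["数量"] * item["単価"] != item["金額"]:
                problems.append(f"line item mismatch: {item.get('商品名')}")

    # 小計 (subtotal) + 消費税 (tax) should equal 合計 (total)
    if {"小計", "消費税", "合計"} <= receipt.keys():
        if receipt["小計"] + receipt["消費税"] != receipt["合計"]:
            problems.append("subtotal + tax != total")

    # お預り (tendered) - 合計 (total) should equal お釣り (change)
    if {"お預り", "合計", "お釣り"} <= receipt.keys():
        if receipt["お預り"] - receipt["合計"] != receipt["お釣り"]:
            problems.append("tendered - total != change")

    return problems


sample = json.loads("""
{
  "商品リスト": [{"商品名": "おにぎり鮭", "数量": 1, "単価": 128, "金額": 128}],
  "小計": 840, "消費税": 84, "合計": 924, "お預り": 1000, "お釣り": 76
}
""")
print(check_receipt_totals(sample))  # prints []
```

Checks are skipped when a field is absent, which matches the rule that the model only emits fields actually printed on the receipt.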
Prompting Guidelines
Optimized Instruction Prompt
This model was trained with a specific instruction prompt that ensures comprehensive and accurate extraction. For best results, use the exact prompt shown under Basic Usage in the Usage section.
Key Prompt Features
- Comprehensive extraction: Captures all visible information including store details, timestamps, itemized products, tax calculations, and payment information
- Exact preservation: Maintains original Japanese text, formatting, and symbols without translation or modification
- JSON structure: Uses Japanese field names as keys for cultural and linguistic consistency
- No inference: Only extracts explicitly visible information, avoiding assumptions or corrections
- Complete coverage: Includes often-missed elements like receipt numbers, cashier names, footer messages, and promotional text
Sample Results
Example Input
Example Output
{
  "領収書": {
    "店舗情報": [
      {
        "店名": "業務スーパー河内屋",
        "場所": "青葉台店",
        "電話": "045-985-9603"
      }
    ]
  },
  "登録番号": "T7011701002269",
  "メッセージ": {
    "内容": "プロの品質とプロの価格 業務スーパーでは毎日のお買い得となっています。"
  },
  "発行日": {
    "月日年分類": "72025年 7月12日",
    "担当者": "土田"
  },
  "商品情報": [
    {
      "名前": "鶏卵赤玉MSP",
      "個数": "10個入り",
      "価格": "¥237"
    },
    {
      "名前": "マカロニ(セダーニーニー)500G。",
      "個数": "¥138",
      "商品詳細": "021 4901995400119"
    },
    {
      "名前": "JUCO VIA(業)チュータースライサイズ",
      "価格": "¥2418"
    },
    {
      "名前": "協同牛乳 酪農牛乳 1L",
      "価格": "¥199",
      "商品詳細": "021 496168110063"
    },
    {
      "名前": "おかも納豆極小粒ミニ",
      "価格": "¥76"
    },
    {
      "名前": "★協同牛乳",
      "個数": "2418",
      "商品詳細": "021 4901160010143"
    }
  ],
  "税情報": {
    "轻減税率対象": {
      "金額": "1,068",
      "割合": "%",
      "外税買上額": "¥1,068",
      "外税額": "¥85",
      "税額合計": "¥85"
    }
  },
  "合計情報": {
    "金額": {
      "合計": "¥1,153",
      "外税合計": {
        "金額": "1,068",
        "割合": "%",
        "外税合計": "¥1,068",
        "外税合計割合": "%",
        "税額合計": "¥85"
      }
    },
    "お預り": "¥5,000",
    "お釣り": "¥3,847"
  },
  "備考": {
    "内容": "本問保管上のお願い財布・手帳等にはさんで保管頂く場合は、印刷面を内側に折り保管をお願いします"
  },
  "バーコード": {
    "6829": "6821",
    "日付": "11:20"
  }
}
This example demonstrates the model's ability to:
- Extract complete store information including registration numbers
- Capture all product details with prices and barcodes
- Parse complex tax calculations and reduced tax rates
- Preserve promotional messages and handling instructions
- Maintain exact Japanese formatting and terminology
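In the sample output, amounts come back as printed strings such as "¥1,153" rather than numbers, which follows from the exact-preservation rule. A consumer that needs numeric values has to convert them itself. The following is a minimal sketch, assuming amounts consist of an optional yen sign, digits, thousands separators, and an optional trailing 円; `parse_yen` is a hypothetical helper and deliberately returns `None` for anything else (for example "10個入り" or "%") instead of guessing.

```python
import re
import unicodedata
from typing import Optional

# Optional yen sign, a signed digit group with commas, optional trailing 円
_AMOUNT = re.compile(r"¥?\s*(-?[0-9][0-9,]*)\s*円?")


def parse_yen(text: str) -> Optional[int]:
    """Convert a printed amount such as '¥1,153' to an int, or None if it is not an amount."""
    # NFKC folds full-width digits, commas, and the full-width yen sign to their basic forms
    normalized = unicodedata.normalize("NFKC", text).strip()
    match = _AMOUNT.fullmatch(normalized)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# Cross-check the totals from the sample output
taxable = parse_yen("¥1,068")
tax = parse_yen("¥85")
total = parse_yen("¥1,153")
print(taxable + tax == total)                                   # prints True
print(parse_yen("¥5,000") - total == parse_yen("¥3,847"))       # prints True
```

Returning `None` for non-amount strings makes misplaced values visible, such as a price that ended up in a quantity field, rather than silently turning them into numbers.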
Training Details
Dataset
- Source: Custom collected real-world Japanese receipts
- Collection Method: Mobile phone photography in various lighting conditions
- Training Split: 1,000 images
- Validation Split: 147 images
- Receipt Types: Convenience stores, restaurants, supermarkets, department stores
- Image Preprocessing: Balanced resize (640x896 target), aspect ratio preservation
Training Configuration
- Sequence Length: 1,536 tokens (optimized for Japanese text)
- Batch Size: 1 (with gradient accumulation)
- Learning Rate: 5e-6 (conservative for stability)
- LoRA Configuration:
  - Rank (r): 8
  - Alpha: 16
  - Dropout: 0.1
  - Target modules: Attention layers (q_proj, k_proj, v_proj, o_proj)
- Training Hardware: Tesla T4 GPU
- Training Framework: Unsloth + HuggingFace Transformers
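The card does not include the training script, so the following is only a sketch of how the LoRA values listed above could be written down in code. It assumes the PEFT library's `LoraConfig`; the actual run used Unsloth, whose wrapper may take these values differently. `LORA_SETTINGS`, `lora_scaling`, and `build_peft_config` are illustrative names.

```python
# LoRA hyperparameters as listed in this card
LORA_SETTINGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}


def lora_scaling(settings: dict) -> float:
    """Standard LoRA scales the low-rank update by alpha / r."""
    return settings["lora_alpha"] / settings["r"]


def build_peft_config(settings: dict = LORA_SETTINGS):
    """Build a PEFT LoraConfig from the settings above.

    peft is imported lazily so this module still loads when peft is not installed.
    """
    from peft import LoraConfig  # assumption: peft is available in the environment

    return LoraConfig(**settings)


print(lora_scaling(LORA_SETTINGS))  # prints 2.0
```

With rank 8 and alpha 16, the adapter update is scaled by 2.0, a fairly common pairing when the learning rate is kept low as it is here.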
Training Optimizations
- Custom image preprocessing for mobile receipt photos
- Balanced aspect ratio handling (portrait/landscape)
- Even dimension sizing to prevent tokenization issues
- Conservative training parameters to avoid NaN issues
- Japanese text tokenization optimizations
Performance
Receipt Types Supported
- ✅ Convenience store receipts (セブンイレブン、ローソン、ファミマ)
- ✅ Restaurant bills and café receipts
- ✅ Supermarket and grocery receipts
- ✅ Department store purchases
- ✅ Pharmacy and drugstore receipts
- ✅ Mixed Japanese/English text receipts
Text Recognition Capabilities
- Japanese Scripts: Kanji, Hiragana, Katakana
- Numerical Data: Prices, quantities, tax calculations
- Dates & Times: Various Japanese date formats
- Special Characters: Currency symbols (¥), percentages (%)
Usage
Basic Usage
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "your-username/Japanese-Receipt-VL-3B-JSON"
)
processor = AutoProcessor.from_pretrained("your-username/Japanese-Receipt-VL-3B-JSON")

# Load receipt image
image = Image.open("japanese_receipt.jpg")
# Optimized instruction prompt for Japanese receipt extraction
instruct_prompt = """You are an intelligent document parser. Read the following Japanese receipt and extract every piece of information exactly as it appears, and present it in a well-structured JSON format using Japanese keys and values. Please strictly follow these rules: Only extract information that is actually present on the receipt. Do not include any missing, blank, or inferred fields. Do not summarize, omit, translate, or modify any part of the receipt. Every character, number, symbol, and line must be retained exactly as printed. Extract all available content including but not limited to: store details, receipt number, date, time, cashier name, product list, prices, tax breakdowns, payment details, receipt bags, barcodes, notices, and any footer messages. Preserve original formatting such as line breaks, symbols, and full-width characters (hiragana, katakana, kanji, numbers, etc.). Do not perform any translation, correction, interpretation, or reformatting of content. Use only what is present. Output the result in JSON format, using Japanese field names as keys."""
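
# Prepare input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": instruct_prompt}
        ]
    }
]

# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
# Long receipts may need more than 512 new tokens; raise the limit if the JSON is cut off
outputs = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so that only the newly generated text is decoded
generated = outputs[:, inputs["input_ids"].shape[1]:]
result = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(result)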
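The decoded `result` is plain text, so it still has to be parsed before it can be used as data. The sketch below assumes the model may wrap its answer in a Markdown code fence or add surrounding text; that is an assumption, not documented behavior of this model. `extract_json` is a hypothetical helper.

```python
import json
import re

# Built with multiplication so this example contains no literal code fence
FENCE = "`" * 3


def extract_json(text: str) -> dict:
    """Parse the first JSON object found in model output text."""
    # If the answer is wrapped in a fenced block, keep only the fenced part
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else text

    start = candidate.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")

    # raw_decode parses one JSON value and ignores any text after it
    obj, _ = json.JSONDecoder().raw_decode(candidate[start:])
    return obj


raw_output = FENCE + 'json\n{"合計": 924, "支払方法": "現金"}\n' + FENCE
receipt = extract_json(raw_output)
print(receipt["合計"])  # prints 924
```

If generation stops early, for example when the token limit is reached, the JSON is incomplete and `raw_decode` raises `json.JSONDecodeError`, a subclass of `ValueError`. Catching `ValueError` therefore covers both failure cases.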
Recommended Preprocessing
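# `your_preprocessing` is a placeholder module name: supply your own helper that
# resizes the photo to the 640x896 training target while keeping its aspect ratio
from your_preprocessing import process_receipt_image

# Preprocess image for optimal results
processed_image = process_receipt_image(
    "receipt.jpg",
    target_width=640,
    target_height=896
)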
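The card does not publish `process_receipt_image`, so the sketch below shows one way such a helper could look. It is based only on the preprocessing described under Training Details (640x896 target, aspect ratio preserved, even dimensions). It assumes Pillow is installed; the rounding rule, the EXIF rotation step, and the helper name `balanced_target_size` are assumptions rather than the author's implementation.

```python
def balanced_target_size(width, height, short_side=640, long_side=896):
    """Return (new_width, new_height) that fits the target box, keeps aspect ratio, and is even."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Portrait images fit a 640x896 box, landscape images an 896x640 box
    if height >= width:
        box_w, box_h = short_side, long_side
    else:
        box_w, box_h = long_side, short_side
    scale = min(box_w / width, box_h / height)
    # Round to the nearest pixel, then floor to an even number (minimum 2)
    new_w = max(2, round(width * scale) // 2 * 2)
    new_h = max(2, round(height * scale) // 2 * 2)
    return new_w, new_h


def process_receipt_image(path, target_width=640, target_height=896):
    """Load a receipt photo and resize it to the balanced target size."""
    from PIL import Image, ImageOps  # assumption: Pillow is installed

    # Phone photos often store rotation in EXIF; apply it before measuring
    image = ImageOps.exif_transpose(Image.open(path)).convert("RGB")
    size = balanced_target_size(
        image.width,
        image.height,
        short_side=min(target_width, target_height),
        long_side=max(target_width, target_height),
    )
    return image.resize(size, Image.LANCZOS)


print(balanced_target_size(3000, 4000))  # prints (640, 852)
print(balanced_target_size(4000, 3000))  # prints (852, 640)
```

Note that this version also upscales images smaller than the target box; cap `scale` at 1.0 if that is not wanted.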
Limitations
Current Limitations
- Language: Specifically trained for Japanese receipts; maintains original Japanese text in output
- Image Quality: Best results with clear, well-lit mobile photos
- Receipt Layout: Optimized for standard Japanese retail receipt formats
- Handwritten Text: Limited support for handwritten items or annotations
- Damaged Receipts: May struggle with torn, folded, or severely faded receipts
- Text Preservation: Designed to preserve exact text as printed; does not perform translation or correction
Technical Limitations
- Input Resolution: Optimized for mobile-captured images (not high-DPI scans)
- Sequence Length: 1,536 token limit may truncate very long receipts
- JSON Structure: Fixed schema; doesn't adapt to unusual receipt formats
Ethical Considerations
Privacy & Data Protection
- Personal Information: May extract personal data from receipts
- Data Handling: Users responsible for complying with privacy regulations
- Business Data: Consider data sensitivity in commercial applications
Bias & Fairness
- Regional Bias: Trained primarily on specific Japanese retail formats
- Store Type Bias: Performance may vary across different business types
- Demographic Bias: Dataset reflects specific geographic collection area
Citation
@misc{japanese-receipt-vl-3b-json,
  title={Japanese-Receipt-VL-3B-JSON: Fine-tuned Vision-Language Model for Japanese Receipt OCR},
  author={[Your Name]},
  year={2024},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/your-username/Japanese-Receipt-VL-3B-JSON}},
  note={Trained on Japanese-Mobile-Receipt-OCR-1K dataset}
}
Acknowledgments
- Base Model: Qwen2.5-VL-3B by Alibaba Cloud
- Training Framework: Unsloth for efficient fine-tuning
- Dataset: Japanese-Mobile-Receipt-OCR-1K - Custom collected Japanese receipt images (1,147 samples)
- Community: Hugging Face transformers library
Model Card Authors
Sabari Nathan / Couger Inc., Japan
Model Card Contact
Tags: vision-language, japanese, ocr, receipt-processing, json-extraction, qwen2.5-vl, multimodal, fine-tuned
