Qwen/Qwen3-VL-32B-Instruct · GGUFs have buggy Japanese OCR performance

7 days ago

•

Edit: The latest llama.cpp seems to have fixed this.

I'm posting this here instead of a specific GGUF repo as this affects all GGUFs.

I've been testing the Qwen3 VL 32B GGUFs with Japanese OCR tasks and they've been having issues while the same model on Qwen Chat is able to OCR the text correctly. Here's an example to demonstrate:

Prompt:
OCR the text in this image.

Edit: Try the below prompt instead, it makes the difference more apparent
Transcribe the text in this image.

First, the correct OCR from Qwen3 VL 32B on Qwen Chat:

ざこ
ざこ
ざこざこ
くだらない存在
あわれだね

However, when the GGUF conversion of the model is given the same task it produces incorrect OCR for the first 2 text bubbles. There are a few characters it might say it is instead, but they're all wrong.

Qwen3 VL 32B Instruct:

ごめん
ごめん
ざこざこ
くだらない存在
あわれだね

or

ご
ご
ざこざこ
くだらない存在
あわれだね

Giving the same task to Qwen3 VL 32B Thinking is even more revealing:

So, let's look at the image. It's a pink background with white speech bubbles containing Japanese text. Let's read each bubble one by one.

First bubble: ごき (but wait, the characters are ごき? Wait no, let's check. The first bubble has ごき? Wait, no, the characters are ごき? Wait, no, looking at the image: first bubble is ごき? Wait, no, the first two characters are ご and き? Wait, no, let's see. Wait, the first bubble: the text is ごき? Wait, no, the first bubble's text is ごき? Wait, no, the first bubble has two characters: ご and き? Wait, no, let's check again.

Wait, the first speech bubble: the text is ごき? Wait, no, the first bubble: the characters are ご and き, but written as ごき? Wait, no, in the image, the first bubble has ごき? Wait, no, let's look carefully. The first bubble: the text is ごき? Wait, no, the first two characters are ご and き, but maybe it's ごき. Wait, no, let's list each bubble:

1. First bubble: ごき (but wait, the order: in Japanese, the characters are written in the bubble. Let's check each bubble's text.

First bubble: ごき → but wait, no, the first bubble's text is ごき? Wait, no, looking at the image, the first speech bubble has ご and き, so ごき. Wait, no, the first bubble: the characters are ご and き, so ごき. Wait, but let's check the actual image. The first bubble: the text is ごき? Wait, no, the first bubble's text is ごき. Wait, no, the first two characters: ご and き, so ごき. Wait, no, the first bubble's text is

As you can see in the reasoning block the model is really struggling to parse the first characters and loops infinitely.

Key Points:

I've also tested GGUFs of other models in the Qwen3 VL series, such as 30B-A3B and 8B dense. They're able to OCR the text perfectly, so this is likely an issue specific to the 32B version.
This occurs on all GGUFs by all GGUF providers (both text model and mmproj), including those I converted myself. I even tried to use different llama.cpp binaries, including official releases and those I built myself. There is no improvement. So this is likely an issue further up.
All testing was done with Q8_0 quants (f16/bf16 mmproj) except for once when I downloaded unsloth's Q5_XL quant to see if that helped (it didn't).
I used a variety of sampler settings when testing.
I only tested Japanese OCR, but my intuition is that if there are issues with this then there are probably issues with other things as well.

Those are my findings, and I hope they are able to help the Qwen team in some way. This is a pretty quick test to perform, so I encourage others to try it and see for themselves.

TPH441

6 days ago

Update:

I've tested the model with a bf16 GGUF + bf16 mmproj, which should be lossless. It still failed to OCR the first 2 text bubbles correctly and made the same errors, so this is not an error induced by quantization.

danielhanchen

Qwen org 6 days ago

Just tried your prompt and image in Qwenchat 3 times and they all got the same results as the results you've been getting for the GGUFs. You may have been lucky to get the correct result that time. The model seems to be just like that

CC: @TPH441

TPH441

6 days ago

Thank you for testing! @danielhanchen

I tested it on qwen chat a bit more, and out of 10 attempts it got it right 2 times and wrong 8 times. Example of it getting it right:

I think there is still a bit of a difference though, since the version on qwen chat is able to get it right sometimes while I have not been able to get a gguf to get it right a single time despite probably giving it 50 prompts at this point across ggufs from different providers.

Also I tested the thinking version on qwen chat and it got it right 4/5 times, while the gguf I tested simply looped.

To be fair though I did the testing on the thinking version before you released your thinking chat template fixes. So I'll test those quants at q8_0 and report back.

TPH441

6 days ago

I tested unsloth's Qwen3-VL-32B-Thinking quant at Q8_0 (bf16 mmproj) using the recommended settings in their docs. No system prompt.

Good news: It no longer loops!
Bad news:

...it got it right 0/5 times.

MrDragonFox

6 days ago

fp16 local on vllm checkpoint from about 9h ago post merge

TPH441

6 days ago

•

edited 6 days ago

@MrDragonFox Neat! That looks correct to me.

Also, I looked through my old Qwen Chat history. I tested it much more a couple of days ago, although I used a slightly different prompt then.

Back then the instruct 32B got it right 16/18 times.

I tested it again 3 times on qwen chat and it got it right 3/3 times, so it seems like that one word change in the prompt makes a huge difference.
On qwen chat at least. Locally on Q8_0 it still gets it wrong even with the prompt difference.

Edited OP to reflect this prompt instead as it makes the difference more apparent.

TPH441

6 days ago

Latest llama.cpp seems to have fixed it!

shimmyshimmer

Qwen org 6 days ago

Latest llama.cpp seems to have fixed it!

Oh fantastic! This doesn't mean we have to reconvert ggufs right?

TPH441

6 days ago

•

edited 5 days ago

@shimmyshimmer The correct OCR above was produced on a gguf I converted before qwen3 support was even officially merged. A newly converted qwen3 vl gguf also had the same hash as a gguf I converted days ago, so no you don't need to reconvert ggufs.