Is it possible to integrate qwen3vl into the model as CLIP?

#109
by q27916430 - opened

Is it possible to integrate Qwen3-VL into the model as the CLIP (text encoder)? During my testing, I found that many of the issues I encountered were due to unclear descriptions from Qwen2.5-VL. For example, a widely confusing problem is the mix-up of male and female genitalia. It happens precisely because Qwen2.5-VL uniformly refers to both male and female organs as "genitalia" in its descriptions, which leads to the confusion.

From what I gather, this isn't possible, as Qwen Image Edit was designed to work with Qwen 2.5 VL. If anyone finds a way otherwise, let me know!

The genitalia mix-up specifically might be related to the LoRAs, not necessarily the text encoder.
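
For anyone wondering why a straight swap is hard: in the diffusers implementation the conditioning comes from the bundled Qwen2.5-VL encoder, and the image model was trained on that encoder's embeddings. A rough sketch to see this for yourself (pipeline and attribute names assumed from the current diffusers release, untested here):

```python
# Minimal sketch: load Qwen-Image-Edit and inspect its bundled text encoder.
# Assumes a diffusers version that ships QwenImageEditPipeline.
import torch
from diffusers import QwenImageEditPipeline

pipe = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
)

# The DiT was trained against this encoder's hidden states; a Qwen3-VL model
# produces differently shaped/distributed embeddings, so it is not a drop-in
# replacement without retraining or an adapter.
print(type(pipe.text_encoder).__name__)
```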

I trained a LoRA myself to address this issue. I carefully selected images for the training set and used Qwen3-VL for tagging. The LoRA training was very successful: when generating images of males or females individually, whether prompted or not, there was no confusion and no fissures appeared on the scrotum. However, as soon as two people are depicted together, the model becomes completely confused, with no logical pattern to the errors. When I revisited the issue and used Qwen2.5-VL to generate the image descriptions, I discovered this problem: no matter how much I emphasized or restricted the conditions, 2.5-VL refused to change how it described things.
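
For anyone who wants to reproduce this kind of Qwen3-VL tagging pass, here is a minimal sketch (the model ID, prompt, and file path are only placeholders, and it assumes a transformers build recent enough to ship Qwen3-VL support):

```python
# Caption a single dataset image with a Qwen3-VL checkpoint via transformers.
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "Qwen/Qwen3-VL-8B-Instruct"  # placeholder: use whichever Qwen3-VL checkpoint you prefer
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("dataset/0001.png")  # placeholder path
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Write a training caption for this image. "
                                 "Name the anatomy explicitly and unambiguously."},
    ],
}]

# Build the prompt with image placeholders, then pass the actual image alongside it.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
caption = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(caption)
```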

Moreover, the issue with my LoRA is different from others'. It doesn't manifest as fissures on the scrotum, but rather as both figures simultaneously developing male organs or both developing female organs. Currently, I've added a set of comparative images featuring both male and female figures to the dataset to see whether this directly resolves the problem.

Perhaps it would be better to use the Qwen 2.5 VL NSFW model as CLIP? https://huggingface.co/mradermacher/Qwen2.5-VL-7B-NSFW-Caption-V4-GGUF/tree/main?not-for-all-audiences=true

Actually, I have been following this CLIP for a long time, but for reasons I can't figure out, the sampler always reports errors after I load it as the CLIP.

Try https://huggingface.co/Phil2Sat/Qwen-Image-Edit-Rapid-AIO-GGUF/tree/main/Qwen2.5-VL-7B-Instruct-abliterated
It has fixed metadata so it runs with city96's GGUF loader; read the model card for instructions.

You NEED the mmproj file, otherwise there is no VL (vision) capability.
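
If you want to sanity-check what "fixed metadata" means before dropping the files into ComfyUI, the gguf Python package from llama.cpp can dump the header keys (the paths below are placeholders):

```python
# List the metadata keys baked into a GGUF file (e.g. general.architecture),
# which is the kind of information a GGUF loader reads to decide how to treat the model.
from gguf import GGUFReader  # pip install gguf

for path in ("path/to/qwen2.5-vl-7b-instruct-abliterated.gguf", "path/to/mmproj.gguf"):  # placeholders
    reader = GGUFReader(path)
    print(path)
    for key in reader.fields:
        print("   ", key)
```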

I used this model, but it's hard to put into words: its crazy writing style is unacceptable to me. If I use it as the tagger, whether the results can ultimately be reproduced is a very big question.

I could never figure out how to get this to work, and I'd love to find a more capable text encoder for NSFW. I'm not sure whether this abliterated version is any better at this particular genitalia mix-up, though.

Can you share your LoRA just for testing? Is it available somewhere?

Please understand that it's not that I'm unwilling to share my LoRA. This LoRA is merely an experimental product: I used minimal data and a high learning rate to force it to overfit quickly, which makes it unsuitable for practical use. Beyond making male genitalia render smoothly and keeping the anatomy accurate, it serves no other function, and its characters and actions are oversimplified and tend to undermine the model's consistency. That said, based on my experimental results, I believe my approach has potential. I am currently training a more versatile and functional LoRA, and I hope it succeeds. Once it's ready, I will be happy to share it.
