Could you please share the script for doing SVD distillation?
https://gist.github.com/StableFluffy/cfa24ce7d93e3c6b0d55d08b12f6f55c
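For anyone skimming the gist: the core of this kind of SVD distillation is extracting a low-rank LoRA from the weight delta between two same-shaped matrices via truncated SVD. A minimal sketch of the general idea (the function name and default rank are illustrative, not taken from the gist):

```python
import torch

def extract_lora_from_delta(w_teacher: torch.Tensor,
                            w_student: torch.Tensor,
                            rank: int = 64):
    """Approximate (w_teacher - w_student) with a rank-`rank` LoRA pair.

    Returns (lora_A, lora_B) such that lora_B @ lora_A ~= delta,
    matching the usual W' = W + (alpha / r) * B @ A merge convention.
    """
    delta = (w_teacher - w_student).float()          # [out_features, in_features]
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]   # keep the top-`rank` components
    sqrt_S = torch.diag(S.sqrt())                    # split singular values between A and B
    lora_B = U @ sqrt_S                              # [out_features, rank]
    lora_A = sqrt_S @ Vh                             # [rank, in_features]
    return lora_A, lora_B
```

With alpha set equal to the rank, merging back is simply `w_student + lora_B @ lora_A`.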
I'm curious how the model in the original post improved performance, as your post suggests that the original distillation should be ineffective.
https://www.reddit.com/r/LocalLLaMA/comments/1o3kb3o/real_svd_glm45airglm46distill/
https://www.reddit.com/r/LocalLLaMA/comments/1o0st2o/basedbaseqwen3coder30ba3binstruct480bdistillv2_is
I got scammed
It seems I was also deceived, but there should be some parts of the code itself that are worth learning.
Yeah, I think so too. It seems plausible.
Anyway, I will try to run a benchmark tomorrow to make sure that my MOE to Dense Distillation rewrite based on the original code actually works.
I think your implementation is almost identical to the original, maybe I'm mistaken?
It is almost identical, with a few changes to the target layers and GPU optimization.
I failed to reproduce the issue: my script is functional, and instead of producing a meaningless LoRA or a broken model, I discovered a few ridiculous points:
Ironically, the original author's method might have worked for dense models, with mistakes only in the LoRA naming and merging, but he has already been forced off the internet.
Continuing from above: the errors are that the LoRA merge script is not valid and the LoRA naming is problematic (see the merge sketch below this reply).
I apologize to the original author; I reproduced it too late.
Here is the resulting model; you can check the tensor differences against the original student model.
https://huggingface.co/DMETEST/Qwen3-4B-Instruct-2507-distil-KAT-Dev-72B-Exp
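For reference, and not as the original author's exact code: a correct merge applies W' = W + (alpha / r) * B @ A per layer, and the adapter keys have to map cleanly onto the student's weight keys. A rough sketch, assuming PEFT-style key names:

```python
import torch

def merge_lora_into_student(student_state: dict,
                            lora_state: dict,
                            alpha: float,
                            rank: int) -> dict:
    """Apply W' = W + (alpha / rank) * B @ A for every LoRA pair.

    Assumes PEFT-style adapter keys, e.g.
      base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight
      base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight
    mapping onto the student key
      model.layers.0.self_attn.q_proj.weight
    """
    scale = alpha / rank
    merged = dict(student_state)
    for key, lora_A in lora_state.items():
        if not key.endswith("lora_A.weight"):
            continue
        lora_B = lora_state[key.replace("lora_A", "lora_B")]
        # Strip the adapter prefix/suffix to recover the student weight key.
        target = key.replace("base_model.model.", "").replace(".lora_A.weight", ".weight")
        w = merged[target]
        merged[target] = (w.float() + scale * (lora_B.float() @ lora_A.float())).to(w.dtype)
    return merged
```

If the adapter tensors are named inconsistently, a lookup like this silently misses layers, which would produce a "merged" model that barely differs from the student.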
Interesting, can you share the script?
These still have bugs, but I'm not in the mood to fix them. Could you please help?
For example, the LoRA rank cannot be correctly obtained from the student model (a small oversight on my part); see the rank sketch below.
https://github.com/win10ogod/LLM-distillation-scripts
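On the rank issue: the rank is not stored in the student checkpoint at all; it can be read from the adapter itself, either from adapter_config.json or from the first dimension of any lora_A tensor. A rough sketch, assuming a standard PEFT adapter layout:

```python
import json
from safetensors.torch import load_file

def get_lora_rank(adapter_dir: str) -> int:
    """Read the LoRA rank from a PEFT-style adapter directory."""
    try:
        # Standard PEFT adapters store the rank under "r" in adapter_config.json.
        with open(f"{adapter_dir}/adapter_config.json") as f:
            return json.load(f)["r"]
    except (FileNotFoundError, KeyError):
        # Fall back to the tensor shapes: lora_A is [rank, in_features].
        tensors = load_file(f"{adapter_dir}/adapter_model.safetensors")
        for name, tensor in tensors.items():
            if "lora_A" in name:
                return tensor.shape[0]
        raise ValueError("No lora_A tensor found in adapter")
```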