Could you please share the script for doing SVD distillation?
https://gist.github.com/StableFluffy/cfa24ce7d93e3c6b0d55d08b12f6f55c
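For anyone skimming the gist: the core of this kind of SVD distillation is extracting a low-rank LoRA from the weight delta between two same-shaped matrices via truncated SVD. A minimal sketch of the general idea (the function name and default rank are illustrative, not taken from the gist):

```python
import torch

def extract_lora_from_delta(w_teacher: torch.Tensor,
                            w_student: torch.Tensor,
                            rank: int = 64):
    """Approximate (w_teacher - w_student) with a rank-`rank` LoRA pair.

    Returns (lora_A, lora_B) such that lora_B @ lora_A ~= delta,
    matching the usual W' = W + (alpha / r) * B @ A merge convention.
    """
    delta = (w_teacher - w_student).float()          # [out_features, in_features]
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]   # keep the top-`rank` components
    sqrt_S = torch.diag(S.sqrt())                    # split singular values between A and B
    lora_B = U @ sqrt_S                              # [out_features, rank]
    lora_A = sqrt_S @ Vh                             # [rank, in_features]
    return lora_A, lora_B
```

With alpha set equal to the rank, merging back is simply `w_student + lora_B @ lora_A`.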
I'm curious how the model in the original post improved performance, as your post suggests that the original distillation should be ineffective.
https://www.reddit.com/r/LocalLLaMA/comments/1o3kb3o/real_svd_glm45airglm46distill/
https://www.reddit.com/r/LocalLLaMA/comments/1o0st2o/basedbaseqwen3coder30ba3binstruct480bdistillv2_is
I got scammed
It seems I was also deceived, but there should be some parts of the code itself that are worth learning.
Yeah, I think so too. It seems plausible.
Anyway, I will try to run a benchmark tomorrow to make sure that my MOE to Dense Distillation rewrite based on the original code actually works.
I think your implementation is almost identical to the original, maybe I'm mistaken?
It is almost identical, with a few changes to the target layers and GPU optimization.
I failed to reproduce the issue: my script is functional, and instead of producing a meaningless LoRA or a broken model, I discovered a few ridiculous points:
Ironically, the original author's method might have worked for dense models, with mistakes only in the LoRA naming and merging, but he has already been forced off the internet.
Continuing from above: the errors are that the LoRA merge script is not valid and the LoRA naming is problematic (see the merge sketch below this reply).
I apologize to the original author; I reproduced it too late.
Here is the resulting model; you can check the tensor differences against the original student model.
https://huggingface.co/DMETEST/Qwen3-4B-Instruct-2507-distil-KAT-Dev-72B-Exp
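For reference, and not as the original author's exact code: a correct merge applies W' = W + (alpha / r) * B @ A per layer, and the adapter keys have to map cleanly onto the student's weight keys. A rough sketch, assuming PEFT-style key names:

```python
import torch

def merge_lora_into_student(student_state: dict,
                            lora_state: dict,
                            alpha: float,
                            rank: int) -> dict:
    """Apply W' = W + (alpha / rank) * B @ A for every LoRA pair.

    Assumes PEFT-style adapter keys, e.g.
      base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight
      base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight
    mapping onto the student key
      model.layers.0.self_attn.q_proj.weight
    """
    scale = alpha / rank
    merged = dict(student_state)
    for key, lora_A in lora_state.items():
        if not key.endswith("lora_A.weight"):
            continue
        lora_B = lora_state[key.replace("lora_A", "lora_B")]
        # Strip the adapter prefix/suffix to recover the student weight key.
        target = key.replace("base_model.model.", "").replace(".lora_A.weight", ".weight")
        w = merged[target]
        merged[target] = (w.float() + scale * (lora_B.float() @ lora_A.float())).to(w.dtype)
    return merged
```

If the adapter tensors are named inconsistently, a lookup like this silently misses layers, which would produce a "merged" model that barely differs from the student.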
Interesting, can you share the script?
These still have bugs, but I'm not in the mood to fix them. Could you please help?
For example, the LoRA rank cannot be correctly obtained from the student model (a small oversight on my part); see the rank sketch below.
https://github.com/win10ogod/LLM-distillation-scripts
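On the rank issue: the rank is not stored in the student checkpoint at all; it can be read from the adapter itself, either from adapter_config.json or from the first dimension of any lora_A tensor. A rough sketch, assuming a standard PEFT adapter layout:

```python
import json
from safetensors.torch import load_file

def get_lora_rank(adapter_dir: str) -> int:
    """Read the LoRA rank from a PEFT-style adapter directory."""
    try:
        # Standard PEFT adapters store the rank under "r" in adapter_config.json.
        with open(f"{adapter_dir}/adapter_config.json") as f:
            return json.load(f)["r"]
    except (FileNotFoundError, KeyError):
        # Fall back to the tensor shapes: lora_A is [rank, in_features].
        tensors = load_file(f"{adapter_dir}/adapter_model.safetensors")
        for name, tensor in tensors.items():
            if "lora_A" in name:
                return tensor.shape[0]
        raise ValueError("No lora_A tensor found in adapter")
```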