WAIDWML - What Am I Doing With My Life?
(8 Phi-4s in a trenchcoat)
Rationale
So there I was, finding some inspiration to tune stuff but lacking the disposable funds to do anything with the larger models. Enter Phi-4, a model designed for productivity... Initially it was just going to be a sequential series of finetunes, starting from the baseline Phi-4 and gradually adding more datasets until I either got bored or it got good, but then I had an idea: what if I just MoE'd it?
Yeah.
As a proof of concept, this wasn't too bad. The end result is... interesting, to say the least.
Training
As mentioned above, this was done in "phases", each with a separate dataset. Most were done with a max_seq_length of 32k; a few were dropped to 16k to make sure they fit on the hardware.
The lr was all over the place, but in general somewhere between 1e-5 and 4e-6. These were all separate LoRAs using r=64 and alpha=32 with rsLoRA enabled. Epochs were 2 or 3 for everything except c2, as that'd take far too long.
- p1: Private RP dataset (RPT-Varied-Small)
- p2: TheDrummer/AmoralQA-v2
- p3: AIRRC/Eudaimonic
- p4: Two private RP datasets (cc-gpt4-sfw-sharegpt & cc-gpt4-nsfw-sharegpt)
- p5: A random subset of the infamous "c2"-logs dataset, cleaned and deduped (approx. 30%)
- p6: Private RP dataset (RPT-Varied-Small_v1.5)
- p7: NewEden/PIPPA-Mega-Filtered
- p8: Squish42/bluemoon-fandom-1-1-rp-cleaned
(Note: the RPT-Varied-Small and RPT-Varied-Small_v1.5 datasets are due to be released after I manually verify their fitness.)
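For anyone wondering what rsLoRA actually changes with r=64 and alpha=32, here is a small arithmetic sketch. It is not the training code, just the two scaling formulas side by side: standard LoRA scales the low-rank update by alpha / r, while rsLoRA scales it by alpha / sqrt(r). The helper name `lora_scale` is made up for illustration.

```python
import math


def lora_scale(alpha: float, r: int, use_rslora: bool) -> float:
    """Scaling factor applied to the low-rank update (hypothetical helper)."""
    # Standard LoRA divides alpha by the rank; rsLoRA divides by sqrt(rank).
    return alpha / math.sqrt(r) if use_rslora else alpha / r


r, alpha = 64, 32  # the values stated above

print(lora_scale(alpha, r, use_rslora=False))  # 0.5
print(lora_scale(alpha, r, use_rslora=True))   # 4.0
```

So at the same r and alpha, the adapter update is weighted 8x more heavily with rsLoRA than without it.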
Once all LoRAs were trained, I merged each of them separately into the base model, then used mergekit (config) to "merge" the results into a MoE. I chose to initialize the router randomly as I was going to train that part later. After that, I trained the routing layers for 8 epochs with lr = 1e-6 and grimulkan/LimaRP-augmented as the dataset. It took roughly 8.5 hours on a 6xA40 instance on RunPod.
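To make "routing layers" a bit more concrete, here is a toy sketch of a randomly initialized router picking experts for one token. Everything in it is an assumption for illustration: the tiny hidden size, the init standard deviation, and top-2 routing are not taken from the actual config, and the function names are invented. It only shows the general idea of a linear gate followed by a softmax and a top-k pick.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def init_router(hidden_size, num_experts, seed=0, std=0.02):
    """Random gate weights, one row per expert (std is an assumed value)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(hidden_size)]
            for _ in range(num_experts)]


def route(hidden, router, top_k=2):
    """Return (expert_index, weight) pairs for the top_k experts."""
    # One logit per expert: dot product of the gate row with the token state.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


router = init_router(hidden_size=16, num_experts=8)  # 8 experts, as in this model
token = [0.1 * i for i in range(16)]                 # stand-in hidden state
picks = route(token, router)
print(picks)
```

With random gate weights the picks are essentially arbitrary, which is why the router needs its own training pass afterwards.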
Recommended Settings
Phi-4 format. What I used for my tests:
- Temp 1
- minP 0.05
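If minP is unfamiliar, here is a minimal sketch of the idea in plain Python: drop every token whose probability is below min_p times the top token's probability, then sample from what is left. This is a simplified illustration and not any particular backend's implementation; the order in which temperature and min-p are applied differs between samplers, and the function name and example logits are made up.

```python
import math
import random


def min_p_sample(logits, temperature=1.0, min_p=0.05, rng=random):
    """Sample one token index with temperature followed by min-p filtering."""
    # Temperature scaling, then a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep tokens with probability >= min_p * (probability of the top token).
    threshold = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= threshold]
    ids, weights = zip(*kept)

    # random.choices renormalizes the weights for us.
    token = rng.choices(ids, weights=weights, k=1)[0]
    return token, list(ids)


example_logits = [5.0, 4.0, 1.0, -2.0]  # hypothetical 4-token vocabulary
token, kept_ids = min_p_sample(example_logits, temperature=1.0, min_p=0.05)
print(kept_ids)  # [0, 1]
print(token)
```

In this example the last two tokens fall below 5% of the top token's probability, so only the first two can ever be sampled.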
FAQ
Q: Why not do anything constructive, like GRPO-tune a model of usable size?
A: Where's the fun in that?
Q: Are you, like, okay?
A: Objectively? Probably not. Subjectively? Never better.
Q: You know this still sucks for RP, right?
A: Yup. Should have pivoted to reasoning and code once R1 hit, but sunk cost and all kept me on this trajectory.