Libraries

How to use rAIfle/WAIDWML-Phi4-8x14B-bf16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="rAIfle/WAIDWML-Phi4-8x14B-bf16")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rAIfle/WAIDWML-Phi4-8x14B-bf16")
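# torch_dtype="auto" keeps the checkpoint's BF16 weights instead of upcasting them.
# device_map="auto" (needs the accelerate package) spreads the 92B params across available devices.
model = AutoModelForCausalLM.from_pretrained(
    "rAIfle/WAIDWML-Phi4-8x14B-bf16",
    torch_dtype="auto",
    device_map="auto",
)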
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use rAIfle/WAIDWML-Phi4-8x14B-bf16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "rAIfle/WAIDWML-Phi4-8x14B-bf16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "rAIfle/WAIDWML-Phi4-8x14B-bf16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/rAIfle/WAIDWML-Phi4-8x14B-bf16

SGLang

How to use rAIfle/WAIDWML-Phi4-8x14B-bf16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "rAIfle/WAIDWML-Phi4-8x14B-bf16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "rAIfle/WAIDWML-Phi4-8x14B-bf16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "rAIfle/WAIDWML-Phi4-8x14B-bf16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server with the same curl request as in the pip example above (OpenAI-compatible API, port 30000).

Docker Model Runner
How to use rAIfle/WAIDWML-Phi4-8x14B-bf16 with Docker Model Runner:
```
docker model run hf.co/rAIfle/WAIDWML-Phi4-8x14B-bf16
```

WAIDWML - What Am I Doing With My Life?

(8 Phi-4s in a trenchcoat)

Rationale

So there I was, finding some inspiration to tune stuff but lacking the disposable funds to do anything with the larger models. Enter Phi-4, a model designed for productivity... Initially it was just going to be a sequential series of finetunes, starting from the baseline Phi-4 and gradually adding more datasets until I either got bored or it got good, but then I had an idea: what if I just MoE'd it?

Yeah.

As a proof of concept, this wasn't too bad. The end result is... interesting, to say the least.

Training

As mentioned above, this was done in "phases", each with a separate dataset. Most were done with a max_seq_length of 32k; a few were dropped to 16k so they would fit on the hardware.

lr was all over the place, but in general somewhere between 4e-6 and 1e-5. These were all separate LoRAs using r=64 and alpha=32 with rsLoRA enabled. Epochs were 2 or 3 for everything except c2 (p5), as that would have taken far too long.
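For a sense of what enabling rsLoRA changes at r=64 and alpha=32, here is a minimal sketch of the scaling factor applied to the low-rank update. Standard LoRA scales by alpha / r, while rank-stabilized LoRA scales by alpha / sqrt(r). The helper name `lora_scale` is made up for illustration, and nothing here is the actual training code. In Hugging Face PEFT the matching switch is `use_rslora=True` on `LoraConfig`, though which library was used for these runs is not stated.

```python
import math


def lora_scale(alpha: float, r: int, rslora: bool) -> float:
    """Multiplier applied to the low-rank update B @ A (hypothetical helper)."""
    return alpha / math.sqrt(r) if rslora else alpha / r


r, alpha = 64, 32  # values from the text above
standard = lora_scale(alpha, r, rslora=False)
stabilized = lora_scale(alpha, r, rslora=True)

print(f"standard LoRA scale: {standard}")
print(f"rsLoRA scale:        {stabilized}")
```

With these values the rsLoRA multiplier is eight times the standard one, which is one reason a fairly low alpha can still work at a high rank.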

p1: Private RP dataset (RPT-Varied-Small)
p2: TheDrummer/AmoralQA-v2
p3: AIRRC/Eudaimonic
p4: Two private RP datasets (cc-gpt4-sfw-sharegpt & cc-gpt4-nsfw-sharegpt)
p5: A random subset of the infamous "c2"-logs dataset, cleaned and deduped (approx. 30%)
p6: Private RP dataset (RPT-Varied-Small_v1.5)
p7: NewEden/PIPPA-Mega-Filtered
p8: Squish42/bluemoon-fandom-1-1-rp-cleaned

(Note: the RPT-Varied-Small and RPT-Varied-Small_v1.5 datasets are due to be released after I manually verify their fitness.)

Once all LoRAs were trained, I merged each one separately into the base model, then used mergekit (config) to "merge" the results into a MoE. I chose to initialize the router randomly, as I was going to train that part later. After that, I trained the routing layers for 8 epochs with lr = 1e-6 and grimulkan/LimaRP-augmented as the dataset. It took roughly 8.5 hours on a 6xA40 instance on RunPod.
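To make "randomly initialized router" a bit more concrete, here is a toy sketch of top-k routing over 8 experts in plain Python. It is not mergekit's code and not this model's implementation. The hidden size, the init scale, and `TOP_K = 2` are assumptions picked for illustration (the number of experts used per token is not stated above). Before any router training, which experts a token lands on is essentially arbitrary, and that is what the later 8-epoch routing pass is meant to fix.

```python
import math
import random

NUM_EXPERTS = 8   # from the model name (8x14B)
TOP_K = 2         # assumption: experts per token is not stated in the text
HIDDEN = 16       # toy hidden size, far smaller than the real model


def init_router(hidden: int, num_experts: int, seed: int = 0) -> list[list[float]]:
    """Random router weights: one row of `hidden` weights per expert."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(hidden)] for _ in range(num_experts)]


def route(token: list[float], router: list[list[float]], top_k: int) -> dict[int, float]:
    """Return {expert_index: mixing_weight} for the top_k highest-scoring experts."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router]
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the selected logits only, so the mixing weights sum to 1.
    peak = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


rng = random.Random(1)
token = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]  # stand-in hidden state
weights = route(token, init_router(HIDDEN, NUM_EXPERTS), TOP_K)
print(weights)
```

The token's output would then be the weighted sum of the chosen experts' MLP outputs, using these mixing weights.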

Recommended Settings

Phi-4 format. What I used for my tests:

Temp 1
minP 0.05
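
For anyone unfamiliar with the minP knob: Temp 1 leaves the model's distribution unscaled, and min-p then drops every token whose probability is below min_p times that of the most likely token. A minimal sketch of that filter follows. The token distribution is made up, and `min_p_filter` is a hypothetical helper rather than any particular backend's implementation.

```python
def min_p_filter(probs: dict[str, float], min_p: float) -> dict[str, float]:
    """Keep tokens with prob >= min_p * max prob, then renormalize."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Hypothetical next-token distribution (already softmaxed at temperature 1).
probs = {"the": 0.60, "a": 0.25, "an": 0.12, "zebra": 0.02, "qux": 0.01}
filtered = min_p_filter(probs, min_p=0.05)
print(filtered)
```

Here the cutoff is 0.05 * 0.60 = 0.03, so "zebra" and "qux" are removed and sampling continues over the remaining three tokens.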

FAQ

Q: Why not do anything constructive, like GRPO-tune a model of usable size?
A: Where's the fun in that?

Q: Are you, like, okay?
A: Objectively? Probably not. Subjectively? Never better.

Q: You know this still sucks for RP, right?
A: Yup. Should have pivoted to reasoning and code once R1 hit, but sunk cost and all kept me on this trajectory.

Model size: 92B params
Tensor type: BF16 (Safetensors)
Base model: microsoft/phi-4 (via the unsloth/phi-4 finetune)
Quantizations: 2 models
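
Why 92B rather than 8 x 14B = 112B? A rough back-of-the-envelope sketch is below. It assumes a Mixtral-style layout where only the MLP blocks are duplicated per expert while attention and embeddings are shared. The Phi-4 dimensions are also assumptions, since they are not stated on this page, and norm weights are ignored. Treat it as a plausibility check, not an exact accounting.

```python
# Assumed Phi-4 dimensions (not stated on this page).
HIDDEN = 5120
INTERMEDIATE = 17920
LAYERS = 40
VOCAB = 100352
Q_HEADS, KV_HEADS = 40, 10
HEAD_DIM = HIDDEN // Q_HEADS
NUM_EXPERTS = 8

# Per-layer weights (biases and norm weights ignored).
mlp_per_layer = 3 * HIDDEN * INTERMEDIATE                  # gate, up and down projections
attn_per_layer = (
    HIDDEN * (Q_HEADS + 2 * KV_HEADS) * HEAD_DIM           # q, k, v projections
    + HIDDEN * HIDDEN                                      # output projection
)
router_per_layer = HIDDEN * NUM_EXPERTS                    # one logit per expert

embeddings = 2 * VOCAB * HIDDEN                            # input embedding + untied LM head
shared = LAYERS * attn_per_layer + embeddings              # assumed shared across experts

dense_total = shared + LAYERS * mlp_per_layer
moe_total = shared + LAYERS * (NUM_EXPERTS * mlp_per_layer + router_per_layer)

print(f"dense Phi-4:  {dense_total / 1e9:.1f}B")
print(f"8-expert MoE: {moe_total / 1e9:.1f}B")
```

Under these assumptions the total lands close to the 92B listed above, since the shared attention and embedding weights are only counted once.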