EngGPT2-16B-A3B MLX 4-bit
This repository contains a 4-bit MLX conversion of engineering-group/EngGPT2-16B-A3B.
This is the recommended version for local Apple Silicon use among the three published MLX conversions.
Important compatibility note
This model currently requires custom enggpt_moe support in mlx-lm.
Standard upstream mlx-lm may not load this model unless support for model_type = enggpt_moe has been merged.
A patched mlx-lm fork is available here:
https://github.com/robertobissanti/mlx-lm-enggpt
A copy of the custom model implementation is also included in this repository:
custom_mlx/enggpt_moe.py
Until upstream support is available, users must either:
- install the patched fork, or
- copy custom_mlx/enggpt_moe.py into their local mlx_lm/models/ directory.
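Before loading the model, it can help to confirm that the active Python environment actually contains the enggpt_moe module. The sketch below assumes that mlx-lm resolves an architecture by importing mlx_lm.models.<model_type>, which is what the copy instruction above implies. The helper name has_model_module is hypothetical and not part of mlx-lm.

```python
import importlib.util


def has_model_module(model_type: str, package: str = "mlx_lm.models") -> bool:
    """Return True if `package` contains a module named after `model_type`.

    Assumption: mlx-lm looks up architectures as mlx_lm.models.<model_type>.
    """
    try:
        return importlib.util.find_spec(f"{package}.{model_type}") is not None
    except ModuleNotFoundError:
        # The parent package itself (for example mlx_lm) is not installed.
        return False


# Example call, once the patched fork is installed or the file is copied:
#   has_model_module("enggpt_moe")
```

If the call returns False, the environment is probably using stock mlx-lm without the custom model file.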
Original model
Base model:
engineering-group/EngGPT2-16B-A3B
The original model is a custom MoE decoder-only language model.
Approximate architecture:
- 24 decoder layers
- hidden size 2880
- 32 attention heads
- 4 key/value heads
- explicit head_dim = 128
- Q/K RMSNorm inside attention
- RoPE with rope_theta = 1000000.0
- 64 experts per MoE layer
- top-8 expert routing
- SwiGLU experts
- untied lm_head
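A few derived quantities follow directly from the figures above and explain why head_dim is listed explicitly: 2880 / 32 = 90, so the attention width (32 × 128 = 4096) does not equal the hidden size. The sketch below only does this arithmetic. The 2-byte KV cache element size is an assumption (16-bit cache values), and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchConfig:
    # Values copied from the approximate architecture list above.
    num_layers: int = 24
    hidden_size: int = 2880
    num_heads: int = 32
    num_kv_heads: int = 4
    head_dim: int = 128
    num_experts: int = 64
    experts_per_token: int = 8


def derived_quantities(cfg: ArchConfig, kv_bytes_per_value: int = 2) -> dict:
    """Compute sizes implied by the config (kv_bytes_per_value is an assumption)."""
    return {
        "q_proj_width": cfg.num_heads * cfg.head_dim,
        "kv_proj_width": cfg.num_kv_heads * cfg.head_dim,
        "queries_per_kv_head": cfg.num_heads // cfg.num_kv_heads,
        "active_expert_fraction": cfg.experts_per_token / cfg.num_experts,
        # Keys and values (factor 2), for every layer and KV head.
        "kv_cache_bytes_per_token": 2 * cfg.num_layers * cfg.num_kv_heads
        * cfg.head_dim * kv_bytes_per_value,
    }
```

Under these assumptions each generated token adds 49152 bytes (48 KiB) to the KV cache, and each token is routed through 8 of 64 experts (12.5%) per MoE layer.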
Tested setup
Tested locally on:
- Mac Studio M1 Ultra
- 64 GB unified memory
- macOS Apple Silicon
- patched mlx-lm fork: robertobissanti/mlx-lm-enggpt
Approximate local benchmark:
| Version | Disk size | Generation speed | Peak memory |
| --- | --- | --- | --- |
| MLX 4-bit | ~15 GB | ~90–94 tok/s | ~15.8 GB |
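To get a feel for what the measured speed means in practice, the sketch below converts a token budget into a rough decode-time range. It uses only the ~90–94 tok/s figure from the benchmark above, ignores prompt processing time, and will not transfer directly to other hardware. The function name is illustrative.

```python
def generation_seconds(
    num_tokens: int,
    tok_per_s_low: float = 90.0,
    tok_per_s_high: float = 94.0,
) -> tuple[float, float]:
    """Return (fastest, slowest) decode time in seconds for num_tokens."""
    return num_tokens / tok_per_s_high, num_tokens / tok_per_s_low
```

For a 512-token reply this gives roughly 5.4 to 5.7 seconds on the tested machine.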
Installation
Install the patched mlx-lm fork:
python3 -m venv .venv-mlx-enggpt
source .venv-mlx-enggpt/bin/activate
pip install -U pip
pip install git+https://github.com/robertobissanti/mlx-lm-enggpt.git
Download
hf download robertobissanti/EngGPT2-16B-A3B-MLX-4bit \
--local-dir EngGPT2-16B-A3B-MLX-4bit
Usage
python -m mlx_lm generate \
--model ./EngGPT2-16B-A3B-MLX-4bit \
--prompt "Spiegami in italiano, in due frasi, che cos'è un modello Mixture of Experts." \
--trust-remote-code \
--chat-template-config '{"enable_thinking": false}' \
--temp 0.1 \
--max-tokens 160
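When the same command is launched from a script, passing the arguments as a list avoids shell quoting problems with the apostrophe in the prompt and with the JSON value. The following is a minimal sketch that mirrors the flags shown above. The function names are illustrative, and the command is assumed to run in the environment where the patched mlx-lm fork is installed.

```python
import json
import subprocess
import sys


def build_generate_command(
    model_path: str,
    prompt: str,
    enable_thinking: bool = False,
    temp: float = 0.1,
    max_tokens: int = 160,
) -> list[str]:
    """Build the argv list for `python -m mlx_lm generate` as used above."""
    return [
        sys.executable, "-m", "mlx_lm", "generate",
        "--model", model_path,
        "--prompt", prompt,
        "--trust-remote-code",
        # json.dumps turns Python False into the JSON literal false.
        "--chat-template-config", json.dumps({"enable_thinking": enable_thinking}),
        "--temp", str(temp),
        "--max-tokens", str(max_tokens),
    ]


def run_generate(model_path: str, prompt: str) -> str:
    """Run the command and return its standard output (requires mlx-lm)."""
    completed = subprocess.run(
        build_generate_command(model_path, prompt),
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout
```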
Local OpenAI-compatible server
python -m mlx_lm server \
--model ./EngGPT2-16B-A3B-MLX-4bit \
--host 127.0.0.1 \
--port 8080 \
--trust-remote-code \
--chat-template-args '{"enable_thinking": false}' \
--temp 0.1 \
--top-p 1.0 \
--max-tokens 512
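Scripts that start the server and then send requests usually need to wait until it is reachable. The sketch below polls the TCP port used in the command above. It is a generic readiness check, not an mlx-lm feature: an open port only shows that the HTTP server is listening, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(
    host: str = "127.0.0.1",
    port: int = 8080,
    timeout: float = 60.0,
    interval: float = 0.5,
) -> bool:
    """Poll host:port until a TCP connection succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait and try again.
            time.sleep(interval)
    return False
```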
Example request:
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "./EngGPT2-16B-A3B-MLX-4bit",
"messages": [
{
"role": "user",
"content": "Spiegami in italiano, in due frasi, che cos è un modello Mixture of Experts."
}
],
"temperature": 0.1,
"max_tokens": 160
}'
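The same request can be sent from Python with only the standard library. This is a minimal sketch that assumes the server is running as started above and returns an OpenAI-style chat completion body (choices[0].message.content). The function names are illustrative. Because the payload is JSON-encoded by the code, the prompt can keep its apostrophe.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "./EngGPT2-16B-A3B-MLX-4bit",
    base_url: str = "http://127.0.0.1:8080/v1",
    temperature: float = 0.1,
    max_tokens: int = 160,
) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response (assumed shape)."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```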
Open WebUI example
You can run this model through mlx_lm.server and connect Open WebUI to the local OpenAI-compatible endpoint.
Start the MLX server:
python -m mlx_lm server \
--model ./EngGPT2-16B-A3B-MLX-4bit \
--host 127.0.0.1 \
--port 8080 \
--trust-remote-code \
--chat-template-args '{"enable_thinking": false}' \
--temp 0.1 \
--top-p 1.0 \
--max-tokens 512
Start Open WebUI:
docker run -d \
--name open-webui \
-p 3000:8080 \
-e WEBUI_AUTH=false \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 \
-e OPENAI_API_KEY=dummy \
-v open-webui:/app/backend/data \
--restart unless-stopped \
ghcr.io/open-webui/open-webui:main
Then open:
http://localhost:3000/
When configuring the model, use the local model path as the model id, for example:
./EngGPT2-16B-A3B-MLX-4bit
or the absolute local path used when starting the server.
Chat template note
The tokenizer/chat template may emit reasoning ("thinking") blocks in the output unless thinking is disabled.
Use:
{"enable_thinking": false}
With mlx_lm.generate:
--chat-template-config '{"enable_thinking": false}'
With mlx_lm.server:
--chat-template-args '{"enable_thinking": false}'
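If a client cannot control these flags, the reasoning block can be removed from the response text after the fact. The sketch below assumes the block is wrapped in <think>...</think> tags. That tag name is an assumption and should be checked against the model's chat template. The function name is illustrative.

```python
import re


def strip_thinking(text: str, tag: str = "think") -> str:
    """Remove <tag>...</tag> blocks from text (the tag name is an assumption)."""
    escaped = re.escape(tag)
    # DOTALL lets the block span several lines; the match is non-greedy.
    pattern = re.compile(rf"<{escaped}>.*?</{escaped}>\s*", re.DOTALL)
    return pattern.sub("", text).strip()
```

Text without such a block is returned unchanged apart from surrounding whitespace, and an unclosed block is left in place.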
Known limitations
- Requires custom enggpt_moe support in mlx-lm.
- Does not currently work in standard LM Studio unless its backend includes enggpt_moe.
- Does not currently work in Ollama because no GGUF conversion exists.
- llama.cpp conversion currently fails with:
Model EngGPTMoeForCausalLM is not supported
- This is an experimental community conversion.
- The model may still require careful prompting for technical accuracy in long-form answers.
License
This repository is a derived MLX conversion of:
engineering-group/EngGPT2-16B-A3B
Please refer to the original model repository for license terms and usage restrictions.
The original model is distributed under the EngGPT Non-Commercial License. Commercial use may be restricted or prohibited by the original license.