xuebi committed
Commit a5c26e6 · 1 Parent(s): 2713e50
Files changed (2):
  1. README.md +1 -0
  2. docs/mlx_deploy_guide.md +70 -0
README.md CHANGED
@@ -171,6 +171,7 @@ We recommend using [Transformers](https://github.com/huggingface/transformers) t
 
 ### Other Inference Engines
 
+- [MLX-LM](./docs/mlx_deploy_guide.md)
 - [KTransformers](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/MiniMax-M2.1-Tutorial.md)
 
 ### Inference Parameters
docs/mlx_deploy_guide.md ADDED
@@ -0,0 +1,70 @@
## MLX deployment guide

Run, serve, and fine-tune [**MiniMax-M2.1**](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) locally on your Mac using the **MLX** framework. This guide gets you up and running quickly.

> **Requirements**
> - Apple Silicon Mac (M3 Ultra or later)
> - **At least 256GB of unified memory (RAM)**
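
To confirm the memory requirement is met, you can query your Mac's total unified memory from the terminal (a standard macOS command, unrelated to `mlx-lm`):

```bash
# Report total physical (unified) memory in bytes
sysctl -n hw.memsize
```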

**Installation**

Install the `mlx-lm` package via pip:

```bash
pip install -U mlx-lm
```
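
To verify the installation, you can print the help text of one of the CLI entry points that `mlx-lm` installs on your PATH:

```bash
# Should list the generator's usage and available flags
mlx_lm.generate --help
```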

**CLI**

Generate text directly from the terminal:

```bash
mlx_lm.generate \
  --model mlx-community/MiniMax-M2.1-4bit \
  --prompt "How tall is Mount Everest?"
```

> Add `--max-tokens 256` to control response length, or `--temp 0.7` for creativity.
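
**Serving**

`mlx-lm` also bundles `mlx_lm.server`, an OpenAI-compatible HTTP server. A minimal sketch (the port is arbitrary, and all flags beyond `--model` and `--port` are omitted):

```bash
# Start an OpenAI-compatible server on localhost:8080
mlx_lm.server --model mlx-community/MiniMax-M2.1-4bit --port 8080
```

Then query it from another terminal via the `/v1/chat/completions` endpoint:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How tall is Mount Everest?"}], "max_tokens": 128}'
```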

**Python Script Example**

Use `mlx-lm` in your own Python scripts:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load the quantized model and its tokenizer
model, tokenizer = load("mlx-community/MiniMax-M2.1-4bit")

prompt = "Hello, how are you?"

# Apply the chat template if available (recommended for chat models)
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

# Generate a response; recent mlx-lm releases take sampling settings
# (such as temperature) via a sampler rather than a `temp` keyword
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=256,
    sampler=make_sampler(temp=0.7),
    verbose=True,
)

print(response)
```
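
For chat-style applications you may want tokens as they arrive rather than one final string. A minimal streaming sketch, assuming a recent `mlx-lm` where `stream_generate` yields chunks exposing the newly decoded text as `.text`:

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/MiniMax-M2.1-4bit")

# Build a chat-formatted prompt as in the example above
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about the sea."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Print each chunk of text as soon as it is generated
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=128):
    print(chunk.text, end="", flush=True)
print()
```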

**Tips**
- **Model variants**: Check this [MLX community collection on Hugging Face](https://huggingface.co/collections/mlx-community/minimax-m2.1) for `MiniMax-M2.1-4bit`, `6bit`, `8bit`, or `bfloat16` versions.
- **Fine-tuning**: Use `mlx_lm.lora` for parameter-efficient fine-tuning (LoRA); a minimal sketch follows below.
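A minimal LoRA run might look like this; `./data` is a placeholder directory that `mlx_lm.lora` expects to contain `train.jsonl` and `valid.jsonl` files, and the iteration count is only illustrative:

```bash
# Train LoRA adapters on top of the quantized model
mlx_lm.lora \
  --model mlx-community/MiniMax-M2.1-4bit \
  --train \
  --data ./data \
  --iters 200
```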

**Resources**
- GitHub: [https://github.com/ml-explore/mlx-lm](https://github.com/ml-explore/mlx-lm)
- Models: [https://huggingface.co/mlx-community](https://huggingface.co/mlx-community)