---

language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
library_name: transformers
tags:
- mergekit
- merge
- funny
- conversational
- text-generation
- chihuahua-powerful
- boolean-expression-champion
- math-avoider
- object-counting-struggler
base_model:
- Qwen/Qwen2.5-7B-Instruct
- fblgit/cybertron-v4-qw7B-MGS
- huihui-ai/Qwen2.5-7B-Instruct-abliterated-v3
- FreedomIntelligence/HuatuoGPT-o1-7B
- rombodawg/Rombos-LLM-V2.5-Qwen-7b
model-index:
- name: Qwen2.5-THREADRIPPER-Small
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: HuggingFaceH4/ifeval
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 76.89
      name: strict accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Xiaojian9992024/Qwen2.5-THREADRIPPER-Small
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: BBH
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 35.79
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Xiaojian9992024/Qwen2.5-THREADRIPPER-Small
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: hendrycks/competition_math
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 47.36
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Xiaojian9992024/Qwen2.5-THREADRIPPER-Small
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 8.05
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Xiaojian9992024/Qwen2.5-THREADRIPPER-Small
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 13.93
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Xiaojian9992024/Qwen2.5-THREADRIPPER-Small
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 37.3
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=Xiaojian9992024/Qwen2.5-THREADRIPPER-Small
      name: Open LLM Leaderboard
---

# Xiaojian9992024/Qwen2.5-THREADRIPPER-Small - The "Small" is Just for Show (and Benchmarks)

## Model Description:

Behold, the Qwen2.5-THREADRIPPER-Small! Don't let the "Small" in the name fool you; this model is **compactly powerful**... in the same way a chihuahua is "compactly powerful."  We merged a bunch of models together using some fancy algorithm called "Linear DELLA" because, frankly, we thought it sounded cool.  Did it work?  Well...

Think of this model as the Frankenstein's monster of language models, but instead of being scary, it's just... kinda there.  It's built upon the mighty Qwen2.5-7B-Instruct, and then we threw in a cybernetic one, an "abliterated" one (we're not sure what that means either), one that's good at medical stuff (maybe it can diagnose your code?), and another one named "Rombos" because why not?

## Merge Details
### Merge Method

This model was merged using the [Linear DELLA](https://arxiv.org/abs/2406.11617) merge method using [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) as a base, because "Linear DELLA" sounded like it knew what it was doing. (Spoiler: jury's still out).
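For the curious (and the skeptical), the idea behind a linear DELLA merge is roughly: take each fine-tuned model's delta from the base, randomly drop most of each delta, rescale the survivors so the expected value is preserved, then add the weighted sum of deltas back onto the base. Below is a toy sketch of that drop-and-rescale mechanism; this is our illustration with a flat keep-probability per model, not mergekit's actual implementation (real DELLA ties each parameter's keep-probability to its delta-magnitude rank):

```python
import torch

def della_linear_sketch(base, experts, weights, densities):
    """Toy drop-and-rescale linear delta merge for a single weight tensor.

    Real DELLA (MAGPRUNE) derives each parameter's keep-probability from
    its delta-magnitude rank; this sketch uses one flat keep-probability
    per expert model just to show the mechanism.
    """
    merged_delta = torch.zeros_like(base)
    for expert, w, p in zip(experts, weights, densities):
        delta = expert - base                        # "task vector" vs. the base
        keep = (torch.rand_like(delta) < p).float()  # drop ~(1 - p) of the delta
        merged_delta += w * keep * delta / p         # rescale so E[kept] = delta
    return base + merged_delta

# Five "experts" jittered off a shared base, mirroring the weights/densities
# in the config further down this card.
base = torch.randn(4, 4)
experts = [base + 0.1 * torch.randn(4, 4) for _ in range(5)]
merged = della_linear_sketch(
    base, experts,
    weights=[0.65, 0.1, 0.15, 0.05, 0.05],
    densities=[0.65, 0.1, 0.15, 0.05, 0.05],
)
```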

### Models Merged

We lovingly stitched together the following models to create this... unique entity:
* [fblgit/cybertron-v4-qw7B-MGS](https://huggingface.co/fblgit/cybertron-v4-qw7B-MGS) - For that cybernetic edge (we hope).
* [huihui-ai/Qwen2.5-7B-Instruct-abliterated-v3](https://huggingface.co/huihui-ai/Qwen2.5-7B-Instruct-abliterated-v3) -  "Abliterated" - sounds intense, right? We're banking on it.
* [FreedomIntelligence/HuatuoGPT-o1-7B](https://huggingface.co/FreedomIntelligence/HuatuoGPT-o1-7B) -  Maybe it can give your code a check-up? (Probably not).
* [rombodawg/Rombos-LLM-V2.5-Qwen-7b](https://huggingface.co/rombodawg/Rombos-LLM-V2.5-Qwen-7b) - Because every good merge needs a "Rombos".
* [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) -  Our solid, dependable base.  Relatively speaking.

### Benchmarks - Prepare for a Wild Ride (Mostly Downhill):

Okay, let's talk numbers. We ran this bad boy on the Open LLM Leaderboard, and the results are... well, they're numbers!  Don't stare directly at them for too long.

*   **MMLU-Pro: Accuracy: 43.5%** (37.3 after the leaderboard's normalization).  That clears random guessing on a ten-option test, but "clears random guessing" is not a sales pitch.  At least it's consistently around 43%! We're aiming for consistency here, folks. Participation trophies for everyone!
*   **BBH - Boolean Expressions: Accuracy (Normalized): 83.6%**.  BOOM!  Boolean expressions? Nailed it!  Ask it if "true and false or true" is true, and it'll get it right most of the time.  Boolean logic?  Bring it on!  Existential questions? Maybe not.
*   **BBH - Object Counting: Accuracy (Normalized): 33.6%**.  Counting objects?  Apparently, this is where our Threadripper trips over its own feet.  Maybe it needs glasses? Or perhaps objects are just inherently confusing, even for advanced AI. We blame the objects.
*   **BBH - Tracking Shuffled Objects (7 objects): Accuracy (Normalized): 14.4%**.  Seven objects?  Forget about it.  Three objects? Still bad (22.4%). Five objects?  Somehow even worse (22%).  If you need something tracked, maybe use GPS, not this model.  Unless you're tracking Boolean values. Then we're golden.
*   **GPQA: Accuracy (Normalized): ~30%**.  GPQA? More like GP-"Q"-Maybe-"A".  It's trying its best, okay?  Lower your expectations.
*   **Math Hard (Algebra, Counting, Geometry, etc.): Exact Match: 0.0%**.  Zero. Zilch. Nada.  If you need help with your math homework, please, for the love of numbers, use a calculator. Or ask a human.  Or a very sophisticated pigeon.  Anything but this for math.  Seriously, *anything*.

### Intended Use:

*   **Conversational?** Sure, if you like conversations that are 43.5% accurate on general knowledge and amazing at Boolean expressions but can't count or track objects.  It's definitely *a* conversation starter... about the limitations of language models.
*   **Text Generation?** Absolutely! It generates text.  Whether that text is coherent, accurate, or helpful is another question entirely.  But it *does* generate text, and sometimes that's the best you can ask for.  Think of it as performance art.
*   **Funny Model Cards?**  Clearly, yes.  It excels at providing benchmark data that is hilarious when you try to spin it positively.  We're leaning into our strengths here.

### Limitations:

*   Math. Just... math.  Avoid math.  Run away from math.  If you even *think* about math, this model will give you a blank stare and possibly start reciting Boolean expressions for comfort.
*   Object counting and tracking.  Objects are its nemesis.  Especially when shuffled.  Or when there are more than two.  Actually, just avoid objects in general.  Stick to abstract concepts.
*   GPQA.  We're still not sure what GPQA is, and neither is the model, apparently.  It's a mystery for the ages.
*   May occasionally hallucinate benchmark scores that are *slightly* better than reality (we're working on our honesty module... or maybe not).

### How to Use:

Use responsibly?  Or irresponsibly; we're not your boss.  Just don't expect it to balance your checkbook or track your keys.  For Boolean expressions, though?  It's your champion.  Need to know if "cat is animal AND animal has fur"?  This model's got you.  A minimal loading sketch follows below.
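Since this is a garden-variety Qwen2.5-7B checkpoint under the hood, the standard `transformers` chat-template recipe should work. A minimal sketch (we haven't verified it against this exact repo, so treat the dtype and device settings as assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Xiaojian9992024/Qwen2.5-THREADRIPPER-Small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Play to its one verified strength: Boolean expressions, not object counting.
messages = [
    {"role": "user",
     "content": "Evaluate: (True and False) or True. Answer True or False."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```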

### Disclaimer:

Side effects of using this model may include: existential dread, questioning the nature of intelligence, and a sudden urge to count shuffled objects yourself to prove you're better than a language model.  Use at your own risk.  But hey, at least it's small! And sometimes, small and funny is all you need.

### Configuration

The following YAML configuration was used to produce this model (because you asked, not because we understand it):

```yaml
merge_method: della_linear
base_model: Qwen/Qwen2.5-7B-Instruct
dtype: bfloat16
parameters:
  epsilon: 0.015            # Fine-grain scaling for precision. (Sounds important!)
  lambda: 1.6               # Strong emphasis on top-performing models. (We aimed high!)
  normalize: true           # Stable parameter integration across models. (Stability is key, even if accuracy isn't.)
adaptive_merge_parameters:
  task_weights:
    tinyArc: 1.75           # Logical reasoning. (For when you need *some* logic.)
    tinyHellaswag: 1.65     # Contextual predictions. (It tries, bless its heart.)
    tinyMMLU: 1.8           # Domain knowledge. (Limited domains, mostly Boolean expressions.)
    tinyTruthfulQA: 2.0     # Prioritize truthful reasoning. (Truthful-ish. Mostly.)
    tinyTruthfulQA_mc1: 1.85 # Even more truthful-ishness!
    tinyWinogrande: 1.9     # Advanced reasoning and predictions. (Baby steps in advanced reasoning.)
    IFEval: 2.1             # Instruction-following and multitasking. (Instructions followed loosely. Multitasking? Define "task".)
    BBH: 2.25               # Complex reasoning. (Boolean expressions are complex, right?)
    MATH: 2.4               # Mathematical reasoning. (Just kidding. See benchmarks.)
    GPQA: 2.35              # Factual QA. (Facts are... subjective?)
    MUSR: 2.3               # Multi-step reasoning. (One step at a time, maybe?)
    MMLU-PRO: 2.35          # Domain multitask performance. (Boolean expressions. We keep coming back to Boolean expressions.)
  smoothing_factor: 0.05    # TURN UP THE SMOOTH! (Smoothness is next to godliness, or at least accuracy.)
models:
  - model: Qwen/Qwen2.5-7B-Instruct
    parameters:
      weight: 0.65          # The heavy lifter (relatively speaking).
      density: 0.65         # Dense with... something.
  - model: huihui-ai/Qwen2.5-7B-Instruct-abliterated-v3
    parameters:
      weight: 0.1           # A touch of "abliteration" for flavor.
      density: 0.1          # Just a sprinkle.
  - model: rombodawg/Rombos-LLM-V2.5-Qwen-7b
    parameters:
      weight: 0.15          # Rombos-ness, now in model form!
      density: 0.15         # A dash more density.
  - model: fblgit/cybertron-v4-qw7B-MGS
    parameters:
      weight: 0.05          # Cybertron! Pew pew! (Performance may vary.)
      density: 0.05         # A smidgen of cybernetics.
  - model: FreedomIntelligence/HuatuoGPT-o1-7B
    parameters:
      weight: 0.05          # Medical intelligence? Maybe?
      density: 0.05         # Homeopathic dose of medical knowledge.
```
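If you want to reproduce this merge (why?), mergekit consumes configs like the one above. Here's a sketch using mergekit's Python entry points; the `MergeConfiguration`/`run_merge` API shown follows mergekit's README at time of writing and may shift between versions, so verify against the current docs before trusting it:

```python
import yaml

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# The YAML config from this card, assumed saved as config.yaml next to this script.
with open("config.yaml", "r", encoding="utf-8") as f:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(f))

run_merge(
    merge_config,
    "./Qwen2.5-THREADRIPPER-Small",  # output directory for the merged weights
    options=MergeOptions(
        cuda=False,           # flip to True if you have the VRAM
        copy_tokenizer=True,  # carry the base tokenizer into the output
        lazy_unpickle=True,   # easier on system RAM
    ),
)
```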

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/Xiaojian9992024__Qwen2.5-THREADRIPPER-Small-details)



|      Metric       |Value|
|-------------------|----:|
|Avg.               |36.55|
|IFEval (0-Shot)    |76.89|
|BBH (3-Shot)       |35.79|
|MATH Lvl 5 (4-Shot)|47.36|
|GPQA (0-shot)      | 8.05|
|MuSR (0-shot)      |13.93|
|MMLU-PRO (5-shot)  |37.30|