BeeGass committed on
Commit
b570c40
·
verified ·
1 Parent(s): 380c94c

Fix dataset config: remove features sections and use data-* patterns to exclude cache files

Files changed (1)
  1. README.md +784 -0
README.md ADDED
@@ -0,0 +1,784 @@
1
+ ---
2
+ license: mit
3
+ task_categories:
4
+ - text-generation
5
+ language:
6
+ - en
7
+ tags:
8
+ - mathematics
9
+ - group-theory
10
+ - permutations
11
+ - symbolic-reasoning
12
+ - algebra
13
+ - sequence-modeling
14
+ - state-space-models
15
+ - computational-complexity
16
+ pretty_name: Group Theory Collection
17
+ size_categories:
18
+ - 10M<n<100M
19
+ configs:
20
+ - config_name: default
21
+ data_files:
22
+ - split: train
23
+ path: data/*/train/data-*
24
+ - split: test
25
+ path: data/*/test/data-*
26
+ - config_name: s3
27
+ data_files:
28
+ - split: train
29
+ path: data/s3/train/data-*
30
+ - split: test
31
+ path: data/s3/test/data-*
32
+ - config_name: s4
33
+ data_files:
34
+ - split: train
35
+ path: data/s4/train/data-*
36
+ - split: test
37
+ path: data/s4/test/data-*
38
+ - config_name: s5
39
+ data_files:
40
+ - split: train
41
+ path: data/s5/train/data-*
42
+ - split: test
43
+ path: data/s5/test/data-*
44
+ - config_name: s6
45
+ data_files:
46
+ - split: train
47
+ path: data/s6/train/data-*
48
+ - split: test
49
+ path: data/s6/test/data-*
50
+ - config_name: s7
51
+ data_files:
52
+ - split: train
53
+ path: data/s7/train/data-*
54
+ - split: test
55
+ path: data/s7/test/data-*
56
+ - config_name: s8
57
+ data_files:
58
+ - split: train
59
+ path: data/s8/train/data-*
60
+ - split: test
61
+ path: data/s8/test/data-*
62
+ - config_name: s9
63
+ data_files:
64
+ - split: train
65
+ path: data/s9/train/data-*
66
+ - split: test
67
+ path: data/s9/test/data-*
68
+ - config_name: a3
69
+ data_files:
70
+ - split: train
71
+ path: data/a3/train/data-*
72
+ - split: test
73
+ path: data/a3/test/data-*
74
+ - config_name: a4
75
+ data_files:
76
+ - split: train
77
+ path: data/a4/train/data-*
78
+ - split: test
79
+ path: data/a4/test/data-*
80
+ - config_name: a5
81
+ data_files:
82
+ - split: train
83
+ path: data/a5/train/data-*
84
+ - split: test
85
+ path: data/a5/test/data-*
86
+ - config_name: a6
87
+ data_files:
88
+ - split: train
89
+ path: data/a6/train/data-*
90
+ - split: test
91
+ path: data/a6/test/data-*
92
+ - config_name: a7
93
+ data_files:
94
+ - split: train
95
+ path: data/a7/train/data-*
96
+ - split: test
97
+ path: data/a7/test/data-*
98
+ - config_name: a8
99
+ data_files:
100
+ - split: train
101
+ path: data/a8/train/data-*
102
+ - split: test
103
+ path: data/a8/test/data-*
104
+ - config_name: a9
105
+ data_files:
106
+ - split: train
107
+ path: data/a9/train/data-*
108
+ - split: test
109
+ path: data/a9/test/data-*
110
+ - config_name: c2
111
+ data_files:
112
+ - split: train
113
+ path: data/c2/train/data-*
114
+ - split: test
115
+ path: data/c2/test/data-*
116
+ - config_name: c3
117
+ data_files:
118
+ - split: train
119
+ path: data/c3/train/data-*
120
+ - split: test
121
+ path: data/c3/test/data-*
122
+ - config_name: c4
123
+ data_files:
124
+ - split: train
125
+ path: data/c4/train/data-*
126
+ - split: test
127
+ path: data/c4/test/data-*
128
+ - config_name: c5
129
+ data_files:
130
+ - split: train
131
+ path: data/c5/train/data-*
132
+ - split: test
133
+ path: data/c5/test/data-*
134
+ - config_name: c6
135
+ data_files:
136
+ - split: train
137
+ path: data/c6/train/data-*
138
+ - split: test
139
+ path: data/c6/test/data-*
140
+ - config_name: c7
141
+ data_files:
142
+ - split: train
143
+ path: data/c7/train/data-*
144
+ - split: test
145
+ path: data/c7/test/data-*
146
+ - config_name: c8
147
+ data_files:
148
+ - split: train
149
+ path: data/c8/train/data-*
150
+ - split: test
151
+ path: data/c8/test/data-*
152
+ - config_name: c9
153
+ data_files:
154
+ - split: train
155
+ path: data/c9/train/data-*
156
+ - split: test
157
+ path: data/c9/test/data-*
158
+ - config_name: c10
159
+ data_files:
160
+ - split: train
161
+ path: data/c10/train/data-*
162
+ - split: test
163
+ path: data/c10/test/data-*
164
+ - config_name: c11
165
+ data_files:
166
+ - split: train
167
+ path: data/c11/train/data-*
168
+ - split: test
169
+ path: data/c11/test/data-*
170
+ - config_name: c12
171
+ data_files:
172
+ - split: train
173
+ path: data/c12/train/data-*
174
+ - split: test
175
+ path: data/c12/test/data-*
176
+ - config_name: c13
177
+ data_files:
178
+ - split: train
179
+ path: data/c13/train/data-*
180
+ - split: test
181
+ path: data/c13/test/data-*
182
+ - config_name: c14
183
+ data_files:
184
+ - split: train
185
+ path: data/c14/train/data-*
186
+ - split: test
187
+ path: data/c14/test/data-*
188
+ - config_name: c15
189
+ data_files:
190
+ - split: train
191
+ path: data/c15/train/data-*
192
+ - split: test
193
+ path: data/c15/test/data-*
194
+ - config_name: c16
195
+ data_files:
196
+ - split: train
197
+ path: data/c16/train/data-*
198
+ - split: test
199
+ path: data/c16/test/data-*
200
+ - config_name: c17
201
+ data_files:
202
+ - split: train
203
+ path: data/c17/train/data-*
204
+ - split: test
205
+ path: data/c17/test/data-*
206
+ - config_name: c18
207
+ data_files:
208
+ - split: train
209
+ path: data/c18/train/data-*
210
+ - split: test
211
+ path: data/c18/test/data-*
212
+ - config_name: c19
213
+ data_files:
214
+ - split: train
215
+ path: data/c19/train/data-*
216
+ - split: test
217
+ path: data/c19/test/data-*
218
+ - config_name: c20
219
+ data_files:
220
+ - split: train
221
+ path: data/c20/train/data-*
222
+ - split: test
223
+ path: data/c20/test/data-*
224
+ - config_name: c21
225
+ data_files:
226
+ - split: train
227
+ path: data/c21/train/data-*
228
+ - split: test
229
+ path: data/c21/test/data-*
230
+ - config_name: c22
231
+ data_files:
232
+ - split: train
233
+ path: data/c22/train/data-*
234
+ - split: test
235
+ path: data/c22/test/data-*
236
+ - config_name: c23
237
+ data_files:
238
+ - split: train
239
+ path: data/c23/train/data-*
240
+ - split: test
241
+ path: data/c23/test/data-*
242
+ - config_name: c24
243
+ data_files:
244
+ - split: train
245
+ path: data/c24/train/data-*
246
+ - split: test
247
+ path: data/c24/test/data-*
248
+ - config_name: c25
249
+ data_files:
250
+ - split: train
251
+ path: data/c25/train/data-*
252
+ - split: test
253
+ path: data/c25/test/data-*
254
+ - config_name: c26
255
+ data_files:
256
+ - split: train
257
+ path: data/c26/train/data-*
258
+ - split: test
259
+ path: data/c26/test/data-*
260
+ - config_name: c27
261
+ data_files:
262
+ - split: train
263
+ path: data/c27/train/data-*
264
+ - split: test
265
+ path: data/c27/test/data-*
266
+ - config_name: c28
267
+ data_files:
268
+ - split: train
269
+ path: data/c28/train/data-*
270
+ - split: test
271
+ path: data/c28/test/data-*
272
+ - config_name: c29
273
+ data_files:
274
+ - split: train
275
+ path: data/c29/train/data-*
276
+ - split: test
277
+ path: data/c29/test/data-*
278
+ - config_name: c30
279
+ data_files:
280
+ - split: train
281
+ path: data/c30/train/data-*
282
+ - split: test
283
+ path: data/c30/test/data-*
284
+ - config_name: d3
285
+ data_files:
286
+ - split: train
287
+ path: data/d3/train/data-*
288
+ - split: test
289
+ path: data/d3/test/data-*
290
+ - config_name: d4
291
+ data_files:
292
+ - split: train
293
+ path: data/d4/train/data-*
294
+ - split: test
295
+ path: data/d4/test/data-*
296
+ - config_name: d5
297
+ data_files:
298
+ - split: train
299
+ path: data/d5/train/data-*
300
+ - split: test
301
+ path: data/d5/test/data-*
302
+ - config_name: d6
303
+ data_files:
304
+ - split: train
305
+ path: data/d6/train/data-*
306
+ - split: test
307
+ path: data/d6/test/data-*
308
+ - config_name: d7
309
+ data_files:
310
+ - split: train
311
+ path: data/d7/train/data-*
312
+ - split: test
313
+ path: data/d7/test/data-*
314
+ - config_name: d8
315
+ data_files:
316
+ - split: train
317
+ path: data/d8/train/data-*
318
+ - split: test
319
+ path: data/d8/test/data-*
320
+ - config_name: d9
321
+ data_files:
322
+ - split: train
323
+ path: data/d9/train/data-*
324
+ - split: test
325
+ path: data/d9/test/data-*
326
+ - config_name: d10
327
+ data_files:
328
+ - split: train
329
+ path: data/d10/train/data-*
330
+ - split: test
331
+ path: data/d10/test/data-*
332
+ - config_name: d11
333
+ data_files:
334
+ - split: train
335
+ path: data/d11/train/data-*
336
+ - split: test
337
+ path: data/d11/test/data-*
338
+ - config_name: d12
339
+ data_files:
340
+ - split: train
341
+ path: data/d12/train/data-*
342
+ - split: test
343
+ path: data/d12/test/data-*
344
+ - config_name: d13
345
+ data_files:
346
+ - split: train
347
+ path: data/d13/train/data-*
348
+ - split: test
349
+ path: data/d13/test/data-*
350
+ - config_name: d14
351
+ data_files:
352
+ - split: train
353
+ path: data/d14/train/data-*
354
+ - split: test
355
+ path: data/d14/test/data-*
356
+ - config_name: d15
357
+ data_files:
358
+ - split: train
359
+ path: data/d15/train/data-*
360
+ - split: test
361
+ path: data/d15/test/data-*
362
+ - config_name: d16
363
+ data_files:
364
+ - split: train
365
+ path: data/d16/train/data-*
366
+ - split: test
367
+ path: data/d16/test/data-*
368
+ - config_name: d17
369
+ data_files:
370
+ - split: train
371
+ path: data/d17/train/data-*
372
+ - split: test
373
+ path: data/d17/test/data-*
374
+ - config_name: d18
375
+ data_files:
376
+ - split: train
377
+ path: data/d18/train/data-*
378
+ - split: test
379
+ path: data/d18/test/data-*
380
+ - config_name: d19
381
+ data_files:
382
+ - split: train
383
+ path: data/d19/train/data-*
384
+ - split: test
385
+ path: data/d19/test/data-*
386
+ - config_name: d20
387
+ data_files:
388
+ - split: train
389
+ path: data/d20/train/data-*
390
+ - split: test
391
+ path: data/d20/test/data-*
392
+ - config_name: q8
393
+ data_files:
394
+ - split: train
395
+ path: data/q8/train/data-*
396
+ - split: test
397
+ path: data/q8/test/data-*
398
+ - config_name: q16
399
+ data_files:
400
+ - split: train
401
+ path: data/q16/train/data-*
402
+ - split: test
403
+ path: data/q16/test/data-*
404
+ - config_name: q32
405
+ data_files:
406
+ - split: train
407
+ path: data/q32/train/data-*
408
+ - split: test
409
+ path: data/q32/test/data-*
410
+ - config_name: f20
411
+ data_files:
412
+ - split: train
413
+ path: data/f20/train/data-*
414
+ - split: test
415
+ path: data/f20/test/data-*
416
+ - config_name: f21
417
+ data_files:
418
+ - split: train
419
+ path: data/f21/train/data-*
420
+ - split: test
421
+ path: data/f21/test/data-*
422
+ - config_name: v4
423
+ data_files:
424
+ - split: train
425
+ path: data/v4/train/data-*
426
+ - split: test
427
+ path: data/v4/test/data-*
428
+ - config_name: z2_1
429
+ data_files:
430
+ - split: train
431
+ path: data/z2_1/train/data-*
432
+ - split: test
433
+ path: data/z2_1/test/data-*
434
+ - config_name: z2_2
435
+ data_files:
436
+ - split: train
437
+ path: data/z2_2/train/data-*
438
+ - split: test
439
+ path: data/z2_2/test/data-*
440
+ - config_name: z2_3
441
+ data_files:
442
+ - split: train
443
+ path: data/z2_3/train/data-*
444
+ - split: test
445
+ path: data/z2_3/test/data-*
446
+ - config_name: z2_4
447
+ data_files:
448
+ - split: train
449
+ path: data/z2_4/train/data-*
450
+ - split: test
451
+ path: data/z2_4/test/data-*
452
+ - config_name: z2_5
453
+ data_files:
454
+ - split: train
455
+ path: data/z2_5/train/data-*
456
+ - split: test
457
+ path: data/z2_5/test/data-*
458
+ - config_name: z3_1
459
+ data_files:
460
+ - split: train
461
+ path: data/z3_1/train/data-*
462
+ - split: test
463
+ path: data/z3_1/test/data-*
464
+ - config_name: z3_2
465
+ data_files:
466
+ - split: train
467
+ path: data/z3_2/train/data-*
468
+ - split: test
469
+ path: data/z3_2/test/data-*
470
+ - config_name: z3_3
471
+ data_files:
472
+ - split: train
473
+ path: data/z3_3/train/data-*
474
+ - split: test
475
+ path: data/z3_3/test/data-*
476
+ - config_name: z3_4
477
+ data_files:
478
+ - split: train
479
+ path: data/z3_4/train/data-*
480
+ - split: test
481
+ path: data/z3_4/test/data-*
482
+ - config_name: z5_1
483
+ data_files:
484
+ - split: train
485
+ path: data/z5_1/train/data-*
486
+ - split: test
487
+ path: data/z5_1/test/data-*
488
+ - config_name: z5_2
489
+ data_files:
490
+ - split: train
491
+ path: data/z5_2/train/data-*
492
+ - split: test
493
+ path: data/z5_2/test/data-*
494
+ - config_name: z5_3
495
+ data_files:
496
+ - split: train
497
+ path: data/z5_3/train/data-*
498
+ - split: test
499
+ path: data/z5_3/test/data-*
500
+ - config_name: z5_4
501
+ data_files:
502
+ - split: train
503
+ path: data/z5_4/train/data-*
504
+ - split: test
505
+ path: data/z5_4/test/data-*
506
+ - config_name: psl2_2
507
+ data_files:
508
+ - split: train
509
+ path: data/psl2_2/train/data-*
510
+ - split: test
511
+ path: data/psl2_2/test/data-*
512
+ - config_name: psl2_3
513
+ data_files:
514
+ - split: train
515
+ path: data/psl2_3/train/data-*
516
+ - split: test
517
+ path: data/psl2_3/test/data-*
518
+ - config_name: psl2_4
519
+ data_files:
520
+ - split: train
521
+ path: data/psl2_4/train/data-*
522
+ - split: test
523
+ path: data/psl2_4/test/data-*
524
+ - config_name: psl2_5
525
+ data_files:
526
+ - split: train
527
+ path: data/psl2_5/train/data-*
528
+ - split: test
529
+ path: data/psl2_5/test/data-*
530
+ - config_name: psl2_7
531
+ data_files:
532
+ - split: train
533
+ path: data/psl2_7/train/data-*
534
+ - split: test
535
+ path: data/psl2_7/test/data-*
536
+ - config_name: psl2_8
537
+ data_files:
538
+ - split: train
539
+ path: data/psl2_8/train/data-*
540
+ - split: test
541
+ path: data/psl2_8/test/data-*
542
+ - config_name: psl2_9
543
+ data_files:
544
+ - split: train
545
+ path: data/psl2_9/train/data-*
546
+ - split: test
547
+ path: data/psl2_9/test/data-*
548
+ - config_name: psl2_11
549
+ data_files:
550
+ - split: train
551
+ path: data/psl2_11/train/data-*
552
+ - split: test
553
+ path: data/psl2_11/test/data-*
554
+ - config_name: psl3_2
555
+ data_files:
556
+ - split: train
557
+ path: data/psl3_2/train/data-*
558
+ - split: test
559
+ path: data/psl3_2/test/data-*
560
+ - config_name: psl3_3
561
+ data_files:
562
+ - split: train
563
+ path: data/psl3_3/train/data-*
564
+ - split: test
565
+ path: data/psl3_3/test/data-*
566
+ - config_name: psl3_4
567
+ data_files:
568
+ - split: train
569
+ path: data/psl3_4/train/data-*
570
+ - split: test
571
+ path: data/psl3_4/test/data-*
572
+ - config_name: psl3_5
573
+ data_files:
574
+ - split: train
575
+ path: data/psl3_5/train/data-*
576
+ - split: test
577
+ path: data/psl3_5/test/data-*
578
+ - config_name: m11
579
+ data_files:
580
+ - split: train
581
+ path: data/m11/train/data-*
582
+ - split: test
583
+ path: data/m11/test/data-*
584
+ - config_name: m12
585
+ data_files:
586
+ - split: train
587
+ path: data/m12/train/data-*
588
+ - split: test
589
+ path: data/m12/test/data-*
590
+ ---
591
+
592
+ # Group Theory Collection
593
+
594
+ A collection of permutation composition datasets for 94 finite groups, organized by computational complexity class. It is designed for studying the "Illusion of State" phenomenon in state-space models and transformer architectures.
595
+
596
+ ## Overview
597
+
598
+ This dataset provides 94 individual permutation group datasets spanning 10 group families, organized to support research on the computational boundary between solvable and non-solvable groups. The organization reflects a basic distinction: composing elements of a solvable group is computable in TC⁰, whereas composing elements of a non-solvable group is NC¹-complete.
599
+
600
+ ### Research Motivation
601
+
602
+ Recent theoretical work (Merrill et al., 2024) shows that models whose computation is bounded by TC⁰, including Transformers and standard State-Space Models (SSMs), cannot solve NC¹-complete problems such as composing permutations in non-solvable groups, unless TC⁰ = NC¹. This dataset enables researchers to:
603
+
604
+ - Empirically verify theoretical computational complexity boundaries
605
+ - Study the "Illusion of State" phenomenon in neural architectures
606
+ - Benchmark mathematical reasoning capabilities of sequence models
607
+ - Investigate generalization patterns across different group structures
608
+ - Analyze the relationship between model architecture and algebraic computation
609
+
610
+ ## Dataset Structure
611
+
612
+ The dataset is organized in three complementary ways to support different research approaches:
613
+
614
+ ### 1. Flat Organization (data/)
615
+ All 94 individual group datasets are available for direct access in a flat structure, facilitating straightforward loading and comparison across groups.
616
+
617
+ ### 2. TC⁰ Complexity Class (TC0/)
+ Contains the 72 solvable groups, whose composition problem can in theory be computed by constant-depth threshold circuits. These groups serve as positive controls on which current neural architectures should succeed.
+
+ ### 3. NC¹ Complexity Class (NC1/)
+ Contains the 22 non-solvable groups, whose composition problem is NC¹-complete and therefore needs logarithmic-depth circuits. These problems are beyond the computational capacity of TC⁰ models unless TC⁰ = NC¹.
622
+
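The config names in the YAML header follow a regular pattern per family, so they can be regenerated rather than typed by hand. The sketch below is a minimal illustration and not part of the dataset tooling: the helper names `config_names` and `data_paths` are hypothetical, and the family ranges are read off the header of this README.

```python
def config_names():
    """Enumerate the per-group config names listed in the README's YAML header."""
    names = []
    names += [f"s{n}" for n in range(3, 10)]        # symmetric S3..S9
    names += [f"a{n}" for n in range(3, 10)]        # alternating A3..A9
    names += [f"c{n}" for n in range(2, 31)]        # cyclic C2..C30
    names += [f"d{n}" for n in range(3, 21)]        # dihedral D3..D20
    names += ["q8", "q16", "q32"]                   # quaternion
    names += ["f20", "f21", "v4"]                   # Frobenius, Klein four-group
    for p, max_k in ((2, 5), (3, 4), (5, 4)):       # elementary abelian Z_p^k
        names += [f"z{p}_{k}" for k in range(1, max_k + 1)]
    names += [f"psl2_{q}" for q in (2, 3, 4, 5, 7, 8, 9, 11)]
    names += [f"psl3_{q}" for q in (2, 3, 4, 5)]
    names += ["m11", "m12"]                         # Mathieu
    return names


def data_paths(name):
    """Glob patterns for one config; the `data-*` prefix skips cache files."""
    return {split: f"data/{name}/{split}/data-*" for split in ("train", "test")}


names = config_names()
print(len(names))
print(data_paths("s5"))
```

Each name can be passed as `name=` to `load_dataset`, and the count printed here should match the number of per-group configs in the header.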
623
+ ## Usage
624
+
625
+ ### Basic Loading
626
+
627
+ ```python
+ from datasets import load_dataset
+
+ # Load specific group datasets using config names
+ s5_data = load_dataset("BeeGass/Group-Theory-Collection", name="s5")
+ a4_data = load_dataset("BeeGass/Group-Theory-Collection", name="a4")
+ m11_data = load_dataset("BeeGass/Group-Theory-Collection", name="m11")
+
+ # Alternative: Load from data directories
+ s5_data = load_dataset("BeeGass/Group-Theory-Collection", data_dir="data/s5")
+ tc0_cyclic = load_dataset("BeeGass/Group-Theory-Collection", data_dir="TC0/c10")
+ nc1_symmetric = load_dataset("BeeGass/Group-Theory-Collection", data_dir="NC1/s7")
+
+ # Access train/test splits
+ train_data = s5_data["train"]
+ test_data = s5_data["test"]
+ ```
644
+
645
+ ### Data Format
646
+
647
+ Each example contains the following fields:
648
+
649
+ ```python
+ {
+     'input_sequence': "123 456 789 ...",  # Space-separated permutation IDs (variable length)
+     'target': "234",                      # Result of composition as string
+     'sequence_length': 512,               # Length of input sequence (varies from 3 to 1024)
+     'group_degree': 7,                    # Degree of the permutation group (e.g., S7 acts on 7 elements)
+     'group_order': 5040,                  # Order (size) of the group (e.g., |S7| = 7!)
+     'group_type': "symmetric"             # Type of the group
+ }
+ ```
659
+
660
+ Note: Sequences contain a variable number of permutation IDs (uniformly distributed between 3 and 1024). The provided target is the composition of all permutations in the input sequence.
661
+
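As a minimal sketch of consuming one record in the format above, the snippet below splits `input_sequence` into integer IDs and checks the count against `sequence_length`. The record is invented for illustration and is not taken from the dataset; `parse_example` is a hypothetical helper, and it assumes that `target` holds a single decimal ID.

```python
def parse_example(example):
    """Turn one record into (list of permutation IDs, target ID)."""
    ids = [int(tok) for tok in example["input_sequence"].split()]
    if len(ids) != example["sequence_length"]:
        raise ValueError("sequence_length does not match the number of IDs")
    return ids, int(example["target"])


# Illustrative record only; the values are invented, not drawn from the data.
record = {
    "input_sequence": "12 7 30",
    "target": "5",
    "sequence_length": 3,
    "group_degree": 5,
    "group_order": 120,
    "group_type": "symmetric",
}

ids, target = parse_example(record)
print(ids, target)
```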
662
+ ### Working with Different Sequence Lengths
663
+
664
+ The dataset already contains sequences of varying lengths (3 to 1024). You can filter or analyze based on sequence length:
665
+
666
+ ```python
+ import numpy as np
+ from datasets import load_dataset
+
+ # Load full dataset
+ dataset = load_dataset("BeeGass/Group-Theory-Collection", name="s5")
+
+ # Example: Filter for specific sequence lengths
+ short_sequences = dataset['train'].filter(lambda x: x['sequence_length'] <= 32)
+ medium_sequences = dataset['train'].filter(lambda x: 32 < x['sequence_length'] <= 256)
+ long_sequences = dataset['train'].filter(lambda x: x['sequence_length'] > 256)
+
+ # Analyze sequence length distribution
+ lengths = np.array(dataset['train']['sequence_length'])
+ print(f"Min length: {lengths.min()}, Max length: {lengths.max()}")
+ print(f"Mean length: {lengths.mean():.1f}, Std: {lengths.std():.1f}")
+ ```
681
+
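One way to use these length filters is to report accuracy per length bucket, which makes length generalization visible. The sketch below assumes model predictions are already available as strings. `bucket_of` and `bucket_accuracy` are hypothetical helpers whose bucket edges mirror the filters above, and the toy records are invented.

```python
from collections import defaultdict


def bucket_of(length):
    """Map a sequence length to the short/medium/long buckets used above."""
    if length <= 32:
        return "short"
    if length <= 256:
        return "medium"
    return "long"


def bucket_accuracy(examples, predictions):
    """Exact-match accuracy of predicted targets, grouped by length bucket."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        b = bucket_of(ex["sequence_length"])
        totals[b] += 1
        hits[b] += int(pred == ex["target"])
    return {b: hits[b] / totals[b] for b in totals}


# Toy data for illustration; real inputs would come from dataset["test"].
examples = [
    {"sequence_length": 8, "target": "3"},
    {"sequence_length": 20, "target": "1"},
    {"sequence_length": 100, "target": "4"},
    {"sequence_length": 900, "target": "2"},
]
predictions = ["3", "0", "4", "5"]
acc = bucket_accuracy(examples, predictions)
print(acc)
```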
682
+ ## Group Inventory
683
+
684
+ ### TC⁰ Groups (Solvable) - 72 Groups
+
+ | Group Family | Groups | Orders | Mathematical Properties |
+ |--------------|--------|--------|------------------------|
+ | Symmetric | S3, S4 | 6, 24 | Solvable for n ≤ 4 |
+ | Alternating | A3, A4 | 3, 12 | Solvable for n ≤ 4 |
+ | Cyclic | C2-C30 (all) | 2-30 | Abelian groups |
+ | Dihedral | D3-D20 (all) | 6-40 | Symmetries of regular polygons |
+ | Klein | V4 | 4 | Smallest non-cyclic abelian group (isomorphic to Z₂²) |
+ | Quaternion | Q8, Q16, Q32 | 8, 16, 32 | Non-abelian 2-groups |
+ | Elementary Abelian | Z2^[1-5], Z3^[1-4], Z5^[1-4] | Various | Direct products of cyclic groups |
+ | Frobenius | F20, F21 | 20, 21 | Transitive permutation groups |
+ | Projective Special Linear | PSL(2,2), PSL(2,3) | 6, 12 | Solvable PSL groups |
697
+
698
+ ### NC¹ Groups (Non-Solvable) - 22 Groups
+
+ | Group Family | Groups | Orders | Mathematical Properties |
+ |--------------|--------|--------|------------------------|
+ | Symmetric | S5, S6, S7, S8, S9 | 120-362,880 | Non-solvable for n ≥ 5 |
+ | Alternating | A5, A6, A7, A8, A9 | 60-181,440 | Simple groups for n ≥ 5 |
+ | Projective Special Linear | PSL(2,4), PSL(2,5), PSL(2,7), PSL(2,8), PSL(2,9), PSL(2,11), PSL(3,2), PSL(3,3), PSL(3,4), PSL(3,5) | Various | Simple groups (PSL(2,4) ≅ A5) |
+ | Mathieu | M11, M12 | 7,920, 95,040 | Sporadic simple groups |
706
+
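The orders in both tables follow from standard closed-form formulas, which also give the entries listed as "Various". The sketch below recomputes them. The function names are hypothetical, and the PSL formulas are the usual ones: |PSL(2,q)| = q(q²−1)/gcd(2,q−1) and |PSL(3,q)| = q³(q²−1)(q³−1)/gcd(3,q−1).

```python
from math import factorial, gcd


def order_symmetric(n):
    return factorial(n)                      # |S_n| = n!


def order_alternating(n):
    return factorial(n) // 2                 # |A_n| = n!/2 for n >= 2


def order_dihedral(n):
    return 2 * n                             # |D_n| = 2n (symmetries of an n-gon)


def order_elementary_abelian(p, k):
    return p ** k                            # |Z_p^k| = p^k


def order_psl2(q):
    return q * (q * q - 1) // gcd(2, q - 1)


def order_psl3(q):
    return q**3 * (q**2 - 1) * (q**3 - 1) // gcd(3, q - 1)


print([order_symmetric(n) for n in range(5, 10)])
print([order_alternating(n) for n in range(5, 10)])
print({q: order_psl2(q) for q in (2, 3, 4, 5, 7, 8, 9, 11)})
print({q: order_psl3(q) for q in (2, 3, 4, 5)})
```

For example, `order_psl2(4)` and `order_alternating(5)` both give 60, consistent with the isomorphism PSL(2,4) ≅ A5 noted in the table.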
707
+ ## Technical Specifications
708
+
709
+ ### Permutation Representation
710
+ - Each permutation is assigned a unique integer identifier within its group
711
+ - Mappings between IDs and permutation arrays are consistent across train/test splits
712
+ - Permutation composition follows right-to-left convention (standard in mathematics)
713
+
714
+ ### Dataset Statistics
715
+ - **Train/Test Split**: 80/20 ratio for all groups
716
+ - **Sequence Lengths**: Variable lengths from 3 to 1024 permutations per example
717
+ - **File Format**: Apache Arrow for efficient data loading and memory mapping
718
+ - **Total Size**: Varies by group order and maximum sequence length
719
+
720
+ ### Composition Convention
721
+ For an input sequence [p₁, p₂, p₃], the target is computed as:
722
+ - Mathematical notation: p₃ ∘ p₂ ∘ p₁
723
+ - Operational interpretation: First apply p₁, then p₂, then p₃
724
+
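The convention can be made concrete with permutations stored as tuples, where `p[i]` is the image of `i`. This is a minimal sketch under that assumption: the dataset itself stores integer IDs, and its ID-to-permutation mapping is not reproduced here. `compose` and `compose_sequence` are hypothetical names.

```python
from functools import reduce


def compose(later, earlier):
    """Return later ∘ earlier: apply `earlier` first, then `later`."""
    return tuple(later[earlier[i]] for i in range(len(earlier)))


def compose_sequence(perms):
    """Target for [p1, p2, ..., pk] is pk ∘ ... ∘ p2 ∘ p1."""
    return reduce(lambda acc, p: compose(p, acc), perms)


# Three permutations of {0, 1, 2} in one-line (image) notation.
p1 = (1, 0, 2)   # swap 0 and 1
p2 = (0, 2, 1)   # swap 1 and 2
p3 = (2, 1, 0)   # swap 0 and 2

target = compose_sequence([p1, p2, p3])
print(target)
```

Order matters in non-abelian groups: here `compose_sequence([p1, p2])` and `compose_sequence([p2, p1])` give different permutations.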
725
+ ## Dataset Generation
726
+
727
+ The code used to generate this dataset is available at [https://github.com/BeeGass/Group-Dataset-Generator](https://github.com/BeeGass/Group-Dataset-Generator). The repository includes:
728
+
729
+ - Complete implementation of all permutation groups
730
+ - Dataset generation scripts with configurable parameters
731
+ - Verification and testing utilities
732
+ - Documentation for extending the dataset with additional groups
733
+
734
+ ## Research Applications
735
+
736
+ This dataset supports various research directions:
737
+
738
+ 1. **Computational Complexity Theory**: Empirical validation of TC⁰/NC¹ separation in neural networks
739
+ 2. **State-Space Model Analysis**: Testing fundamental limitations of linear recurrent architectures
740
+ 3. **Transformer Architecture Studies**: Investigating attention mechanism constraints
741
+ 4. **Mathematical Reasoning**: Benchmarking symbolic manipulation capabilities
742
+ 5. **Generalization Studies**: Cross-length and cross-group generalization patterns
743
+ 6. **Representation Learning**: Understanding how models encode algebraic structures
744
+
745
+ ## Citation
746
+
747
+ When using this dataset in academic work, please cite:
748
+
749
+ ```bibtex
+ @dataset{gass2024permutation,
+   author = {Gass, Bryan},
+   title = {Group Theory Collection},
+   year = {2024},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/datasets/BeeGass/Group-Theory-Collection},
+   note = {Organized by computational complexity classes (TC⁰/NC¹)}
+ }
+
+ @software{gass2024generator,
+   author = {Gass, Bryan},
+   title = {Group Dataset Generator},
+   year = {2024},
+   url = {https://github.com/BeeGass/Group-Dataset-Generator}
+ }
+
+ @article{merrill2024illusion,
+   title = {The Illusion of State in State-Space Models},
+   author = {Merrill, William and Petty, Jackson and Sabharwal, Ashish},
+   journal = {arXiv preprint arXiv:2404.08819},
+   year = {2024}
+ }
+ ```
773
+
774
+ ## Acknowledgments
775
+
776
+ This dataset was inspired by the theoretical work of William Merrill and colleagues on "The Illusion of State in State-Space Models" (arXiv:2404.08819), which establishes fundamental computational limitations of state-space models through group-theoretic analysis.
777
+
778
+ ## License
779
+
780
+ This dataset is released under the MIT License.
781
+
782
+ ## Contact
783
+
784
+ For questions, issues, or contributions, please use the Hugging Face dataset repository's discussion forum or contact Bryan Gass directly.