arxiv:2510.08404

Single layer tiny Co^4 outpaces GPT-2 and GPT-BERT

Published on Oct 9, 2025

AI-generated summary

A single-layer, two-head Co^4 machine with 8M parameters outperforms GPT-2 and GPT-BERT in the BabyLM Challenge, demonstrating superior training efficiency and strong zero-shot and fine-tuning performance on SuperGLUE tasks.

Abstract

We show that a tiny Co^4 machine (Adeel, 2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of O(N) (where N is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M parameters, 12 layers, O(N^2)) and GPT-BERT (30M parameters, 12 layers, O(N^2)) in just two epochs, while both baselines are trained for ten. Co^4 achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample-efficient pretraining. Using the BabyLM Challenge evaluation pipeline across complex benchmarks, Co^4 exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co^4 outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT on 4 out of 7 metrics in both settings. These results suggest the need to rethink prevailing deep learning paradigms and their associated scaling laws.
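To make the stated complexity gap concrete, the following is a minimal, illustrative Python sketch comparing a generic per-sequence cost that grows linearly in the number of tokens (as claimed for Co^4) with one that grows quadratically (as in standard self-attention). It is not the paper's actual FLOP accounting, the constant factors are placeholders, and Co^4's internal mechanism is described in Adeel (2025), not here.

```python
# Illustrative only: back-of-the-envelope comparison of O(N) vs. O(N^2)
# per-sequence cost. Constant factors are arbitrary placeholders.

def linear_cost(n_tokens: int, c: float = 1.0) -> float:
    """Cost growing linearly with the number of input tokens (O(N))."""
    return c * n_tokens

def quadratic_cost(n_tokens: int, c: float = 1.0) -> float:
    """Cost growing quadratically, as in standard self-attention (O(N^2))."""
    return c * n_tokens ** 2

for n in (128, 512, 2048, 8192):
    ratio = quadratic_cost(n) / linear_cost(n)
    print(f"N={n:>5}: O(N^2)/O(N) cost ratio = {ratio:,.0f}x")
```

With equal constants, the ratio is simply N, so the quadratic baseline becomes relatively more expensive as sequences grow; the paper's reported training-efficiency gains also depend on depth, parameter count, and number of epochs, which this sketch ignores.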
