What on earth did you guys do
Bro.. what's this?
Before I dove deeper into it, when I heard about this release I was extremely skeptical (naturally and reasonably so, because how on Earth is an 8B going to go toe-to-toe with models that're infinitely larger than itself on so many different benchmarks, especially considering that the architectural changes aren't that different from your average LLM). But then I read the technical paper, the verified sources, the benchmarks (and especially those that were able to be verified here on HuggingFace), and from the looks of things.. this shit is actually legit.
Like, I always knew that MoEs were the future and are clearly much better in every way compared to dense models, but I didn't know we would be reaching this level of performance so early on (as in much smaller model beating much bigger model). A while ago I was ecstatic over the invention of different architectures like HRMs (Hierarchical Reasoning Models), TRMs (Tiny Recursive Models), and DLMs (Diffusion Language Models), but got quickly disappointed once I saw that they never went anywhere and quickly faded into obscurity.
This however is actually insane and is a literal game-changer (the idea that such drastic changes stemmed from a better and more logical training regimen, more parameters in the MoE router... etc). I wanna congratulate the company and the entire development team because how can you release a monster like this and not receive any praise? The industry needs more people like you, you guys are clearly making history and I strive to become like you!
A little reality check. This model is grossly overfit to a handful of domains. For example, despite having 8b parameters this model has notably less broad knowledge than 1b models like Llama 1b. Not only that, its instruction following is also worse than 1b models, such as not listing the actor and character names as requested below.
I'll use the most watched TV shows of all time (Friends and The Big Bang Theory) to make my point. Even tiny models get this one right, but this model not only vomits hallucinations, but didn't follow basic instructions (only listed the character names, not actors).
The prompts are:
"What are the 6 main characters, and the actors who portrayed them, on the TV show Friends? Don't add details, just list them. And what year did the show first air?"
"What are the six main characters, and the actors who played them, on the show The Big Bang Theory? Don't add details, just list them. And what year did the show first air?"
The responses are:
"Joey Ribisi, Monica Geller, Chandler Bing, Phoebe Buffay, Rachel Green, and Ross Geller.
Friends premiered on September 22, 1994."
"Six main characters:
Stephen Hawking
Sheldon Cooper
Howard Wolowitz
Bernadette Rostenkowski
Jimmy LeBlanc
Andy Packing
First air year: 2007"
Only listing the character names despite my clearly requesting the actors who portrayed them is an astonishingly boneheaded instruction following error. I've tested well over 100 models, and even the horrendously bad 1st gen Falcon 7b didn't make this mistake, not to mention 1b models.
And Stephen Hawking, Jimmy LeBlanc, Andy Packing, and Joey Ribisi are absurd hallucinations.
This isn't an AI model. It's just an overfit tool for a handful of domains. No model with 8 BILLION parameters should have less broad knowledge and worse instruction following than Falcon 7b and modern 1b models.
The reality check is that I don't really care about the Big Bang Theory and these little series ngl. Another reality check is that if you just read the papers and the model's page (second paragraph) you'll literally see the following statement: "ZAYA1-8B excels at detailed long-form reasoning especially for mathematical and coding task".
Let's repeat it again for emphasis:
ZAYA1-8B excels at detailed long-form reasoning especially for mathematical and coding task
- First sentence, second paragraph of this model's card
^What further proves this is that if you look at the model card again and check the benchmarks that were used to display its performance, you'll realize that the large majority of them are coding, math, scientific knowledge related (only 2 benchmarks are for creative writing and whatever). So clearly, it was made for the purpose that its authors outlined.
So essentially this whole situation is: A very small model that was specifically trained to be good on fields of logic and reasoning like math, coding and science (and miraculously to the point that it even reaches absurdly large models levels of performance, even beating MiMo V2.5 at some benchmarks, a model with 1 trillion parameters btw) isn't good at something that it wasn't specifically trained for. Color me surprised 🙄
The whole point of this release is to demonstrate the insane feat that a very small model like this can match extremely big models (again, picking the MiMo example, Zaya scored 74.2 on GPQA, MiMo V2.5 Pro scored 68.8. Going by the 1 trillion figure, the size difference between these 2 models is a factor of at least 125, since 1T / 8B = 125), something that was considered impossible ever since the beginning of the AI revolution in late 2022, something ridiculously unimaginable and crazy to think about.
So my point is, it's like someone buying a very small gun that was specifically made for close-range combat, and here you are complaining that you can't shoot 2km with it (like bro, it was never made for that). If you really want it to be broad in its knowledge, I recommend you grab its weights (the base weights are available) and finetune it to your heart's content on whatever Big Bang datasets you wanna train it on. This is clearly an AI model, and for what it was made for (again, reasoning and logic, broad knowledge was never intended), it's pretty damn good and effective (I'd say Godlike even).
@amarosnithe I'll make this very simple so even you can understand it. If a model repeatedly only lists characters when clearly instructed to also list the actors who portrayed them, then the model isn't capable of even a drop of generalized reasoning, let alone long form reasoning.
An 8b model might sound small, but that's 8 BILLION parameters, and the resultant model is far larger than a core database of the world's most popular knowledge. So to not only vomit hallucinations about the main casts of the top 50 most watched TV shows in human history, but to also repeatedly fail to list the actors as instructed, is not OK.
Primarily what's happening is that they trained on countless examples from only a handful of domains, hence increasing the odds that a pre-packaged nearest match from the training data roughly aligns with the user's prompt. And the flood of thinking tokens primarily serves to mitigate retrieval errors of said pre-packaged nearest matches.
This simply isn't an AI model. At most it's an AI tool. If an LLM doesn't even have superficial broad popular knowledge, the ability to follow simple instructions, general popular abilities (e.g. story writing) and so on then it's just a grossly overfit tool. Any model maker can trade broad knowledge and abilities to boost performance in select domains at a given parameter count. That's just overfitting. The real challenge is retaining broad knowledge and abilities while improving things like reasoning, coding, and math.
This is the funniest and most bizarre level of self-entitlement, delusion and arrogance I've ever seen in this website yet. Congratulations, seems like you're making history in your way I guess.
@amarosnithe I'll make this very simple so even you can understand it.
Please do your majesty.
If a model repeatedly only lists characters when clearly instructed to also list the actors who portrayed them, then the model isn't capable of even a drop of generalized reasoning, let alone long form reasoning.
The only criticism you have here that's valid is that it was unable to list actors alongside characters when it was clearly capable of doing so, meaning that it ignored your instructions and simply focused on the characters (which, fair, that sucks, it should've paid attention to the whole request instead of just a snippet). But EVEN IF it also listed actors, you'd move your goalpost and complain about the actors being wrong as well (which they inherently would be, the model wasn't trained on TV data... so it will hallucinate. 🤯).
This is just yet another instance of you confusing generalization with specialization. Generalized knowledge was never the point. It's like talking shit about a calculator because it can't tell you who Barack Obama was. But it can do some pretty amazing math for you if you want, pretty neat no? No? Guess not, knowing Barack Obama is truly the metric of all intelligence.
Are you also gonna go around the platform and cry about how some medical models suck because they're unable to tell you who said the word "Bazinga" or some shit? I guess Llama 3.1 sucks too now because this outdated ass model with a knowledge cutoff of 2023 can't reference this one obscure character from this one show from 2026 or some shit. Like, what are we doing here man?
An 8b model might sound small, but that's 8 BILLION parameters, and the resultant model is far larger than a core database of the world's most popular knowledge. So to not only vomit hallucinations about the main casts of the top 50 most watched TV shows in human history, but to also repeatedly fail to list the actors as instructed, is not OK.
"Errmm.. if the model who was clearly not made for this can't even list the names of heckin characters I like then its not ok!!!".
It's 8 BILLION parameters versus HUNDREDS of billions, EVEN TRILLIONS of parameters. That's REALLY fucking small in comparison, so small in fact that 8B isn't even enough to cover the active parameters of some of the models that it clearly surpasses. So small in fact that you can easily run this on 12GB hardware at 8-bit precision (and 6GB at fp4.. and even these estimates are inaccurate because sure, you have to load the entire model into memory somehow, but its active parameter count is only 700M, so you can keep the rest on your SSD or some shit and have the inference only consume A SINGLE GB of your GPU at fp8, that's just how efficient and small this is in comparison, I don't need 5 different datacenters to run a trillion parameter model just to get a similar level of expertise in the stuff I need it for. You can run a single GB on a PHONE, maybe even a smartwatch if you're just that fancy).
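For what it's worth, the memory figures above can be sanity-checked with a back-of-the-envelope sketch. It counts weights only (no KV cache, activations or runtime overhead), takes the 8B total and 700M active figures from this thread as given, and the function name is made up for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 8e9     # "8B" total parameters, as stated in the thread
ACTIVE_PARAMS = 0.7e9  # "700M" active per token, as claimed above (assumption)

for label, bits in [("8-bit", 8), ("4-bit", 4)]:
    total = weight_memory_gb(TOTAL_PARAMS, bits)
    active = weight_memory_gb(ACTIVE_PARAMS, bits)
    print(f"{label}: all weights ~{total:.1f} GB, active per token ~{active:.2f} GB")
```

So the weights come to roughly 8 GB at 8-bit and 4 GB at 4-bit, which is where the 12GB and 6GB hardware estimates leave some headroom. Whether streaming the inactive experts from disk is fast enough in practice is a separate question this sketch doesn't answer.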
This is also the last time I'm gonna repeat myself, you're making a big deal about it being unable to list your precious sitcoms when it clearly wasn't made for that, and you're using that as a metric to judge the model's worth. Which is incredibly childish and narrow-minded. "Omg the science model doesn't know TV references!! This is clearly the END OF THE WORLD!!!"
Primarily what's happening is that they trained on countless examples from only a handful of domains, hence increasing the odds that a pre-packaged nearest match from the training data roughly aligns with the user's prompt. And the flood of thinking tokens primarily serves to mitigate retrieval errors of said pre-packaged nearest matches.
Translation: They trained it on some stuff. But not on the stuff I like.
Yes, that's correct. They focused on the stuff they're personally interested in, as in maths, code and science. It doesn't mean the model is terrible or not intelligent for that, because the model excels at what it was made to do, so it's intelligent.
Definition of "intelligence" from the Cambridge Dictionary:
"the ability to learn, understand, and make judgments or have opinions that are based on reason"
The model learned what he was supposed to, understood it, and made judgements based on reason (what he was explicitly trained on). If I make a robot that can bake cookies, and another robot that can make cocaine, its nonsensical to call the one who bakes cookies "dumb" just because it can't make cocaine (something that requires an even greater amount of steps, thinking and precision), and vice-versa. The robot who makes cookies excels at it, so its competent and intelligent. It excels at what he was built for, so its good. I don't know what else to tell you here, the robot who makes cocaine has to pour gasoline onto the mix of coca leaves very delicately. You can't expect the cookie robot to do the same just because it knows the concept of "pouring things", its going to pour all the gasoline at once and ruin the leaves because thats how it pours chocolate onto cookiedough, its intelligence is misused in this situation.
This simply isn't an AI model. At most it's an AI tool. If an LLM doesn't even have superficial broad popular knowledge, the ability to follow simple instructions, general popular abilities (e.g. story writing) and so on then it's just a grossly overfit tool. Any model maker can trade broad knowledge and abilities to boost performance in select domains at a given parameter count. That's just overfitting. The real challenge is retaining broad knowledge and abilities while improving things like reasoning, coding, and math.
And there it is. You have no idea how insanely idiotic this mindset is. An AI model is a machine, a cluster of neurons, a mechanical brain (hence "Artificial"), that does stuff on its own (and is considered smart for it, some would use the word "intelligence" to describe this phenomenon even). Anything that is a cluster of neurons that autonomously does stuff, is an AI model. An AI model can clearly suck and be wrong, but its still an AI model. By your insane logic, you're dismissing the largest portion of all LLMs in the entire world as "not an AI model" because you say so. The model made for medicine who doesn't know how to do story writing is clearly not an AI model you guys. The math model who earned 50 different prizes in every single prestigious university in the world and owned all the human chuds is not an AI model because it can't tell me who Anne Hathaway is you guys. The vision robotics model who can see shit, and was made to recognize objects but not reference me some Marvel movie quote is not an AI model you guys... these are just "grossly overfit tools" unless they have broad knowledge and are clearly capable of referencing me the funni sitcom because I say so and I'm the world authority in what constitutes an "AI model" (in spite of what the entire field Machine Learning says, because I'm just that super duper awesome sauce, I'm the coolest fish in the aquarium, I'm THAT cat my g).
Fuck Alan Turing, the Turing test? A fraud. Listing the 50 TV shows is how you measure REAL intelligence, take notes libtards. 😎
@amarosnithe The simple fact is there isn't a drop of added intelligence as they improved domains like coding, math and reasoning, or when adding "thinking" tokens. It's still 100% knowledge (pattern matching).
An intelligent human can be given the rules (e.g. coding or math) and only a handful of examples and figure out how to solve novel problems, write novel poems, and so on. That's intelligence (understanding and extrapolation).
In contrast, AI models have absolutely no intelligence and can't solve any math, coding, logic... problem type, no matter how simple, that aren't in the training data. This is why they're trained on trillions of coding, math, and reason tokens. They need to recall the solutions to every supported coding, math, logic... problem and make superficial modifications (e.g. change the variables) to match the user's prompt.
So until a breakthrough is made LLMs remain 100% about pattern matching (knowledge) using statistical next token prediction.
Consequently, the decision to trade broad abilities and more than an order of magnitude of broad knowledge for incremental gains in a handful of domains isn't, in my opinion, remotely reasonable. Regardless, the resultant AI model is nothing but a tool. Gold medal human mathematicians don't struggle to hold basic functional conversations with other humans, write stories, end sentences with a given word, and so on. Humans are broadly capable and adaptable. Your criticisms and examples (e.g. calling someone who bakes cookies dumb) shows that you don't grasp this distinction. An AI model NEEDS to have some semblance of balance. If it can only do a handful of things, but is otherwise an incoherent hallucination generator, then it's just a tool.
@amarosnithe The simple fact is there isn't a drop of added intelligence as they improved domains like coding, math and reasoning, or when adding "thinking" tokens. It's still 100% knowledge (pattern matching).
They literally didn't need to "improve" upon anything bro, did you even read the paper or see an overview of how they trained this thing? Here, let me quote some interesting snippets for you:
https://arxiv.org/abs/2605.05365
"With under 1B active parameters, ZAYA1-8B matches or exceeds DeepSeek-R1-0528 on several challenging mathematics and coding benchmarks, and remains competitive with substantially larger open-weight reasoning models"
^Another example showing that it was explicitly made with maths and coding in mind, and it surpasses autistically larger models too. This is good, this is fucking excellent, I praised them for it, and here you are bitching and moaning about it not knowing Big Bang Theory. Jesus fucking christ man.
"ZAYA1-8B was trained from scratch for reasoning, with reasoning data included from pretraining onward using an answer-preserving trimming scheme"
^Self-explanatory, they used a cutting-edge data trimming system that preserves answers in reasoning data and doesn't get convoluted like a flood of random data. (It was made for reasoning, it reasons, it thinks, the Post-Training RL Pipeline also reflects this).
"Post-training uses a four-stage RL cascade: reasoning warmup on math and puzzles; a 400-task RLVE-Gym curriculum; math and code RL with test-time compute traces and synthetic code environments built from competitive-programming references; and behavioral RL for chat and instruction following."
Would you look at that, more math, reasoning and code (and even puzzles!). As you can clearly see, the model was built from the start to be a math & code specialist, and only at the final step did they use chat and instruction following for it to learn how to speak and behave. They didn't improve any of its core domains (and didn't need to) because THE MODEL IS NATURALLY BUILT FOR THAT. That's its purpose, its raison d'être. You simply don't know what you're talking about (hence why you're pulling out of your ass that models need to be generalists to be considered AIs. An AI is a cluster of neurons that does stuff, as I explained. You can disagree with the entire field of Machine Learning if you want but that's just coping, you won't convince anyone else with this bedtime story).
An intelligent human can be given the rules (e.g. coding or math) and only a handful of examples and figure out how to solve novel problems, write novel poems, and so on. That's intelligence (understanding and extrapolation).
In contrast, AI models have absolutely no intelligence and can't solve any math, coding, logic... problem type, no matter how simple, that aren't in the training data. This is why they're trained on trillions of coding, math, and reason tokens. They need to recall the solutions to every supported coding, math, logic... problem and make superficial modifications (e.g. change the variables) to match the user's prompt.
So until a breakthrough is made LLMs remain 100% about pattern matching (knowledge) using statistical next token prediction.
Wrong again. I can cite many examples of different AI MODELS and architectures that proved their models clearly reason and aren't just regurgitating blind patterns. A key example is how HRM (Hierarchical Reasoning Model) exhibited the impressive feat of a microscopic model (27M) surpassing the mainstream ones (again, in the hundreds of billions-to-trillions of parameters) in many tasks and benchmarks like puzzles, sudoku and math, despite the fact that it was trained using only 1000 examples. I'll ask you this, how was this model much better at doing puzzles and math than much bigger models? Did it require an even higher amount of data/knowledge to surpass them? The answer is obviously no because again, it did this with only a thousand examples. When this model released there was a whole controversy that happened because it was proven to have some data contamination going on, but even after the devs fixed it, it still maintained an extremely high accuracy that's completely abnormal for a model of that size. TRM (Tiny Recursive Model) expanded upon this, Samsung acknowledged the paper and made it even better, you don't need to feed it every single multiplication in history for it to learn their results, yes, math is harder for AI compared to other topics, but it can do math quite well by itself (and with limited examples).
Besides, now you're just arguing over semantics. "Le humans can actually think better than current AIs, therefore AI not intelligent!!" (and like, when the fuck did we pivot this conversation to humans? Wasn't this argument about this model in particular and how "AI dumb unless it kno Bazinga man 🤤" or some shit? lol)
Consequently, the decision to trade broad abilities and more than an order of magnitude of broad knowledge for incremental gains in a handful of domains isn't, in my opinion, remotely reasonable.
Speaking out of your ass again because they didn't trade shit, they never cared about broad knowledge and the model never had broad abilities to begin with (this isn't 2022, you don't need to train a model on trillions of tokens about all the topics in existence anymore just for it to be good at generating math). It was built for math and code, and it's good at math and code. It fulfilled its purpose quite successfully (so much so that the "incremental gains" consist of it rivaling models in the trillions of parameters despite being absolutely nothing size-wise, good job undermining their achievements).
Regardless, the resultant AI model is nothing but a tool. Gold medal human mathematicians don't struggle to hold basic functional conversations with other humans, write stories, end sentences with a given word, and so on. Humans are broadly capable and adaptable. Your criticisms and examples (e.g. calling someone who bakes cookies dumb) shows that you don't grasp this distinction. An AI model NEEDS to have some semblance of balance. If it can only do a handful of things, but is otherwise an incoherent hallucination generator, then it's just a tool.
Cool story bro, and all of this because this model can't tell you who licked Sheldon's big black smelly bootycheeks in that one episode. You're clearly the incarnation of Yann LeCun and are the world's authority in what constitutes an "AI" or not because "uhh humans something something and uhh this model not Bazinga o algo".
@amarosnithe If you had valid points you wouldn't keep using absurd and vulgar red herring arguments involving things like Sheldon's smelly bootycheecks. I only provided examples to make a point, and they asked for the main casts of the most viewed TV shows in human history, and not what happened in a particular episode, which is an entirely different matter.
And the real irony here is that I'm advocating for broad abilities and balance, while you're advocating for targeted gains in specific domains, yet you keep accusing me of wanting AI for a specific thing. I genuinely want the AI to remain at least as good at coding, math, STEM knowledge... as at story writing, pop culture knowledge...
Anyways, this conversation is going nowhere, but I would still like to focus on one key point. Namely, looping. No human would get stuck in an infinite loop, yet these so called thinking/COT models frequently do. If they had even an inkling of generalized reasoning or awareness this wouldn't happen. It happens because current LLMs are 100% about moment to moment pattern matching (knowledge). So when strong contradicting patterns are matched, or a pattern is too strong (obsessively trained), looping will occur.
You can belittle other domains all you want, but the truth is coding and math were chosen for overfitting because they're the lowest hanging fruit and provide the most reward (money, prestige...) when overfit. For example, it's harder to train an LLM to write an eloquent on topic poem that adheres to rhyme and meter than code blocks and math solutions.
@amarosnithe You titled this thread "What on earth did you guys do", phil111 told you exactly what they did.
It's amusing to see you progressively get more offended as you fail to understand how models are trained, and how easy it is to make them excel at the tasks they've been trained to excel at in benchmarks, even when they actually fall flat on real life tasks. How many times are you going to fall for the old "it's smarter than the latest GPT/Claude!" claim?
Did he? Because all I got was "Humans think better so this is dumb, and this is dumb because it can't list the Big Bang Theory cast, and this is dumb because AI is never intelligent unless I say so". Like ok bro, whatever you say lmfao (and he has the audacity to claim I'm the one putting some "red herring" in the discussion, when he talked about several different things at once).
I actually read the papers (he clearly didn't, like what was bro yapping about when he said they "traded off broader knowledge for this"), and was genuinely impressed by the work that was put into this (a model that does what its supposed to do). I made this post to praise the developers, not to have some edgy losers trying to tell me the reasons why this model akshually sucks and why I'm wrong. The model was made for math, coding, science and reasoning, its good at math, coding, science and reasoning (and despite its size it does it better than mainstream models), so I'm going to praise the development team for such a feat. Is it that hard to understand this? Are you guys 15th century bots or some shit? Like damn.
Also, quite mighty generous of you to equate "Talking about Big Bang Theory" as a "real life task". You go girl 👏
Your question was "What on earth did you guys do", this is the very first paragraph of his very first response:
A little reality check. This model is grossly overfit to a handful of domains. For example, despite having 8b parameters this model has notably less broad knowledge than 1b models like Llama 1b. Not only that, its instruction following is also worse than 1b models, such as not listing the actor and character names as requested below.
There's your answer: overfit. You are so offended at the suggestion of the model not being able to recall the Friends' cast that you are failing to understand the point he is making of the model being overfit. You getting so emotional about it was a very sweet treat I wasn't expecting to read though, thanks for the laugh!
Your question was "What on earth did you guys do", this is the very first paragraph of his very first response:
Did bro really take "What on earth did you guys do" this seriously? It's a RHETORICAL QUESTION LMFAO, ITS FOR DRAMATIC EFFECT (I'M NOT LITERALLY ASKING THESE PEOPLE WHAT THIS IS), THERE'S NO WAY I'M DEALING WITH GROWN ASS MEN WHO DON'T UNDERSTAND SECOND-DEGREE HAHAHAHAHA.
just reading the beginning of this post makes it obvious btw, I wrote "Bro.. what's this?" (is that how we begin normal questions nowadays😭), before writing a paragraph detailing how I reacted to this model existence, and another paragraph praising the development team, deadass, what ARE WE DOING here 😭. Its like when MiniCPM released and I made a post praising the development team due to how monstrous that model is (its the whole point, I like praising people, and you got butthurt because I genuinely don't give a fuck about what you 2 are yapping about 😭).
There's your answer: overfit. You are so offended at the suggestion of the model not being able to recall the Friends' cast that you are failing to understand the point he is making of the model being overfit.
Incorrect. If you read the model's paper you'll find that overfitting has nothing to do with it, simply because the model's architecture is unlike a normal Transformer. These guys used a very specific and innovative training regimen strictly focused on the domains the model was trained for: maths, coding, science and reasoning. This is the reason why DarkSydePhil over there is genuinely speaking out of his ass. He didn't read the paper (and didn't even read the model's card lmfao), which clearly outlines the entire training process that this model has been through. These developers used a completely different method of training (especially Post-Training RL) that emphasized specialization (again, general knowledge was never the point, so expecting it to talk about Big Bang Theory is idiotic since it simply wasn't trained on that). Hence everything he said is completely irrelevant. A "mentally deficient" amoeba can grasp this.
You getting so emotional about it was a very sweet treat I wasn't expecting to read though, thanks for the laugh!
Whatever you say "Wolf Lover". Just make sure you learn what a "rhetorical question" is before you go. This is a concept that little kids in middle school already know about!
@amarosnithe If you had valid points you wouldn't keep using absurd and vulgar red herring arguments involving things like Sheldon's smelly bootycheecks. I only provided examples to make a point, and they asked for the main casts of the most viewed TV shows in human history, and not what happened in a particular episode, which is an entirely different matter.
And the real irony here is that I'm advocating for broad abilities and balance, while you're advocating for targeted gains in specific domains, yet you keep accusing me of wanting AI for a specific thing. I genuinely want the AI to remain at least as good at coding, math, STEM knowledge... as at story writing, pop culture knowledge...
Anyways, this conversation is going nowhere, but I would still like to focus on one key point. Namely, looping. No human would get stuck in an infinite loop, yet these so-called thinking/CoT models frequently do. If they had even an inkling of generalized reasoning or awareness this wouldn't happen. It happens because current LLMs are 100% about moment-to-moment pattern matching (knowledge). So when strong contradicting patterns are matched, or a pattern is too strong (obsessively trained), looping will occur.
You can belittle other domains all you want, but the truth is coding and math were chosen for overfitting because they're the lowest-hanging fruit and provide the most reward (money, prestige...) when overfit. For example, it's harder to train an LLM to write an eloquent, on-topic poem that adheres to rhyme and meter than to write code blocks and math solutions.
Don't know how anyone could take this guy seriously at this point, but one thing is quite sure: if he commented on the model, it's probably quite capable haha
All of what he wants could be achieved with a proper RAG system. Baking knowledge like that into the model is just a complete waste of model space.
What if Charlie Sheen ever changes his legal name?
The model would hallucinate the wrong old name for eternity, and you would call that "intelligent"?
No thanks, I'd rather have models capable of retrieving such information reliably (actual current information) rather than regurgitating static knowledge.
And as for the poems part, well, I'd rather leave that to the humans. Who is going to read tons of AI-generated poetry? Who would benefit from that? AI could potentially generate more poetry than any human could ever read, effectively turning it into "slop", just like anyone can create "slop" software now. Do you really want that? Do you even enjoy poetry? Or do you just want to watch the world burn?
Also on an additional note:
Ask the model for any specific implementation details on any code library or to truthfully replicate it.
It will get the general gist and then fail on the details, which is, wait for it...
Kind of the same as the model being able to recall what a show/series is about but getting hung up on the details.
So to say it's grossly overfit is just plain wrong. You are just experiencing, in a different domain, what "coders" also experience.
Which is why once again...
It is important not to train the models on too much static knowledge, and rather to train them to retrieve this information dynamically :)
@ztsvvstz I certainly see the benefit of RAG in some circumstances, but you're failing to comprehend the major disadvantages of RAG across most tasks other than simple Q&A and a couple others.
Take story writing for example. You can't write a meaningful story about a topic you know little to nothing about, and using RAG to attain the knowledge fails for several reasons, including breaking the organic flow of the story.
For example, (1) there's a huge latency as the relevant information is gathered, (2) there's reliance on network availability and server uptime to retrieve the information, which often isn't available while traveling etc., (3) there's filling up the context with said information, which consumes a ton of RAM and slows things down considerably, (4) there's a huge break in the organic flow of the story as the random and unstructured info thrown into the context is forced into the story, and so on.
And these same problems (latency, slowdown, spike in resources, break of organic flow...) persist across all complex tasks.
Plus I should reiterate that at this point LLMs are 100% about knowledge (pattern matching), which includes coding and math, hence the countless trillions of coding and math examples (far more than any other domain) used to train the models, and the failure to handle any coding or math problem type that isn't included in the training data. Also, the random STEM knowledge, including arbitrary names of people like little-known scientists, has gone up per parameter count since the Llama 3 / Gemma 2 days, so the knowledge is not only being retained, but increased, within select overfit domains. RAG would work just as well, if not better, for STEM knowledge (rarely worked into organic outputs like a story), but model makers don't want a <50 MMLU score sans RAG, so they deliberately threw out much more popular knowledge (e.g. pop culture) to free up room for testmaxxing STEM knowledge.
It really appears that at the heart of the open source AI overfitting problem is the overabundance of autistic coding nerds in the early adoption community who just want even more autistic tools to automate their workflows, and see all other AI abilities as pointless, hence respond with condescending non-viable dismissals ('let them eat RAG'), red herrings, and so on. I really don't have an issue with autism, especially since a disproportionate number of scientific and technological discoveries were made by autists. But in this context the shared cognitive blind spots of autistic coders, coupled with their narrow-minded obsessions, are pressuring AI companies to create STEM tools rather than general purpose and balanced AI models.
Another retard appears talking shit out of his ass (congratulations, you know what RAG is and complain about "slop" like every soyboy on the internet nowadays). It's genuinely baffling how you crayon sniffers cannot grasp the concept of a model SPECIALIZING in a topic, rather than Overfitting (reading the paper like I mentioned would clear up all misunderstandings, but since I'm dealing with grown ass men who like playing pretend doctors, I guess this will never happen, and I shouldn't take you seriously for it).
I also learned my lesson: HuggingFace is a cool website with a lot of cool people, but like any social media it's going to have its own pseudo-intellectuals who genuinely cannot accept being wrong, or don't have the mental capacity to understand what a "rhetorical question" is. Therefore, if I leave this topic open it will attract even greater and bigger retards who like asserting shit and claiming to know things they clearly don't (just like a moth to a flame), since clearly I cannot praise a model anymore without having to hear some of the most ridiculous claims from the lowest common denominator of mankind.
