I think everyone should read the post from ARC-AGI organisers about HRM carefully: https://arcprize.org/blog/hrm-analysis
With the same data augmentation / 'test-time training' setting, vanilla Transformers do pretty well, close to the "breakthrough" results HRM reported. From a brief skim, this paper uses similar settings to compare itself on ARC-AGI (roughly sketched below).
I, too, want to believe in smaller models with excellent reasoning performance. But first understand what ARC-AGI tests for, what the general setting is -- the one commercial LLMs use to compare against each other -- and what specialised setting HRM and this paper use as their evaluation.
The naming of that benchmark lends itself to hype, as we've seen in both HRM and this paper.
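For anyone who hasn't seen that setting: it's heavy data augmentation at training time plus test-time augmentation with voting over the de-augmented predictions. A minimal sketch of the general idea in Python, with my own names and simplifications rather than the authors' pipeline:

    import numpy as np
    from collections import Counter

    def dihedral(grid, k):
        # one of the 8 rotation/reflection symmetries of a grid
        g = np.rot90(grid, k % 4)
        return np.fliplr(g) if k >= 4 else g

    def augment(pairs, n_color_perms=4, rng=np.random.default_rng(0)):
        # blow up the ~1000 training examples with symmetry + colour-permutation copies
        out = []
        for k in range(8):
            for _ in range(n_color_perms):
                perm = rng.permutation(10)  # ARC grids use colours 0-9
                out += [(perm[dihedral(x, k)], perm[dihedral(y, k)]) for x, y in pairs]
        return out

    def predict_with_voting(model, x):
        # predict on every transformed view, undo the transform, majority-vote per cell
        views = []
        for k in range(8):
            y = model(dihedral(x, k))
            y = dihedral(y, (4 - k) % 4 if k < 4 else k)  # reflections are self-inverse
            views.append(y)
        return np.apply_along_axis(lambda v: Counter(v).most_common(1)[0][0],
                                   0, np.stack(views))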
Not exactly "vanilla Transformer", but rather "a Transformer-like architecture with recurrence".
Which is still a fun idea to play around with - this approach clearly has its strengths. But it doesn't appear to be an actual "better Transformer". I don't think it deserves nearly as much hype as it gets.
Right. There should really be a vanilla Transformer baseline.
With recurrence: the idea has been around for a while: https://arxiv.org/abs/1807.03819 (minimal sketch below).
There are reasons why it hasn't really been picked up at scale, and the method tends to do well on synthetic tasks.
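For concreteness, the Universal-Transformer-style idea is a single weight-tied block applied repeatedly, so depth comes from iteration rather than from a stack of distinct layers. A toy PyTorch sketch (sizes are arbitrary, not taken from either paper):

    import torch.nn as nn

    class RecurrentTransformer(nn.Module):
        def __init__(self, d_model=256, n_heads=8, n_steps=16):
            super().__init__()
            # one encoder block whose weights are reused at every step
            self.block = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
                batch_first=True, norm_first=True)
            self.n_steps = n_steps

        def forward(self, x):              # x: (batch, seq, d_model)
            for _ in range(self.n_steps):  # depth via iteration, not via more layers
                x = self.block(x)
            return x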
Wow, so not only are the findings from https://arxiv.org/abs/2506.21734 (posted on HN a while back) confirmed, they're generalizable? Intriguing. I wonder if this will pan out in practical use cases, it'd be transformative.
Also would possibly instantly void the value of trillions of pending AI datacenter capex, which would be funny. (Though possibly not for very long.)
Any mention of "HRM" is incomplete without this analysis:
https://arcprize.org/blog/hrm-analysis
This here looks like a stripped down version of HRM - possibly drawing on the ablation studies from this very analysis.
Worth noting that HRMs aren't generally applicable in the same way normal transformer LLMs are. Or, at least, no one has found a way to apply them to the typical generative AI tasks yet.
I'm still reading the paper, but I expect this version to be similar - it uses the same tasks as HRMs as examples. Possibly quite good at spatial reasoning tasks (ARC-AGI and ARC-AGI-2 are both spatial reasoning benchmarks), but it would have to be integrated into a larger more generally capable architecture to go past that.
That's a good read also shared by another poster above, thanks! If I'm reading this right, it contextualizes, but doesn't negate the findings from that paper.
I've got a major aesthetic problem with the fact LLMs require this much training data to get where they are, namely, "not there yet"; it's brute force by any other name, and just plain kind of vulgar. Although more importantly it won't scale much further. Novel architectures will have to feature in at some point, and I'll gladly take any positive result in that direction.
That analysis worded its evaluation of HRM and its contributions very non-abrasively. The comparison with a recursive/universal Transformer under the same settings is telling.
"These results suggest that the performance on ARC-AGI is not an effect of the HRM architecture. While it does provide a small benefit, a replacement baseline transformer in the HRM training pipeline achieves comparable performance."
Jevons paradox applies here IMHO. Cheaper AI per watt = more demand.
"Also would possibly instantly void the value of trillions of pending AI datacenter capex"
GPU compute is not just for text inference. Video generation demand is something I don't think we'll saturate for quite a while, even with breakthroughs.
It would be fitting if the AI bubble was popped by AI getting too good and too efficient
Abstract:
Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies.
This biologically inspired method beats Large Language Models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (around 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal.
We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers.
With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.
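For those skimming, the recursion described there works roughly like this as I read it: one tiny network repeatedly refines a latent scratchpad z and a current answer y. This is only my sketch with made-up dimensions and names; the real code is at https://github.com/SamsungSAILMontreal/TinyRecursiveModels.

    import torch
    import torch.nn as nn

    class TinyRecursiveSketch(nn.Module):
        def __init__(self, d=128, n_latent_steps=6, n_answer_steps=3):
            super().__init__()
            # "a single tiny network with only 2 layers"
            self.net = nn.Sequential(nn.Linear(3 * d, d), nn.GELU(), nn.Linear(d, d))
            self.n_latent_steps, self.n_answer_steps = n_latent_steps, n_answer_steps

        def forward(self, x, y, z):
            for _ in range(self.n_answer_steps):       # repeatedly improve the answer
                for _ in range(self.n_latent_steps):   # refine the latent given x and y
                    z = self.net(torch.cat([x, y, z], dim=-1))
                y = self.net(torch.cat([x, y, z], dim=-1))  # then refresh the answer
            return y, z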
"With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters."
Well, that's pretty compelling when taken in isolation. I wonder what the catch is?
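For scale, the "less than 0.01%" figure is easy to sanity-check: 7M parameters is 0.01% of a 70B-parameter model, so the comparison holds against LLMs of roughly that size and up.

    trm_params = 7e6
    print(trm_params / 0.0001)  # 70,000,000,000 -> a 70B-parameter model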
It won't be any good at factual questions, for a start; it will be reliant on an external memory. Everything would have to be reasoned from first principles, without knowledge.
My gut feeling is that this will limit its capability, because creativity and intelligence involve connecting disparate things, and to do that you need to know them first. Though philosophers have tried, you can't unravel the mysteries of the universe through reasoning alone. You need observations, facts.
What I could see it being good for is a dedicated reasoning module.
That's been my expectation from the start.
We'll need a memory system, an executive function/reasoning system as well as some sort of sense integration - auditory, visual, text in the case of LLMs, symbolic probably.
A good avenue of research would be to see if you could glue OpenCyc to this for external "knowledge" (rough sketch below).
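Purely hypothetical, but the split being discussed would look something like this: an external store holds the facts, and the tiny model only does the reasoning step over whatever gets retrieved. Every name here is made up for illustration.

    from dataclasses import dataclass

    @dataclass
    class Fact:
        subject: str
        relation: str
        obj: str

    class KnowledgeStore:
        # stand-in for OpenCyc, a vector DB, or any other external memory
        def __init__(self, facts):
            self.facts = facts

        def lookup(self, entity):
            return [f for f in self.facts if entity in (f.subject, f.obj)]

    def answer(question_entities, store, reasoner):
        # 1. retrieve relevant facts from external memory
        facts = [f for e in question_entities for f in store.lookup(e)]
        # 2. hand only the question + retrieved facts to the reasoning module
        return reasoner(question_entities, facts)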
LLMs are fundamentally a dead end.
GitHub link: https://github.com/SamsungSAILMontreal/TinyRecursiveModels
Should it be a larger frontier model, with this as a tool call (one LLM tool-calling another) to verify the larger one?
Why not go nuts with it and put it in the speculative decoding algorithm (sketched below)?
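To spell out what that would mean: in speculative decoding a small draft model proposes a few tokens cheaply and the big model verifies them. TRM as published isn't an autoregressive token model, so this is just the general shape, with greedy acceptance for simplicity (a real implementation scores all k proposals in one batched forward pass):

    def speculative_decode(draft_next, target_next, prompt, k=4, max_new=64):
        # draft_next / target_next: callables mapping a token list to the next token
        tokens = list(prompt)
        while len(tokens) < len(prompt) + max_new:
            # the small draft model proposes k tokens
            proposal, ctx = [], list(tokens)
            for _ in range(k):
                t = draft_next(ctx)
                proposal.append(t)
                ctx.append(t)
            # the big model verifies; keep proposals up to the first disagreement
            for i in range(k):
                expected = target_next(tokens + proposal[:i])
                if proposal[i] != expected:
                    tokens.append(expected)  # take the target's token instead
                    break
                tokens.append(proposal[i])
        return tokens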
If we could somehow weave a reasoning tool directly into the inference process, without having to use the context for it, that'd be something. Perhaps compile to weights and pretend this part is pretrained…? No idea if it's feasible, but it'd definitely be a breakthrough if AI had access to z3 in hidden layers.
" With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI- 1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters"
That is very impressive.
Side note: Superficially reminds me of Hierarchical Temporal Memory from Jeff Hawkins' "On Intelligence". Although this doesn't have the sparsity aspect, its hierarchical and temporal aspects are related.
https://en.wikipedia.org/wiki/Hierarchical_temporal_memory https://www.numenta.com
I suspect the lack of sparsity is an Achilles' heel of the current LLM approach.
GitHub: https://github.com/SamsungSAILMontreal/TinyRecursiveModels
So what happens when we figure out how to 10x both scale and throughput on existing hardware by using it more efficiently? Will gigantic models still be useful?
Of course! We still have computers the size of mainframes that ran on vacuum tubes. They are just built with vastly more powerful hardware and are used for specialized tasks that supercomputing facilities care about.
But it has the potential to alter the economics of AI quite dramatically.