For the past few years, the story of AI getting stronger has been straightforward: throw more training compute at it, feed it more data, make the model bigger. Think of it as reading more books and doing more practice problems—cramming capability into parameters during training. But recently, the winds may be shifting: a model doesn’t necessarily need to keep growing; as long as it “thinks a bit longer” when facing hard problems, results can improve significantly.

This has a more formal name in the industry: Test-time Scaling. The intuition is simple: training is like studying, inference is like taking the exam. We used to compete only on how hard we studied; now we also compete on thinking 30 seconds longer during the test. That raises a practical question: if we think longer on every query, who pays for it, and how do we make it affordable?

I recently came across two developments that explain this perfectly: one is the algorithmic approach of looped models (Looped Language Models / Ouro), and the other is the system and hardware approach with NVIDIA Rubin. In my view, they’re solving the same problem at different layers: the former teaches the brain how to think better, the latter turns “thinking better” into scalable, commercializable infrastructure.

1. Looped Models: Same Neurons, Stronger Synaptic Circuits

To get an intuitive sense of what makes looped models different, I like a brain analogy: a looped model doesn’t add more neurons (the parameter count stays the same), but it strengthens the synaptic connections and circuit structure, making association and reasoning easier. This isn’t literary hand-waving; it corresponds almost one-to-one with a key finding in the paper: looping doesn’t significantly increase a model’s knowledge capacity, but it significantly enhances its ability to use that knowledge.

How does it work? LoopLM’s core design is simple: share parameters across a group of Transformer layers and repeatedly cycle through them in a single forward pass. Think of it as the same thinking module running multiple rounds, updating hidden states each iteration to progressively refine internal representations. This brings two direct benefits: first, it decouples computational depth from parameter scale; second, it doesn’t rely on generating longer output tokens for explicit thinking—instead, thinking happens in latent space (hidden states), avoiding the bloat of ever-growing context.
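To make the mechanism concrete, here is a minimal PyTorch sketch of the looped-forward idea. It is my illustration of the general pattern, not Ouro’s actual architecture: the single encoder layer stands in for the shared group of layers, and the loop count of 4 is an arbitrary assumption.

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """Sketch of a looped forward pass: one shared set of weights, applied repeatedly."""

    def __init__(self, d_model=512, n_heads=8, n_loops=4):
        super().__init__()
        # A single layer whose parameters are shared across all loop steps
        # (a stand-in for Ouro's shared group of Transformer layers).
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.n_loops = n_loops

    def forward(self, hidden):
        # Same parameters, applied n_loops times: computational depth grows,
        # parameter count does not. "Thinking" happens in the hidden states,
        # not as extra output tokens.
        for _ in range(self.n_loops):
            hidden = self.shared_layer(hidden)
        return hidden

x = torch.randn(1, 16, 512)      # (batch, seq_len, d_model)
print(LoopedBlock()(x).shape)    # torch.Size([1, 16, 512])
```

The thing to notice is that `shared_layer` is the only set of weights: running it four times quadruples the effective depth (and compute) without adding a single parameter.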

At the usage level, looped models typically come with an on-demand routing strategy: fewer loops for simple problems, more for hard ones. This isn’t a hand-crafted rule but something the model learns. The paper mentions a learned early-exit mechanism, with Q-exit as a deployment-time adjustable threshold, letting the same model make tradeoffs under different compute budgets without retraining. To me, this is where it most resembles human test-taking: not writing 20 lines of scratch work for every problem, but deliberating more on hard questions while breezing through easy ones.
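Here is a toy version of that thresholded exit, again only a sketch under my own assumptions: the `exit_head` and the threshold `tau` are stand-ins for whatever learned exit signal and deployment-time knob the paper actually uses, not its Q-exit formulation.

```python
import torch
import torch.nn as nn

def forward_with_early_exit(hidden, shared_layer, exit_head, tau=0.9, max_loops=8):
    """Loop until a learned 'safe to stop' score crosses tau, or the budget runs out."""
    used = 0
    for used in range(1, max_loops + 1):
        hidden = shared_layer(hidden)
        # Mean-pooled scalar score in (0, 1); higher means "confident enough to stop".
        score = torch.sigmoid(exit_head(hidden).mean())
        if score > tau:
            break
    return hidden, used

d_model = 512
shared_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
exit_head = nn.Linear(d_model, 1)
hidden, loops_used = forward_with_early_exit(
    torch.randn(1, 16, d_model), shared_layer, exit_head
)
# With random weights this will usually spend the full budget;
# a trained exit head would stop earlier on easy inputs.
print(loops_used)
```

Lowering `tau` spends fewer loops per query; raising it buys more deliberation. That is exactly the deployment-time budget knob described above, and it needs no retraining.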

Here’s a personal aside. I’m reminded of the anime “ID:INVADED,” where the protagonist enters a psychological device and becomes a memory-wiped, emotionless detective machine. It vividly demonstrates that in the same environment, someone with stronger reasoning circuits really can solve problems more easily.

2. Training-Time Looping + Inference-Time Looping: Both Are Essential

After reading about looped models, a natural question popped into my head: what if we just looped an ordinary Transformer’s layers a few times at inference time? After kicking the idea around with an AI assistant, the conclusion was: there might be some improvement, but it’s hard to match a model that was trained to loop.

The reason is that looped inference only works stably and effectively if certain foundations are laid during training: deeper loops should yield better predictions, and the model should also know when to stop. These capabilities are shaped by the training objective and process. Looped models treat multi-round iteration as the norm during pretraining, typically computing a loss at each loop step so the model learns that every additional round pulls the answer distribution closer to correct. The paper states this property explicitly: the training objective needs to preserve “deeper-is-better,” letting next-token loss improve monotonically in expectation with loop count. That turns the whole system into an anytime algorithm: you can start emitting output from any intermediate step, and later steps continue to verify or correct it.
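A hedged sketch of what “loss at every loop step” might look like in code. The uniform averaging over steps and the shared LM head are my simplifications, not necessarily the paper’s exact weighting or recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_step_loss(hidden, targets, shared_layer, lm_head, n_loops=4):
    """Supervise every loop depth, so each extra round is trained to help."""
    total = 0.0
    for _ in range(n_loops):
        hidden = shared_layer(hidden)
        logits = lm_head(hidden)                      # (batch, seq, vocab)
        total = total + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
        )
    # Equal weight per step is an assumption; the idea is just that
    # intermediate depths are not left unsupervised.
    return total / n_loops

d_model, vocab = 512, 1000
shared_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
lm_head = nn.Linear(d_model, vocab)
hidden = torch.randn(2, 16, d_model)
targets = torch.randint(0, vocab, (2, 16))
print(multi_step_loss(hidden, targets, shared_layer, lm_head))
```

Because every intermediate depth carries its own loss, the model has a reason to make step k+1 at least as good as step k, which is what lets you exit anywhere and still get a usable answer.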

This is like cultivating problem-solving habits: regular models are like fast thinkers who glance and submit; loop-trained models are like those forced to think slowly during training, required to deliberate several rounds before stopping. Give them extra thinking budget at deployment, and they know how to use it wisely rather than wasting extra compute on meaningless repetition.

As an aside, the paper has an interesting observation: in experiments, once the loop count exceeds the depth used in training, some benchmarks can degrade, yet safety alignment actually keeps improving with more loops, even at those extrapolated steps. This suggests at least one thing: once “thinking more rounds” becomes controllable, safety mechanisms might be redesigned alongside reasoning capabilities.

3. Rubin: Faster Reactions, Larger Working Memory

After discussing Ouro’s circuit upgrades, let’s return to reality: more thinking rounds means more compute, more intermediate states, more KV cache, more data movement. You can have the smartest thinking approach, but if every step is bottlenecked by memory and speed, you’ll still be dragged back into cost hell.

This is my strongest impression when looking at NVIDIA Rubin: the entire system is industrializing test-time scaling. Jensen Huang used “letting AI think a bit longer when encountering problems” as the setup, then described Rubin as a co-designed next-generation AI computing platform: Vera CPU, Rubin GPU, NVLink 6, ConnectX-9, BlueField-4, Spectrum-6 all working together, with the goal of driving down inference costs to make “thinking more” economically viable.

But there’s an easily misunderstood point here: many people’s first reaction is to equate Rubin’s context storage platform with the model gaining long-term memory. I think it’s more accurate to say that Rubin is rebuilding working memory: it is neither stuffing knowledge into parameters as long-term memory, nor injecting external memory at query time.

Stratifying memory makes this clearer: knowledge baked into model weights is long-term memory; the KV cache and other inference-time intermediate states are working memory; RAG, logs, and vector stores are external memory. When large models slow down in multi-turn conversation, complex tool calls, and multi-agent collaboration, the root cause is usually where working memory lives and how it moves: it is either crammed into expensive, limited GPU memory, or spilled to slow storage where throughput dies. What Rubin’s inference-context memory storage platform aims to do is establish a third tier between GPU memory and traditional storage: larger capacity, fast enough access, and shareable across nodes, so that long context and multi-agent collaboration no longer slow to a crawl.

Imagine it like this: before, we had a detective solving cases without a workbench, having to replay footage and recopy notes every time; now we’ve given them a bigger desk, closer filing cabinets, faster conveyor belts, plus the ability to share clues with other detectives. Is the detective smarter? Not necessarily. But they can more stably and cheaply think more rounds without crashing from data shuffling.

4. Allocating Two Types of Budgets Determines Competitiveness in the Inference Era

When I look at looped models and Rubin together, a picture becomes clear: AI is moving from “bigger” to “better at allocating budgets.” And this budget splits into at least two types.

The first is compute budget. Looped models build “thinking more rounds” into the structure and provide deployment-time adjustable thresholds like Q-exit, letting the same model achieve different results under different budgets.

The second is memory budget. Looped inference naturally multiplies KV-cache pressure: the paper directly acknowledges that the naive approach incurs 4× memory overhead, and then offers a very practical engineering fix: reuse the KV cache during decoding, sharing one entry across loop steps (the last step’s, or an average), which brings memory back down to the non-looped baseline with almost no performance drop. On the hardware side, Rubin goes further, making context placement and sharing a system-level third memory tier, so that working memory for large-scale inference and multi-agent collaboration is no longer the bottleneck.
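As a back-of-the-envelope check on that 4× figure, here is a small sketch; the shapes are made-up illustrative numbers, not Ouro’s configuration, and the “shared entry” is just a placeholder for the last-step or averaged strategy mentioned above.

```python
import torch

# Illustrative sizes only: 4 loop steps, 2048-token context, 32 heads, 128-dim heads.
n_loops, seq_len, n_heads, head_dim = 4, 2048, 32, 128

# Naive: one K/V entry per loop step, so cache memory grows with loop count.
naive_cache = [torch.zeros(seq_len, n_heads, head_dim) for _ in range(n_loops)]
# Shared: a single entry reused across steps (last step's, or an average).
shared_cache = torch.zeros(seq_len, n_heads, head_dim)

naive_bytes = sum(t.numel() * t.element_size() for t in naive_cache)
shared_bytes = shared_cache.numel() * shared_cache.element_size()
print(naive_bytes // shared_bytes)   # 4 -> the "4x back down to 1x" in the text
```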

So if I had to summarize in one sentence: looped models are upgrading the brain’s thinking circuits, Rubin is upgrading the brain’s workbench and memory system. Together, they’re turning Test-time Scaling from concept into productivity.

5. Points Worth Continued Attention

  • Can looping (or more broadly, iterative inference) beat today’s training and invocation paradigms and become the better path? That depends on whether the results in these papers generalize well beyond their benchmarks.

  • The memory track is not only crowded but has started to stratify: from hardware at the bottom to user sovereignty at the top, every layer deserves attention.

  • A thought experiment: could safety also become an allocatable budget? The observation that increased loop count may improve safety alignment is intriguing. If thinking longer not only makes you smarter but also more cautious, the relationship between safety and performance might be rewritten.