A few observations on the space. IMO, the feedback stack is the profit stack. Everyone's talking about bigger GPUs, but the scarcer asset is trusted reward data. OpenAI's Codex-style unit-test corpora, Anthropic's constitutional critics, and DeepSeek's cold-start chain-of-thought libraries now function like proprietary datasets. They compound: better models produce richer traces → richer traces feed better graders → graders mint harder tasks → the moat widens. In some sense this is meta-learning for the models: as models produce better hypotheses, the next generation of models gets trained on them.
We are fixated on whether an answer is correct. For RL, what matters is how finely you can score partial progress.
- In chess and Go, we reward every game played via Elo deltas.
- Math proofs reward only at Q.E.D.
- Coding sits in a sweet middle spot: unit tests emit dense signals (e.g. passing 7/10 tests); see the sketch after this list.
- For domains with sparse signals, progress will lag unless we invent dense synthetic metrics.
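A minimal sketch of that dense-vs-sparse distinction, using unit-test pass counts; the function names and numbers are just illustrative:

```python
# Illustrative only: a sparse all-or-nothing reward versus a dense
# "partial credit" reward computed from unit-test results.

def sparse_reward(test_results: list[bool]) -> float:
    """1.0 only if every test passes (Q.E.D.-style reward)."""
    return 1.0 if all(test_results) else 0.0

def dense_reward(test_results: list[bool]) -> float:
    """Fraction of tests passed, e.g. 7/10 -> 0.7."""
    return sum(test_results) / len(test_results)

results = [True] * 7 + [False] * 3
print(sparse_reward(results))  # 0.0 - no signal about partial progress
print(dense_reward(results))   # 0.7 - partial progress is visible
```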
Value in the next cycle won't come from shipping new LLMs; it will come from commoditizing grading as a service. Think "Stripe for rewards": plug-in APIs that mint dense, tamper-resistant reward signals for any digital domain. Whoever nails cross-domain, adversarially robust grading will dictate the pace of reasoning progress more than GPU fabs.
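To make the "plug-in grading API" idea concrete, here is a purely hypothetical interface sketch; none of these names correspond to a real service, and a real grader would be a model or a test harness rather than a keyword match:

```python
# Hypothetical sketch of a grading-as-a-service interface. All names are
# invented for illustration; no real API is being described.
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class GradeResult:
    score: float                   # dense reward in [0, 1]
    breakdown: dict[str, float]    # per-criterion partial scores
    tamper_flags: list[str] = field(default_factory=list)  # suspected reward hacking

class Grader(Protocol):
    def grade(self, task: str, submission: str) -> GradeResult: ...

class KeywordGrader:
    """Toy grader: score = fraction of required keywords present in the submission."""
    def __init__(self, required: list[str]):
        self.required = required

    def grade(self, task: str, submission: str) -> GradeResult:
        hits = {kw: float(kw in submission) for kw in self.required}
        return GradeResult(score=sum(hits.values()) / len(hits), breakdown=hits)

def reward(grader: Grader, task: str, submission: str) -> float:
    result = grader.grade(task, submission)
    return 0.0 if result.tamper_flags else result.score  # zero out suspected gaming

print(reward(KeywordGrader(["churn", "ROI"]), "write a retention memo", "Reduce churn by ..."))  # 0.5
```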
Nice post! A very relevant related concept is *'exploration/experimentation'*.
Learning systems gather new knowledge and insights from observations/data. Random or arbitrary data aren't especially helpful! The data needs to tell you something new that you didn't already know - so it pays to deliberately seek out and gather novel, informative observations.
In contemporary frontier AI systems, humans have mostly been responsible for gathering that 'high quality data', often in quite hands-off ways like scraping huge datasets from the internet, but latterly with more attention on procuring and curating especially informative or exemplary data.
With RL, the data coming in starts to depend increasingly on the activity of the system itself - together with whatever grading mechanism is in place (which is what you foreground here). That's why so many RL conversations of the past were obsessed with exploration: taking judicious actions to get the most informative observations!
Still, in RL, the human engineers are able to curate training environments with high-signal automated feedback systems, as you've discussed here. On the other hand, once you're talking about activities like R&D of various kinds, the task of exploring *is inherently most of the task itself*, making within-context exploration essential! This makes 'learning to learn' or 'learning to explore/experiment' among the most useful ways to operationalise 'AGI', from my perspective. (Of course there are nevertheless also many transformative impacts that can come from AI merely with heaps of crystallised intelligence and less R&D ability.)
I'm pretty uncertain on how domain-generalisable this meta skill of 'being good at learning to explore' will turn out to be. I think the evidence from humans and orgs is that it's somewhat generalisable (e.g. IQ, g-factors, highly productive R&D orgs which span domains, the broader institution of 'science and technology' post those particular revolutions), but that domain-specific research taste, especially at frontiers, is also something that's necessarily acquired, or at least mastered, through domain-specific experience.
I expanded this thought into a full blogpost: https://www.oliversourbut.net/p/you-cant-skip-exploration
- Introducing exploration and experimentation
- Why does exploration matter?
- Research and taste
- From play to experimentation
- Exploration in AI, past and future
- Research by AI: AI with research taste?
- Opportunities
There's an important sleeper paper which, if its results scale to DeepSeek V3 and GPT-4o and keep holding more generally, basically says that RL on verifiable rewards can't create new capabilities, and that there's a hard limit to how much RL you can do before it saturates. That would make RL on verifiable rewards a sample-efficiency trick, not a way to create new capabilities, and inference scaling would basically be dead:
https://arxiv.org/abs/2504.13837
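As I understand it, the paper's headline comparison is pass@k at large k between base and RL-tuned models. For reference, a minimal sketch of the standard unbiased pass@k estimator (from the Codex paper), with made-up numbers:

```python
# Unbiased pass@k estimator (Chen et al., 2021); the numbers below are invented
# purely to illustrate how a base model can catch up at large k.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples drawn per problem, c = correct samples, k = attempt budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 5, 1), pass_at_k(200, 5, 128))    # ~0.025, ~0.99 (base model)
print(pass_at_k(200, 40, 1), pass_at_k(200, 40, 128))  # ~0.20,  ~1.0  (RL-tuned model)
```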
Big if true!
Would one way around this be if pretraining keeps yielding (at least some) performance improvements? E.g. obviously GPT-4.5 was disappointing at first blush, but perhaps it has a higher underlying capacity to be RL-ified than earlier models, and perhaps that can continue?
I agree that pre-training scaling remains, and of course new algorithms/paradigms that boost AI progress far more could always appear, but it does mean the reports about RL on CoTs creating a new capabilities pathway are not true, and we'd have to accept that the faster progress from current RL will soon come to an end.
Future GPTs will be more capable, but RL on CoTs will not make them much more capable beyond the base model; it's a limited sample-efficiency trick, so there's no reason to scale inference compute very far. We don't have two capability lines pushing upwards; we have one capability line with a constant-factor boost from RL on CoTs.
Does this imply that we should expect less impressive results in areas with long horizons and unclear reward signals? Or is the hope that these kinds of problems are sufficiently decomposable that simply crushing it at shorter tasks will prove sufficient?
There are some problems that AI just won't be able to speed up much, like understanding the effect of a change in primary-school teaching methods on total career earnings.
I don't know - I think that's exactly the question: will we see worse performance on those tasks (I'd expect yes), or will they turn out to be decomposable enough that the models can learn to do well (I'd guess broadly no, but perhaps yes in some areas)? We'll see.
Great writeup! I'm particularly watching to see to what extent autograders will be able to identify (and thus enable) superhuman performance in unverifiable domains like business strategy.
By definition, there is no training data for superhuman performance in those domains, and humans ourselves might or might not be able to verify performance.
In that case, it would be something we'd have to test in the real world over time - does this strategy, which may seem incomprehensible to humans (move 37), ultimately produce better results? That would be an interesting trust fall to make, and if successful, it would open a scary era of human disempowerment, where AI isn't directly auditable by human minds.
We're already pretty deep into the "who guards the guards?" problem with AI judges. I have to wonder which domains of superhuman performance would most contribute to disempowerment.
Very helpful explanation!
Chemistry seems like an odd example to pick for "hard to verify." Given a huge compute budget, specific proposed reactions can be checked by simulating quantum underpinnings of molecular behavior in as much detail as necessary. Same for astrophysics, or the weather: approximations get invented and refined all the time, tradeoffs evaluated between speed and accuracy.
As I understand it, trying to verify chemical synthesis pathways in simulation could be possible in principle, but is totally computationally intractable with anything like current systems. Chemists do use simulations, but they have to verify them experimentally before they believe them.
IIUC this is an area where quantum computing could make a big difference, so perhaps we'll have auto-verification in chemistry at some future point, but I don't think it's any time soon.
Fair enough!
Then again, if someone does start training an AI specifically on a dataset of previously simulated chemical synthesis pathways, with detailed feedback from cases where results didn't match the experimental outcome, maybe it'll find a simpler, better simulation method which identifies and resolves flaws in our core understanding of the fundamental theories involved. That seems like it could get *very* interesting.
An area where I'd expect progress on auto-graded results is mechatronic design.
Modelling how combinations of structures and actuators perform in simulation is something AI can already help design and run.
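Purely as a toy illustration of what simulation-based auto-grading could mean here (a real pipeline would run a full dynamics simulation; every name and number below is made up):

```python
# Toy, hypothetical grader for a robot-arm design: dense score in [0, 1]
# from a crude static-torque check rather than a real simulation.
from dataclasses import dataclass

@dataclass
class ArmDesign:
    arm_length_m: float
    payload_kg: float
    actuator_torque_nm: float

def grade(design: ArmDesign, g: float = 9.81) -> float:
    """Score by torque margin against the worst-case static holding torque."""
    required = design.payload_kg * g * design.arm_length_m
    margin = design.actuator_torque_nm / required
    return max(0.0, min(1.0, margin / 2.0))  # full credit at a 2x safety margin

print(grade(ArmDesign(arm_length_m=0.5, payload_kg=2.0, actuator_torque_nm=15.0)))  # ~0.76
```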