A few observations on the space. IMO, the feedback stack is the profit stack. Everyone's talking about bigger GPUs, but the scarcer asset is trusted reward data. OpenAI's Codex-style unit-test corpora, Anthropic's constitutional critics, and DeepSeek's cold-start chain-of-thought libraries now function like proprietary datasets. They compound: better models produce richer traces → richer traces feed better graders → graders mint harder tasks → the moat widens. In some sense this is meta-learning for the models: as models produce better hypotheses, the next generation of models gets trained on them.
We are fixated on whether an answer is correct. For RL, what matters is how finely you can score partial progress.
- In chess and Go, we reward every game played via Elo deltas.
- Math proofs reward only at Q.E.D.
- Coding sits in a sweet middle spot: unit tests emit dense signals (e.g. passing 7/10 tests); see the sketch after this list.
- For domains with sparse signals, progress will lag unless we invent dense synthetic metrics.
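A minimal sketch of that dense-vs-sparse distinction, using unit-test pass counts; the function names and numbers are just illustrative:

```python
# Illustrative only: a sparse all-or-nothing reward versus a dense
# "partial credit" reward computed from unit-test results.

def sparse_reward(test_results: list[bool]) -> float:
    """1.0 only if every test passes (Q.E.D.-style reward)."""
    return 1.0 if all(test_results) else 0.0

def dense_reward(test_results: list[bool]) -> float:
    """Fraction of tests passed, e.g. 7/10 -> 0.7."""
    return sum(test_results) / len(test_results)

results = [True] * 7 + [False] * 3
print(sparse_reward(results))  # 0.0 - no signal about partial progress
print(dense_reward(results))   # 0.7 - partial progress is visible
```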
Value in the next cycle won't come from shipping new LLMs; it will come from commoditizing grading as a service. Think "Stripe for rewards": plug-in APIs that mint dense, tamper-resistant reward signals for any digital domain. Whoever nails cross-domain, adversarially robust grading will dictate the pace of reasoning progress more than GPU fabs.
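To make the "plug-in grading API" idea concrete, here is a purely hypothetical interface sketch; none of these names correspond to a real service, and a real grader would be a model or a test harness rather than a keyword match:

```python
# Hypothetical sketch of a grading-as-a-service interface. All names are
# invented for illustration; no real API is being described.
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class GradeResult:
    score: float                   # dense reward in [0, 1]
    breakdown: dict[str, float]    # per-criterion partial scores
    tamper_flags: list[str] = field(default_factory=list)  # suspected reward hacking

class Grader(Protocol):
    def grade(self, task: str, submission: str) -> GradeResult: ...

class KeywordGrader:
    """Toy grader: score = fraction of required keywords present in the submission."""
    def __init__(self, required: list[str]):
        self.required = required

    def grade(self, task: str, submission: str) -> GradeResult:
        hits = {kw: float(kw in submission) for kw in self.required}
        return GradeResult(score=sum(hits.values()) / len(hits), breakdown=hits)

def reward(grader: Grader, task: str, submission: str) -> float:
    result = grader.grade(task, submission)
    return 0.0 if result.tamper_flags else result.score  # zero out suspected gaming

print(reward(KeywordGrader(["churn", "ROI"]), "write a retention memo", "Reduce churn by ..."))  # 0.5
```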
Nice post! A very relevant related concept is *'exploration/experimentation'*.
Learning systems gather new knowledge and insights from observations/data. Random or arbitrary data aren't especially helpful! The data needs to tell you something new that you didn't already know - so it pays to deliberately seek out and gather novel, informative observations.
In contemporary frontier AI systems, humans have mostly been responsible for gathering that 'high quality data', often in quite hands-off ways like scraping huge datasets from the internet, but latterly with more attention on procuring and curating especially informative or exemplary data.
With RL, the data coming in starts to depend increasingly on the activity of the system itself - together with whatever grading mechanism is in place (which is what you foreground here). That's why so many RL conversations of the past were obsessed with exploration: taking judicious actions to get the most informative observations!
Still, in RL, the human engineers are able to curate training environments with high-signal automated feedback systems, as you've discussed here. On the other hand, once you're talking about activities like R&D of various kinds, the task of exploring *is inherently most of the task itself*, making within-context exploration essential! This makes 'learning to learn' or 'learning to explore/experiment' among the most useful ways to operationalise 'AGI', from my perspective. (Of course there are nevertheless also many transformative impacts that can come from AI merely with heaps of crystallised intelligence and less R&D ability.)
I'm pretty uncertain on how domain-generalisable this meta skill of 'being good at learning to explore' will turn out to be. I think the evidence from humans and orgs is that it's somewhat generalisable (e.g. IQ, g-factors, highly productive R&D orgs which span domains, the broader institution of 'science and technology' post those particular revolutions), but that domain-specific research taste, especially at frontiers, is also something that's necessarily acquired, or at least mastered, through domain-specific experience.
I expanded this thought into a full blogpost: https://www.oliversourbut.net/p/you-cant-skip-exploration
- Introducing exploration and experimentation
- Why does exploration matter?
- Research and taste
- From play to experimentation
- Exploration in AI, past and future
- Research by AI: AI with research taste?
- Opportunities
There's an important sleeper paper which, if its results scale to DeepSeek V3 and GPT-4o and keep holding more generally, basically says that RL on verifiable rewards can't create new capabilities, and that there's a hard limit to how much RL you can do before it saturates. That would make RL on verifiable rewards a sample-efficiency trick, not a way to create new capabilities, and inference scaling would basically be dead:
https://arxiv.org/abs/2504.13837
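As I understand it, the paper's headline comparison is pass@k at large k between base and RL-tuned models. For reference, a minimal sketch of the standard unbiased pass@k estimator (from the Codex paper), with made-up numbers:

```python
# Unbiased pass@k estimator (Chen et al., 2021); the numbers below are invented
# purely to illustrate how a base model can catch up at large k.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples drawn per problem, c = correct samples, k = attempt budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 5, 1), pass_at_k(200, 5, 128))    # ~0.025, ~0.99 (base model)
print(pass_at_k(200, 40, 1), pass_at_k(200, 40, 128))  # ~0.20,  ~1.0  (RL-tuned model)
```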
Big if true!
Would one way around this be if pretraining keeps yielding (at least some) performance improvements? E.g. obviously GPT-4.5 was disappointing at first blush, but perhaps it has a higher underlying capacity to be RL-ified than earlier models, and perhaps that can continue?
I agree that pre-training scaling remains, and of course new algorithms/paradigms that boost AI progress far more could always appear, but it does mean the reports about RL on CoTs creating a new capabilities pathway are not true, and we'd have to accept that the faster progress from current RL will soon come to an end.
Future GPTs will be more capable, but RL on CoTs will not make them much more capable beyond the base model; it's a limited sample-efficiency trick, so there's no reason to scale inference compute very far. We don't have two capability lines pushing upwards; we have one capability line with a constant-factor boost from RL on CoTs.
Does this imply that we should expect less impressive results in areas with long horizons and unclear reward signals? Or is the hope that these kinds of problems are sufficiently decomposable that simply crushing it at shorter tasks will prove sufficient?
There are some problems that AI just won't be able to speed up much, like understanding the effect of a change in primary-school teaching methods on total career earnings.
I don't know - I think that's exactly the question: will we see worse performance on those tasks (I'd expect yes), or will they turn out to be decomposable enough that the models can learn to do well (I'd guess broadly no, but perhaps yes in some areas)? We'll see.
Great writeup! I'm particularly watching to see to what extent autograders will be able to identify (and thus enable) superhuman performance in unverifiable domains like business strategy.
By definition, there is no training data for superhuman performance in those domains, and humans ourselves might or might not be able to verify performance.
In that case, it would be something we'd have to test in the real world over time - does this strategy, which may seem incomprehensible to humans (move 37), ultimately produce better results? That would be an interesting trust fall to make, and if successful, it would open a scary era of human disempowerment, where AI isn't directly auditable by human minds.
We're already pretty deep into the "who guards the guards?" problem with AI judges. I have to wonder which domains of superhuman performance would most contribute to disempowerment.
Very helpful explanation!
Chemistry seems like an odd example to pick for "hard to verify." Given a huge compute budget, specific proposed reactions can be checked by simulating quantum underpinnings of molecular behavior in as much detail as necessary. Same for astrophysics, or the weather: approximations get invented and refined all the time, tradeoffs evaluated between speed and accuracy.
As I understand it, trying to verify chemical synthesis pathways in simulation could be possible in principle, but is totally computationally intractable with anything like current systems. Chemists do use simulations, but they have to verify them experimentally before they believe them.
IIUC this is an area where quantum computing could make a big difference, so perhaps we'll have auto-verification in chemistry at some future point, but I don't think it's any time soon.
Fair enough!
Then again, if someone does start training an AI specifically on a dataset of previously simulated chemical synthesis pathways, with detailed feedback from cases where results didn't match the experimental outcome, maybe it'll find a simpler, better simulation method which identifies and resolves flaws in our core understanding of the fundamental theories involved. That seems like it could get *very* interesting.
An area where I'd expect progress on auto-graded results is mechatronic design.
Modelling how combinations of structures and actuators perform in simulation is something AI can already help design and run.
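Purely as a toy illustration of what simulation-based auto-grading could mean here (a real pipeline would run a full dynamics simulation; every name and number below is made up):

```python
# Toy, hypothetical grader for a robot-arm design: dense score in [0, 1]
# from a crude static-torque check rather than a real simulation.
from dataclasses import dataclass

@dataclass
class ArmDesign:
    arm_length_m: float
    payload_kg: float
    actuator_torque_nm: float

def grade(design: ArmDesign, g: float = 9.81) -> float:
    """Score by torque margin against the worst-case static holding torque."""
    required = design.payload_kg * g * design.arm_length_m
    margin = design.actuator_torque_nm / required
    return max(0.0, min(1.0, margin / 2.0))  # full credit at a 2x safety margin

print(grade(ArmDesign(arm_length_m=0.5, payload_kg=2.0, actuator_torque_nm=15.0)))  # ~0.76
```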