“The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck Shlegeris

Dec 4, 2025
36 mins

Episode Description

Subtitle: The basic arguments about AI motivations in one causal graph.

Highly capable AI systems might end up deciding the future. Understanding what will drive those decisions is therefore one of the most important questions we can ask.

Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond by saying reward is not the optimization target, and instead reward “chisels” a combination of context-dependent cognitive patterns into the AI. Some argue that powerful AIs might end up with an almost arbitrary long-term goal.

All of these hypotheses share an important justification: an AI with any of these motivations behaves in ways that are highly fit according to reinforcement learning.

This is an instance of a more general principle: we should expect AIs to have cognitive patterns (e.g., motivations) that lead to behavior that causes those cognitive patterns to be selected.

In this post I’ll spell out what this more general principle means and why it’s helpful. Specifically:

  • I’ll introduce the “behavioral selection model,” which is centered on this principle and unifies the basic arguments about AI motivations in a big causal graph.

  • I’ll discuss the basic implications for AI motivations.

  • [...]

---

Outline:

(02:19) How does the behavioral selection model predict AI behavior?

(05:25) The causal graph

(09:26) Three categories of maximally fit motivations (under this causal model)

(09:46) 1. Fitness-seekers, including reward-seekers

(11:48) 2. Schemers

(14:08) 3. Optimal kludges of motivations

(17:36) If the reward signal is flawed, the motivations the developer intended are not maximally fit

(19:56) The (implicit) prior over cognitive patterns

(24:13) Corrections to the basic model

(24:28) Developer iteration

(27:06) Imperfect situational awareness and planning from the AI

(28:46) Conclusion

(31:33) Appendix: Important extensions

(31:39) Process-based supervision

(33:09) White-box selection of cognitive patterns

(34:40) Cultural selection of memes

The original text contained 21 footnotes which were omitted from this narration.

---

First published:
December 4th, 2025

Source:
https://blog.redwoodresearch.org/p/the-behavioral-selection-model-for

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Illustrative depiction of different cognitive patterns encoded in the weights changing in influence via selection.

The causes and consequences of being selected (“I have influence through deployment”), from the perspective of a cognitive pattern during one episode of training (e.g., for any cognitive pattern, getting higher reward on this training episode causes it to have higher influence through deployment). This graph helps us explain why schemers, fitness-seekers (including reward-seekers), and certain kludges of motivations are all behaviorally fit.

Flowchart showing influence through deployment leading to long-term power and paperclips.

We can see various arguments about priors in this causal graph: There is only one instruction-follower; Meanwhile, there are many schemers, so their cumulative prior might be large; But further causally upstream motivations might have an advantage in terms of reasoning speed and reliability; However, if you go too far upstream then a vast mess of drives are required to get maximally selected, which might be penalized for poor simplicity.

Developers noticing hacking can prevent a cognitive pattern from having influence in deployment.

Flowcharts comparing outcome-based supervision versus process-based supervision pathways to power.

Diagram comparing behavioral selection and white-box selection processes for AI cognitive patterns.

Two flowcharts comparing selection of cognitive patterns versus cultural selection of memes.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
