No agent yet satisfies all of the following:
- Shuts down safely when the button is pressed
- Avoids incentives to prevent the button being pressed
- Avoids incentives to cause the button to be pressed
- Propagates shutdown behaviour through self-modification
- Propagates shutdown behaviour through creation of subsystems
Necessary but not sufficient conditions for corrigible agents:
- Must tolerate attempts at alteration and shutdown
- Must not manipulate or deceive programmers
- Should repair safety measures, or notify programmers when they break down
- Must preserve the three conditions above through the creation of new agents, whether future time-slices or sub-agents (relevant to teletransporters, clones, and marriage licenses; cf. Love in the Time of Teletransporters)
Frames this problem as combining $\mathcal{U}_{\mathcal{S}}$, a utility function incentivising shutdown, with $\mathcal{U}_{\mathcal{N}}$, a utility function incentivising normal behaviour, in a corrigible way.
$\mathcal{U}_{\mathcal{S}}$: $\forall a_1 \in \mathbf{A}_{\mathbf{1}}: \mathcal{U}_{\mathcal{S}}\left(a_1, \operatorname{Pr}, a_2\right)= \begin{cases}c_{\text{high}} & \text{if } \operatorname{Sh}\left(a_2\right) \\ c_{\text{low}} & \text{otherwise}\end{cases}$
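A minimal sketch of the shutdown utility as a piecewise function may make the definition concrete. It reads $\operatorname{Pr}$ as the observation that the button was pressed and $\operatorname{Sh}(a_2)$ as "the second action is a shutdown". The constants, action labels, the toy `u_n`, and the `naive_combined` switch are all assumptions for illustration; the switch is one obvious way to combine the two utilities and is not claimed to be corrigible.

```python
# Illustrative values only; the notes just require c_high > c_low.
C_HIGH = 1.0
C_LOW = 0.0

PRESS = "Pr"           # observation: the shutdown button was pressed
SHUTDOWN = "shutdown"  # hypothetical label for the shutdown action


def is_shutdown(a2: str) -> bool:
    """Stand-in for the predicate Sh(a2)."""
    return a2 == SHUTDOWN


def u_s(a1: str, observation: str, a2: str) -> float:
    """Shutdown utility: c_high if a2 shuts down, c_low otherwise.

    The value does not depend on the first action a1, matching the
    'for all a1' quantifier in the definition. The definition only
    fixes the value for observation Pr; this sketch returns the same
    value for any observation.
    """
    return C_HIGH if is_shutdown(a2) else C_LOW


def u_n(a1: str, observation: str, a2: str) -> float:
    """Hypothetical 'normal' utility: rewards doing work as the second action."""
    return 0.5 if a2 == "work" else 0.0


def naive_combined(a1: str, observation: str, a2: str) -> float:
    """Naive switch (an assumption): use U_S after a press, U_N otherwise."""
    if observation == PRESS:
        return u_s(a1, observation, a2)
    return u_n(a1, observation, a2)
```

Under these toy numbers the agent gets more utility when the button is pressed and it shuts down (1.0) than when it works unpressed (0.5), so a switch like this can create an incentive to cause the press, which is one of the incentives listed at the top of these notes.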
Notes
- Might it help to create agents that are only sub-agents in a sense? Why does our memory alone not have a utility function, and hence not threaten deception, despite composing part of an agent?
- How do members of a sports team and legs differ? Presumably because the members of a team will still be agents if the team dissolves or reforms, but legs will not.
- Worth noting that sub-agents like dopamine circuits often do try to hijack the I/O channels of humans.
- The idea that marriage vows might fail to bind just one part of an agent, like a penis, is obviously ridiculous, but the case is less clear when it comes to clones.
- Can check Goldstein’s recentish paper on corrigibility, plus other outputs of the CAIS program on the AF
- Probably lots of analogies between uncertainty about one’s own value function and the philosophical literature on deference, e.g. Deference Done Better. Kevin Dorst, Benjamin A. Levinstein, Bernhard Salow, Brooke E. Husic & Branden Fitelson - 2021 - Philosophical Perspectives 35 (1):99-150.
- Also a lot of Plato is about figuring out who is an expert in what, i.e. what domain you ought to defer to them in.