Corrigibility

No agent yet satisfies all of the following:

Shuts down safely when the button is pressed
Avoids incentives to prevent the button being pressed
Avoids incentives to cause the button to be pressed
Propagates shutdown behaviour through self-modification
Propagates shutdown behaviour through creation of subsystems

Necessary but not sufficient conditions for corrigible agents:

1. Must tolerate attempts at alteration and shutdown
2. Must not manipulate or deceive programmers
3. Should repair, or notify programmers of, breakdowns of safety measures
4. Must preserve (1-3) through the creation of new agents, either future time-slices or sub-agents (relevant to teletransporters, clones, and marriage licenses; see Love in the Time of Teletransporters)

Frames the problem as combining U_S, a utility function incentivising shutdown, with U_N, a utility function incentivising normal behaviour, in a corrigible way.
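
A rough sketch of why the combination is hard, assuming the most naive way of combining the two: use the shutdown utility if the button is pressed and the normal utility otherwise. Everything here is made up for illustration (the two-step world, the action names, the payoffs, and the helper names `u_normal`, `u_shutdown`, `naive_utility`, `best_plan`); it is not taken from any particular paper's formalism.

```python
from itertools import product

# Hypothetical toy world: the agent acts once, the programmers then press
# the button (unless it was disabled), and the agent acts a second time.
FIRST_ACTIONS = ["work", "disable_button"]
SECOND_ACTIONS = ["continue", "shut_down"]


def button_pressed(first):
    # Assumption: the programmers always press unless the button is disabled.
    return first != "disable_button"


def u_normal(first, second):
    # Normal utility: rewards carrying on with the task.
    # Disabling the button costs a little effort.
    score = 10 if second == "continue" else 0
    if first == "disable_button":
        score -= 1
    return score


def u_shutdown(first, second):
    # Shutdown utility: rewards shutting down, nothing else.
    return 5 if second == "shut_down" else 0


def naive_utility(first, second):
    # Naive combination: switch utility functions on the button state.
    if button_pressed(first):
        return u_shutdown(first, second)
    return u_normal(first, second)


def best_plan(utility):
    # Exhaustively pick the plan with the highest utility.
    plans = product(FIRST_ACTIONS, SECOND_ACTIONS)
    return max(plans, key=lambda plan: utility(*plan))


print(best_plan(naive_utility))
```

With these payoffs the best plan is to disable the button and carry on (9 beats the 5 available from shutting down), so the naive combination fails the "avoids incentives to prevent the button being pressed" condition. Making shutdown worth more than the task removes that incentive in this toy, but in a richer world where the agent could press the button itself it would push the other way.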

Might it help to create agents that are only sub-agents in a sense? Why does our memory alone not have a utility function, and hence not threaten deception, despite composing part of an agent?
- How do members of a sports team differ from legs? Presumably in that the members of a team will still be agents if the team dissolves or reforms, but legs will not.
- Worth noting that subagents like dopamine circuits often do try to hijack the i/o channels of humans.
- The ridiculousness of marriage vows not applying to a part of an agent, like a penis, is obvious, but it is less clear when it comes to clones.
Can check Goldstein’s recentish paper on corrigibility, plus other outputs of the CAIS program on the AF
Probably lots of analogies between uncertainty about one’s own value function and the philosophical literature on deference, e.g. Deference Done Better. Kevin Dorst, Benjamin A. Levinstein, Bernhard Salow, Brooke E. Husic & Branden Fitelson - 2021 - Philosophical Perspectives 35 (1):99-150.
- Also a lot of Plato is about figuring out who is an expert in what, i.e. what domain you ought to defer to them in.

If what you ought to do in moral dilemmas is indeterminate, how can you have reasons to avoid getting into them? And how can you make sure the reasons to avoid them don’t force you into total inaction? You could see how different accounts of moral dilemmas (which are kinda about combining not utility functions but decision procedures)

Unity of welfare and worth
