Ng and Russell note that shaping rewards need to be a conservative field (e.g. like gravitational potential energy), so that they ultimately reward only world states, not actions, and don't incentivise cycles of actions like riding in circles repeatedly.
The constraint is
F(s, a, s') = γΦ(s') − Φ(s)
Where:
- F is the shaping reward for a transition (the term added on top of the environment's own reward)
- Φ is a "potential function" that maps each state to a number
- γ is the discount factor (1 in the undiscounted case)
- The shaping reward for any action depends only on the potential difference between the destination and origin states
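One way to see why this rules out loops (a minimal Python sketch; the grid world, the goal at (3, 3), and the Manhattan-distance potential are all invented for illustration): the shaping reward summed along any path telescopes, so the total depends only on the endpoints, and any closed loop nets exactly zero.

```python
# Minimal sketch: potential-based shaping F(s, a, s') = gamma * phi(s') - phi(s)
# summed over a path telescopes to (roughly) phi(end) - phi(start), so loops earn nothing.

GAMMA = 1.0  # undiscounted case so the sum telescopes exactly

def phi(state):
    """Hypothetical potential: negative Manhattan distance to a goal at (3, 3)."""
    x, y = state
    return -(abs(3 - x) + abs(3 - y))

def shaping_reward(s, s_next, gamma=GAMMA):
    """Potential-based shaping term for one transition."""
    return gamma * phi(s_next) - phi(s)

def total_shaping(path):
    """Sum of shaping rewards along a sequence of states."""
    return sum(shaping_reward(s, s_next) for s, s_next in zip(path, path[1:]))

direct = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
loopy  = [(0, 0), (1, 0), (0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
cycle  = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]

print(total_shaping(direct))  # 6.0 : phi(goal) - phi(start)
print(total_shaping(loopy))   # 6.0 : same endpoints, same total; the detour earns nothing
print(total_shaping(cycle))   # 0.0 : any closed loop nets zero shaping reward
```

Whatever the agent does in between, riding in circles collects no net shaping reward from F.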
Do deontologists (people who prescribe criteria of rightness / decision procedures that deal directly with agents' policies/actions, not with world states directly) need to ensure their ("shaping") rewards are conservative fields so they don't count action loops as right/praiseworthy? I mean maybe, but the constraint above requires complete path independence (any two paths from a→b get the same total reward), which is too restrictive for what deontologists want. I guess you could decompose the sequence of actions into individual actions, reward each step, then sum to get the 'praiseworthiness' of the path?
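To make the decompose-and-sum idea concrete (again a minimal sketch; the 'tell_truth' step score is a made-up stand-in for a deontological rule, not any standard formalism): scoring each (state, action) step directly and summing is exactly the kind of non-potential reward the constraint above forbids, and because it's path-dependent, repeating a 'right' action in a loop keeps earning credit.

```python
# Minimal sketch of per-step, action-level scoring summed over a path.
# Unlike potential-based shaping, the total depends on the path taken,
# so a loop of individually "right" actions accumulates unbounded credit.

def step_score(state, action):
    """Hypothetical deontological step score: telling the truth is praiseworthy
    regardless of where it leads."""
    return 1.0 if action == "tell_truth" else 0.0

def path_praiseworthiness(path):
    """Sum of per-step scores along a list of (state, action) pairs."""
    return sum(step_score(s, a) for s, a in path)

honest_once = [("s0", "tell_truth"), ("s1", "walk")]
honest_loop = [("s0", "tell_truth"), ("s1", "walk_back"),
               ("s0", "tell_truth"), ("s1", "walk_back"),
               ("s0", "tell_truth"), ("s1", "walk")]

print(path_praiseworthiness(honest_once))  # 1.0
print(path_praiseworthiness(honest_loop))  # 3.0 -- same endpoints, more credit for looping
```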
- I think my point was just that caring about path dependence (i.e. not being a pure consequentialist) means you can't use the above constraint, which means you need to figure out some alternative check or condition to ensure your system of rules/rewards doesn't incentivise loops, and doesn't let less important rules/rewards incentivise policies that are suboptimal wrt lexically more important ones.
- Imperfect alternatives would be stuff like novelty constraints or action-frequency penalties; a rough sketch of the latter is below.
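For instance, a crude action-frequency penalty (the decay schedule and weights here are invented for illustration) makes repeats of the same (state, action) pair pay out progressively less, so loops stop being profitable, at the cost of also discounting legitimately repeated right actions.

```python
# Minimal sketch of an action-frequency penalty as a loop deterrent:
# the k-th repeat of a given (state, action) pair earns decay**k times the base score.

from collections import Counter

def penalised_praiseworthiness(path, step_score, decay=0.5):
    """Sum per-step scores, discounting repeats of the same (state, action) pair."""
    counts = Counter()
    total = 0.0
    for state, action in path:
        total += step_score(state, action) * (decay ** counts[(state, action)])
        counts[(state, action)] += 1
    return total

step_score = lambda s, a: 1.0 if a == "tell_truth" else 0.0

honest_loop = [("s0", "tell_truth"), ("s1", "walk_back")] * 3
print(penalised_praiseworthiness(honest_loop, step_score))  # 1.75 instead of 3.0
```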