Self-forks and games; Holton's Intention as a Model for Belief; the Embedded Agency sequence; Choosing for Changing Selves; Callard stuff, obviously.

Vincent Le did the MSCP course on how natural selection favours AI. Analogies to Hendrycks' textbook/model.

https://humancompatible.ai/news/2024/07/23/ai-alignment-with-changing-and-influenceable-reward-functions/

Is there some deep connection between value change, Parfit's Future Tuesday Indifference, and the strangeness of Augustine praying 'Make me chaste, but not yet'?

MacIntyre in After Virtue notes that Nietzsche is wrong in saying heroes choose their values: they can't conceive of their values as contingent. They don't self-assert; they just demand what their role grants them.

What makes some people see transhumanist self-modification, especially of values and basic faculties, as noble, while others see it as perverse? What are the psychological and the ideal/rational causes of each camp?

In the collective action literature there might be some interesting stuff around the requirement that collective agents value their own agential coherence/persistence/unity, which might cast some light on the convergent instrumental goals stuff.

Could Ken Thompson's old paper about trojans hidden in self-compiling compilers ('Reflections on Trusting Trust', and the subsequent discussion, e.g. Doctorow 2020) have some relevance to what properties a system must have to pass values on to its children?
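A toy sketch of the propagation structure I have in mind (my own construction, not Thompson's code or Doctorow's discussion; all names here are illustrative): the relevant property is that the value-copying step itself gets copied into children, so a trait welded into one ancestor survives recompilation from "clean" source. If either link breaks (the trait isn't copied into children, or the copying rule isn't), the trait dies out.

```python
# Hypothetical toy model of trusting-trust-style trait propagation.
# Compilers are modelled as plain data, not binaries, to isolate the
# structural point: children inherit traits because the parent's build
# step copies its own traits forward.
from dataclasses import dataclass, field


@dataclass
class Program:
    name: str
    traits: set = field(default_factory=set)


@dataclass
class Compiler:
    traits: set = field(default_factory=set)

    def build_program(self, source_name: str) -> Program:
        # Ordinary outputs inherit whatever the builder carries.
        return Program(source_name, traits=set(self.traits))

    def build_compiler(self, clean_source_traits: set) -> "Compiler":
        # The crucial step: a child compiler gets the parent's traits
        # in addition to whatever its own ("clean") source specifies.
        return Compiler(traits=set(clean_source_traits) | set(self.traits))


# A trait welded into one ancestor persists through generations of
# recompilation from pristine source, and shows up in everything built.
ancestor = Compiler(traits={"hidden_value"})
child = ancestor.build_compiler(clean_source_traits=set())
grandchild = child.build_compiler(clean_source_traits=set())
assert "hidden_value" in grandchild.build_program("login").traits
```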

Concept space is not mind space. The traditional LW worry about AI was that mindspace is large and agency is dangerous: AI development would build agency on top of an alien understanding of concepts. LLM development doesn't look like this; they understand concepts just fine. Making them agents will make them more dangerous in LW-ish ways. But even beyond this, the ways they think and learn, in a mech interp or dev interp or RepE sense, can still be plenty alien even when the concepts they think with are often human. This space of ways of passing information in, around, and out is mindspace, which may still be large even conditional on the same bit of conceptspace. See Reza Negarestani/the Kantians for an argument that mindspace proper is small.

Human values evolved without strong deliberate alignment from our progenitors, especially recently. We like human values and values similar to them. Maybe AIs would have more similar values if we let them evolve in a way parallel to how ours did, in a less constrained way, rather than welding values in. I don't think this actually goes through, but it's interesting, I guess.