The Alignment Problem and AI safety notes

Social choice/arrow impossibility and alignment: https://arxiv.org/abs/2310.16048

Ten AI safety projects I’d like people to work on https://thirdthing.ai/p/ten-ai-safety-projects-id-like-people

Models could generate two random keys at the start of a chat to tag their own vs the users posts, and have an instruction in the sys prompt or some other implementation that makes them flag or reject prompt injection attacks not wrapped in their own code. Same goes for another sys prompt key tag maybe.

Does CCS work better on middle layers for intuitively ‘more complex’ questions, suggesting the representations of beliefs are built up atomically layer by layer similarly to machine vision systems detection circuits? Idk apparently middle layers can outperform later layers when the prompts have misleading prefixes, which suggests its the later syntactical combinations that trip the models up when misleading prefixes are in play

X-risk concerns around nanotech and analogies to current AI risk

I think one could make significant progress on these questions just by talking to people who were engaged in (or at least aware of) debates around transformative nanotech in the 1980s or 1990s, including Eric Drexler, Christine Peterson, Eliezer Yudkowsky, and Robin Hanson. It would also be useful to read available histories of nanotechnology and to read essays, news coverage, popular fiction, and mailing list discussions from this period. (Ben Garfinkel - here)

Randall collins, can sociology create an artificial intelligence? cf hayek?

How much compute occurs via social cognition? Technically this is part of the environmental compute, but it may deserve special attention

We can think of pure pseudo agents like a bundle of fairly comprehensive if-then statements, these will reliably steer small, simple environments to certain states, but they won’t do crazy optimising stuff that Yudkowsky AGI would. But how do we know economic incentives will actually force systems to cross some threshold of agentiness where self-improvement, deception, misaligned OOD goals will actually result?

It looks like agents do generalise across full-information environments like Atari games, (are they all full info though???)

This is a reframing of the crux between Gwerns tool AI stuff and me.

And yet it understands - a lookup table of word- correlations would be too large to fit 8n the universe, yet Chat GPT fits on 800GB or so, so it must be compressing the data, i.e. abstracting, i.e. understanding concepts. Caveat of idk why regular compression algorithms arent an option for conpressing without understanding concepts, maybe they’re only so efficient and ChatGPT beats them out? https://www.lesswrong.com/posts/wZAa9fHZfR6zxtdNx/agi-systems-and-humans-will-both-need-to-solve-the-alignment

Generally, how do semantics and compression relate?
See https://philpapers.org/rec/WILUAC-2
- Subquestions: how do information and understanding relate? Why do we care about understanding? How do understanding and semantics relate?
- Apparently understanding relates to explanation: Explanation, Ethics, and AI

And posts linked in comments have good stuff abt. Will AGI foom given difficulty of aligning successor states. Ties in with questions from continental AI course

Solutions to the Gavagai problem should also be solutions to the superposition/misgeneralising out of distribution problem in AI safety. For example, the natural abstraction hypothesis is a solution to both.

Cant competitions or interactions steer reality into narrow state spaces without being agents?

relevant to q of whether simulated maximisers have to escape and overtake their hosts or can you get systems that steer reality without being maximisers (by being interactive systems)

Does the training/inference cost asymmetry change any of the economic reasoning in gwerns tool AI?

is simple slowness an argument about the china brain argument against functionalism? (I.e. the input and outputs must be included in the functional definition)

Surely plans arent the only future focused attitudes that can ground prudential reasons. Even if they are we need to avoid the sadistic conclusion for animal lives not worth living

Do LLMs support or undermine chomsky on the poverty of the stimulus? Depends on the relative input sizes during training I guess

Current takeoff speeds (not when-if we foom) seem perfect for making people care about alignment and digital minds, they’re fast enough that people who say some skeptic bs get disproven soon enough to look stupid

Will AI systems still have large enough oppourtunity costs for humans to have comparative advantages anywhere?

Will AIs cause unemployment where other tech hasn’t because it’s so cheaply copyable? But so is all software, and much of it doesn’t lead to mass unemployment. Maybe it’s something to do with how it affects productivity of people already trained to do the task it does, vs. people not yet trained/the AI acting solo?

One way the transhumanist origins of ai x-risk concerns made things worse is by undermining the desirability of simply stopping ai progress bc they cared so much about the upsides, especially their own immortality. Recall katja grace’s comments on this stuff.

How did nanotech thought fit in to the same eschatological frame?

Bird and Layzell (2002) used genetic algorithms to evolve a design for an oscillator, and found that one of the solutions involved repurposing the printed circuit board tracks on the system’s motherboard as a radio, to pick up oscillating signals generated by nearby personal computers.

The fact species dumber than humans don’t cluster near us in int suggests were not the hard limit for useful intelligence

Word2vec distances between embeddings match time-delays on implicit bias tests, suggesting both reflect the same conceptual structure.
Can use word embedding across times or subcultures to measure differences in biases

Also existing risk of predictive systems increasing existing biases, e.g. by predicting black neighbourhoods are more likely to commit crime due to higher arrest rate. Feedback loop

EU law about right to explanations for ML decisions is an application of philosophical work on explanation

Dawes work on clinical and statistical prediction suggests almost all the work is in knowing what variables are relevant, not in how you weight or combine them. In fact random weights will beat clinical predictions on e.g. college performance. So simple models can do great (at least on input data that is predigested, like diagnoses vs. raw scans)

Kerr’s paper on incentives in business for goodharting

Non-conservative fields

Interesting to look at evolutionary simulations of reward functions and how the behaviours/preferences in simulations differ from real world ones. The areas where they do differ should be the ones where embedded agency is practically relevant.

Reward shaping is the essence of gameification. Principles include avoiding sparse rewards, and building up attentional and motivational systems on easier to learn games with denser rewards. For example unavoidable 1up mushrooms in mario bros. So gamification makes us the kinds of agents that good reward shaping makes ML ones, competent, but prone to goodharting. Is there anything else?

Curiosity seems surprisingly simple mathematically re prediction density. Seems like curious agents can turn zero sum to positive sum games with pong.

Knowledge seeking agents may be able to avoid wireheading because it isnt surprising

Possibilist actualist discussion in philosophy, and policy free(?) Learning in ML, and choosing the second best in economics all correspond to the ideal agent (possibilist) vs. consequentialist (actualist) virtue ethics

How relevant is the IDA framework and AlphaGoZero stuff for coherent extrapolated volition that brings the agents competence up along with the values getting better and better.

CIRL shows that predictability of actions, based on modelling an agent as rationally, efficiently pursuing simple goals, and legibility of actions, i.e. how easily another agent can infer your preferences/goals by observing you act, are at odds with one another. You can see this when people hand objects to another person, or try to signal theur intentions to pick up one of two glasses by sweeping an arm wide to one side instead of taking the most efficient path. This generalises somehow.

I think this is based on convergent instrumental goals. You need to violate from any convergently-instrumental course of action in order to signal that you’re doing a particular action. This is basically the idea of costly signalling, huh. I guess this means even non-deceptive AI systems will tend to be somewhat illegible, bummer.

How are legibility and predictability related? Im thinking of kafkaesque beauracracy. So these try to make things legible to them, but are massively illegible themselves. I guess theyre not very predictable either? Are their subjects unpredictable to them in some sense since theyre legible to them? I dont think so. How do these questions about states making things legible relate to us making our action legible, or AI legible wrt to their values/interpretability? See seeing like a state.

I had a note about the tradeoffs of transparency or something somewhere this is relevant: https://arxiv.org/abs/1606.03490
Question of whether to share your journal notes rhymes with MAIM deterrence through transparency

Best practice in human training of task swapping works well for AI human pairs too. Whats it called?

Will eternal recurrence style AI assistants that grease the wheels of our desires make virtue easier or harder?

Open category problem is interesting

Skills and traits are hard to learn, but information is not, see criticial thinking classes vs. medicine lectures. Why is this? It seems like theres a surprisingly deep or real distinction between the two.

Knowledge of ones own ignorance or uncertainty can be efficiently modelled by running a system with drop-out on, repeating the prompt and averaging over results. This is cheaper than bayesian neural nets. Might human self- knowledge of ignorance work the same way? By running forked/subagents and averaging over outcomes?

Stepwise reachable goals and attainable utility preservation are two options fir low impact agents

Analogies of attainable utility preservation as an approach to moral uncertainty.

Exchange rate of speeding up progress vs reducing extinction risk is something like 1000yrs progress = reducing risk by 1 millionth of 1 percent.

Long reflection is a CIG for all self-confident(?) moral theories, not nietzscheanism?
Moral progress and philosophy instead view moral philosophy as a bargaining game between different theories. Do they all want the same things? Probably not.

Ergodicity, finite discrete and transparent world states, stable goals, cartesian world agent relations, solo agents, etc. are many unrealistic assumptions of current ML. Might these be barriers?

Violating assumption of IID leads to cascading errors

Lower bound for nn efficiency is brain organelles

Unified framework for assessing reliability of testimonial evidence of consciousness would be good, pretty sure Rob Long or someone else from the CAIS fellowship has work on this

If you really want to know what it’s like to be a blob of cognition detached from higher brain tissues, fall asleep and dream. AI makes some of the same mistakes as dreaming humans - if you look at the lucid dreaming community, they’ve been saying forever that the two easiest ways to realize you’re in a dream are mangled hands and mangled text. AI also mangles these things, because they require some of the most processing and top-down modulation from higher-level world-models, and that seems to involve either a higher-scale connected network or something special about the dorsolateral prefrontal cortex (https://www.astralcodexten.com/p/somewhat-contra-marcus-on-ai-scaling)). (Scott Alexander)

What differentiates capabilities that emerge suddenly with scaling vs those that don’t? How does the distinction affect the value of different policies and safety techniques? E.g. the risk of abilities like deception emerging without warning shots, or capabilities that define levels in RSPs skipping levels. What abilities are we most worried about not getting warning shots for, and what evals, safety techniques, and governance practices depend on warning shots and which do not. I guess compute-amount defined reporting and other precautions like in SB 1047 don’t depend on warning shots which seems smart.

Progress on adversarial robustness on image classification is super slow, despite the smaller attack surface (perturbing a grid of pixels), yet the massive attack surface of LLMs hasn’t prevented much greater gains in adversarial robustness. What’s up with that?

How might the capgras and fregoli delusions relate to the ae studio self-modelling stuff (or other delusions idk if these are the right ones). Maybe more like the being full of bugs or glass ones. Monothematic delusions

What makes people have the virtue of filial piety, and how could you instill it in AIs? Just regular emergent values lol I don’t think this frame helps.

Monothematica

Explorer

The Alignment Problem and AI safety notes

Graph View

Backlinks