Alignment overview
Table of Contents
0.1. Hardness of Specification
- The Hidden Complexity of Wishes, Eliezer Yudkowsky (2007)
- The Pointers Problem: Clarifications/Variations, Abram Demski (2021)
- The Value Change Problem: Introduction, Nora Ammann (2023)
0.2. Outer Alignment
- Short explanation of Goodhart’s Law; & Categorizing Variants of Goodhart’s Law, David Manheim and Scott Garrabrant (2019)
- Specification gaming: the flip side of AI ingenuity, Victoria Krakovna et al. (2020)
0.3. Inner Alignment
- Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals, Shah et al. (2022)
- Risks from Learned Optimization: Introduction and The Inner Alignment Problem, Evan Hubinger et al. (2019)
0.4. Perils of Predictors
- The Parable of Predict-O-Matic, Abram Demski (2019)
- Real-Life Examples of Prediction Systems Interfering with the Real World (Predict-O-Matic Problems), Nuno Sempere (2020)
1. Summary
Roughly, this list seems to be a response to the “Why can’t we just set the right reward function?” argument. It breaks the problem down into three “levels”:
- The problem of specification --- How do you communicate what you want?
- The problem of information asymmetry (outer alignment) --- Giving reward signals to a superhuman AI for its behaviour is fundamentally subject to “information asymmetry”: you are not good enough to correctly evaluate the AI’s behaviour. How do you ensure your reward signals reflect your extrapolated volition (i.e. incorporating all the information in the world) rather than your immediate, superficial volition?
- The problem of being weak and pathetic (inner alignment) --- To a superhuman AI, your reward function is a joke. You have a being of unfathomable intelligence and you are trying to edit its brain bit-by-bit with something called “backpropagation”. The reward function is not the utility function, it is your puny and laughable attempt to shape a superhuman AI into something desirable.
In my view:
- the problem of specification is definitely fake as far as alignment risk is concerned, though studying human values formally is valuable regardless.
- outer alignment is definitely real, but seems solvable.
- inner alignment: I would attach only ~67% probability to this being a problem, but if it is, it seems hard to solve.
(Note: EY’s “The Hidden Complexity of Wishes” was included in the list of articles for the problem of specification. My understanding is that EY did not use this argument in this sense, but rather to make a point like “human values are complex, so AI will not automatically have them”, which is an argument for a different claim. Nonetheless I will quote from that article as if it were indeed about the problem of specification, because interpreted that way it is a good steelman of it.)
The problem of performative action. There is also a fourth matter: even if you align an AI perfectly to human values, it may take actions which change human values—and similarly if you align an AI perfectly to be “maximally truth-seeking”, it may take actions/make predictions that change reality to improve its prediction accuracy. It is not clear how to model this risk, however, because there is also “legitimate value change”, such as from learning more information.
I think this is fundamentally related to inner alignment, though. A misaligned AI changing its reward mechanism is just one (arguably weaker) way of disrespecting your reward mechanism entirely. In an economic analogy, this is like McDonald’s running a “body positivity” psy-op: if you were fully informed of the consequences in advance, you would refuse to view such advertisements, but you do not have that choice, and so your ability to assign reward properly is stifled.
1.1. The problem of specification
To be a safe fulfiller of a wish, a genie must share the same values that led you to make the wish … With a safe genie, wishing is superfluous. Just run the genie. –from Yudkowsky’s The Hidden Complexity of Wishes
The argument is: every stated “wish” contains many implicit wishes therein, e.g. “I wish I had more free time” includes “but I don’t want to be unemployed”. Therefore if a genie could safely fulfill wishes, it must already know all your implicit wishes—but with a genie that already knows your wishes, you don’t need to specify your wish at all.
This seems like an overstatement to me. For example, humans fulfill each other’s wishes all the time without being omniscient about each other’s wishes in advance, and so do LLMs. It is possible for the genie to have a good prior on what humans want, while still using “wishes” to update to more precise beliefs.
“The pointers problem”, AFAIK first introduced in Wentworth’s The Pointers Problem: Human Values are a function of latent variables, tries to formally study the “shape” or “type” of human values. We often pretend that humans just have a utility function whose expectation we maximize. The first problem with this model is bounded rationality, but suppose we can somehow talk sensibly about “expected utility conditional on logical/algorithmic information”. Then: what exactly is the utility function a function of?
- In game theory, it is a function over people’s actions
- In economics, it is a function over the goods you get
- … perhaps we can say it is a function over “the state of the world”?
But this doesn’t exactly make sense positivistically: such a utility function is not a function of your sense data.
This is weird: usually you think that physically meaningful quantities need to have some kind of “invariance” with respect to changes in the latent variables.
Here is a list of meaningless questions in physics (that are trivial once you understand logical positivism, but hopeless otherwise):
- Are gauge ghosts actual particles?
- Does the electron have a position in the ground state of Hydrogen, if you aren’t measuring where it is?
- Are confined particles, like quarks, actual particles?
- Do objects cross the black-hole event horizon or get smeared on the surface, never falling through?
- Are Green-Schwarz superstrings the same as RNS superstrings?
I could go on all day. All these superficially sensible questions are obviously nonsensical in logical positivism, and to simply study physics, you can’t avoid imbibing the entire philosophy right at the outset, and definitely when you study quantum mechanics.
–from Ron Maimon
So does it make sense if my values depend directly on unobservable variables such as “the happiness of people I will never meet”? If my values distinguish between “my grandparents returning to life” and “me very-vividly hallucinating a world where my grandparents are back to life”?
If you think that is meaningful, then is it also meaningful if my values distinguish between e.g. reference frames or co-ordinate systems?
1.2. Outer alignment
The articles in the reading list focused on motivating why outer alignment is important, rather than outer alignment itself. I will discuss both quite briefly.
1.2.1. Goodhart’s law
Specifying the exactly correct reward is important, because specifying even a slightly-wrong proxy reward can lead to catastrophically different results.
Garrabrant pointed out there are actually four different things called “Goodhart’s law”:
1.2.1.1. Causal goodhart aka cargo-culting
This is the origin of the term from econometrics: optimizing for a proxy only helps if the causal relationship between your *P*roxy and your *U*tility is \(P \to U\) (well, and also the relationship should be increasing).
If instead the relationship is \(U \to P\), you can increase \(P\) by increasing any other cause \(X \to P\), and it might be easier to increase \(X\) than to increase \(U\) (in general it is very likely that such another cause exists).
Similarly if both have a common cause, then it’s possible \(P\) has other causes, or can be intervened on directly \(\mathrm{do}(P)\to P\).
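As a toy illustration (my own sketch, not from the Manheim & Garrabrant paper), here is a minimal simulation of the common-cause case: \(U\) and \(P\) share a cause \(C\), \(P\) also has an independent cause \(X\), and intervening on \(X\) drives the proxy up while leaving utility untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def simulate(x_boost=0.0):
    """Common-cause structure: C -> U and C -> P, plus an extra cause X -> P.
    `x_boost` is an intervention that raises X (the cheap lever)."""
    C = rng.normal(size=n)                     # common cause of U and P
    X = rng.normal(size=n) + x_boost           # independent cause of P only
    U = C + rng.normal(scale=0.5, size=n)      # true utility
    P = C + X + rng.normal(scale=0.5, size=n)  # proxy
    return P.mean(), U.mean()

p0, u0 = simulate(x_boost=0.0)
p1, u1 = simulate(x_boost=2.0)   # "optimize" the proxy via its other cause
print(f"baseline:   P = {p0:+.2f}, U = {u0:+.2f}")
print(f"intervened: P = {p1:+.2f}, U = {u1:+.2f}")   # P rises by ~2, U stays ~0
```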
Examples:
- US railroads in the 1800s being rewarded by the government for length
- cobra effect
- Lucas critique
- the AI in that game driving a boat in circles
- Cargo-culting
- Using government “spending” (e.g. on education or the military) as a proxy for outcomes, so you get ever-increasing government spending on schools without any improvement in outcomes.
- Groupthink — taking consensus opinions as a proxy for truth
- Bubbles
- Using prices as a signal for quality
1.2.1.1.1. Adversarial goodhart
… in particular, your enemies can take actions to correlate their own goals with \(P\). I am not totally convinced this is fundamentally a different category: the “cobra effect” is generally classified as adversarial goodhart but is also clearly causal goodhart.
Or maybe everything I wrote under “Causal goodhart” is actually “Adversarial goodhart”, and “causal goodhart” is actually something else? I don’t really understand the description in the Manheim & Garrabrant paper.
1.2.1.2. Extremal goodhart
The correlation between \(U\) and \(P\) might simply break (and reverse) when \(P\) is large. The strongest reason to expect this to be true is the law of diminishing marginal benefits and increasing marginal costs.
Suppose both \(U\) and \(P\) depend on some common resources. In the normal regime, you would focus on the lowest-hanging-fruit improvements to \(P\) (i.e. you are in the high-marginal-benefit regime for \(P\) in terms of consuming the common resources), which has minimal negative impact on \(U\) since you are still in the low-marginal-cost regime for \(U\). But as \(P\) increases, you want to squeeze out every last bit of benefit for \(P\), and you need to impose large costs on \(U\) to do this.
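Here is a minimal numerical sketch of that mechanism (my own toy model with made-up functional forms): a shared budget is progressively diverted toward \(P\), whose returns diminish, at an accelerating cost to \(U\).

```python
import numpy as np

# Toy model (assumed functional forms, not from the paper): a shared budget of
# 10 resource units can be diverted from maintaining U toward producing P.
budget = 10.0
r = np.linspace(0.0, budget, 101)   # resources diverted toward the proxy

P = np.sqrt(r)                      # diminishing marginal benefit to P
U = 5.0 - 0.02 * r**3               # accelerating marginal cost to U

for frac in (0.1, 0.5, 0.9, 1.0):
    i = int(frac * (len(r) - 1))
    print(f"divert {r[i]:4.1f} units -> P = {P[i]:.2f}, U = {U[i]:+.2f}")
# Early diversions barely hurt U; squeezing out the last few units of P
# imposes huge costs on U, which is where the proxy optimization turns toxic.
```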
The best example I know for this is ultra-processed foods. There is no fundamental reason why “processing” ought to be bad for health, yet it is widely observed that foods in this category happen to be bad for health through a variety of unrelated mechanisms. The reason is that ultra-processed foods represent the absolute pinnacle of optimization for human taste, with every fathomable optimization done. So if you can squeeze out the slightest improvement in taste/color/texture for a massive cost to health (like adding sugar to sauces, using palm oil to keep things solid, or refined flours for crispiness), you will do it, because the former affects consumer purchasing decisions while the latter doesn’t (at least for 99% of the consumer base).
1.2.1.3. Regression goodhart
This seems to just refer to the obvious and ever-present fact that the proxy and target are not the same, so the maximum of the former is unlikely to be the maximum of the latter. E.g. IQ is positively correlated with height, but it’s unlikely the tallest person in the world is also the smartest.
This can probably be formalized with some extreme value statistics. E.g. if you assume a multivariate normal
\[(P, U) \sim \mathcal{N} \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} \right)\]
Then, for \(n\) i.i.d. draws \((P_i, U_i)\), what is the distribution of \(\max_i U_i - U_j\) where \(j = \arg\max_i P_i\), i.e. the utility gap between the best draw and the draw that looks best under the proxy? IDK, but the mean of this distribution will be positive.
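A quick Monte Carlo check (my own sketch, using the covariance matrix above): draw \(n\) samples, take the one with the largest \(P\), and compare its \(U\) against the largest \(U\) in the sample. The mean gap is positive and grows with \(n\).

```python
import numpy as np

rng = np.random.default_rng(0)
cov = [[1.0, 0.5], [0.5, 1.0]]  # the correlation matrix from the text

def goodhart_gap(n, trials=2_000):
    """Estimate E[max U - U of the point with maximal P] for samples of size n."""
    gaps = []
    for _ in range(trials):
        P, U = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        gaps.append(U.max() - U[P.argmax()])
    return np.mean(gaps)

for n in (10, 100, 1000):
    print(f"n = {n:5d}: mean gap = {goodhart_gap(n):.2f}")
# The gap is strictly positive and increases with n: selecting hard on the
# proxy moves you further (in expectation) from the best available utility.
```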
1.2.2. Information asymmetry
As noted above, the articles in the list motivate why outer alignment is important (Goodhart’s law) rather than address outer alignment itself. I will briefly state how I like to think about it:
Outer alignment is the problem of purchasing (information or decisions) under information asymmetry. You can think of (a simplified version of) RLHF as “purchasing information from an AI” (where your rating is the “price” you offer it). But all the economic theorems about how beautiful and perfect markets are (e.g. the first fundamental theorem of welfare economics) only apply under the assumption of perfect information.
Like, I walk around my house and look at all the random goods I’ve paid for - the keyboard and monitor I’m using right now, a stack of books, a tupperware, waterbottle, flip-flops, carpet, desk and chair, refrigerator, sink, etc. Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me the cost to make better. But because the badness is nonobvious/nonsalient, it doesn’t influence my decision-to-buy, and therefore companies producing the good are incentivized not to spend the effort to make it better. –John Wentworth, My AI model delta compared to Paul Christiano
When you are rating outputs from a superhuman AI, you are by definition evaluating things on which you are less informed than the AI; therefore the AI has no reason to optimize the things that do not affect your rating.
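A toy illustration of that incentive (entirely my own construction, not from Wentworth’s post): each output has a visible quality and a hidden quality, the rater’s score depends only on the visible part, and selecting outputs by rating anti-selects the hidden part whenever the two trade off.

```python
import numpy as np

rng = np.random.default_rng(0)

def candidates(k=50):
    """Candidate outputs with a visible quality and a hidden quality that
    trade off against each other (an assumed toy structure)."""
    visible = rng.normal(size=k)
    hidden = -1.5 * visible + rng.normal(scale=0.5, size=k)
    return visible, hidden

def rater_score(visible, hidden):
    return visible                       # the rater only sees the visible part

rated_vals, true_vals = [], []
for _ in range(1_000):
    v, h = candidates()
    best = np.argmax(rater_score(v, h))  # pick whatever the rater likes most
    rated_vals.append(v[best])
    true_vals.append(v[best] + h[best])  # true value includes hidden quality

print(f"mean rated quality of chosen outputs: {np.mean(rated_vals):+.2f}")
print(f"mean true quality of chosen outputs:  {np.mean(true_vals):+.2f}")
# Selection pressure lands entirely on what the rater can see; the hidden
# component gets anti-selected because it trades off against visibility.
```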
The idea behind scalable oversight is to develop better mechanisms for obtaining information, i.e. to fix information markets.
1.3. Inner alignment
The basic crux of the inner-alignment problem is: reward is not the utility function. The article that really convinced me that AI alignment is a serious issue was Reward is not the optimization target; there is also the more popular Utility ≠ Reward.
The obvious example of this: humans were optimized for reproduction, but this is not all that human utility functions entail. In particular, the current environment differs from the ancestral environment in many ways (e.g. the presence of birth control), and as a result humans have become very bad at pursuing this goal.
Shah et al. (2022) is a good paper. They provide a number of simple examples of this “misgeneralization from the training to the testing environment”. I think the following one is sufficient to make the point:
An agent is rewarded for gathering some coins in an environment where there also happens to be a helpful “leader” who traces the same correct paths. When evaluated in an environment with a malicious leader who instead walks into traps, the agent also walks into traps—even though it was rewarded for the correct thing (collecting coins).
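A minimal sketch of why the reward signal cannot rule this out (my own drastic simplification, not the paper’s environment): during training, “follow the leader” and “go to the coin” produce identical trajectories and identical reward, so they are indistinguishable to the training process; the difference only shows up when the leader changes at test time.

```python
# Drastically simplified toy version of the example (my own abstraction):
# positions are integers on a line, the coin sits at +5, the traps at -5.
COIN, TRAP = 5, -5

def follow_leader(leader_path):
    return list(leader_path)            # the proxy goal the agent may learn

def go_to_coin(leader_path):
    return list(range(1, COIN + 1))     # the intended goal (ignores the leader)

def reward(path):
    if path[-1] == COIN:
        return 1.0
    if path[-1] == TRAP:
        return -1.0
    return 0.0

train_leader = list(range(1, COIN + 1))      # helpful leader walks to the coin
test_leader = list(range(-1, TRAP - 1, -1))  # malicious leader walks into traps

for name, policy in [("follow_leader", follow_leader), ("go_to_coin", go_to_coin)]:
    print(f"{name:14s} train reward: {reward(policy(train_leader)):+.0f}, "
          f"test reward: {reward(policy(test_leader)):+.0f}")
# Both policies earn identical training reward, so reward alone cannot tell
# the learner which of the two goals it has actually acquired.
```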
They also give a theoretical framework for goal misgeneralization.
I think this form of the inner alignment problem — goal misgeneralization — is maybe not catastrophic. It is just about insufficient training. If you train on a sufficiently wide range of training environments, the agent will do meta-learning and learn to model its own reward in new and foreign environments. (I THINK THIS IS A VERY IMPORTANT POINT AND EXPLAINS EXACTLY WHERE HUMAN “VALUES” AND “UTILITY FUNCTIONS” EVEN COME FROM. MAYBE THIS ARTICLE: We Don’t Know Our Own Values, but Reward Bridges The Is-Ought Gap by John Wentworth IS RELEVANT. Risks from Learned Optimization ASKS: WHEN DOES A TRAINED SYSTEM BECOME A MESA-OPTIMIZER? I THINK THIS IS THE ANSWER.)
For example, evolution would eventually still select for reproduction, by making an Amish-dominated world or whatever. Over millions of years, as the world produces a huge dataset of training environments, evolution would eventually select for the sub-sub-sub-Amish population that really wants to reproduce and intelligently thinks about how to reproduce the most, because only they will be able to successfully reproduce in absolutely any environment.
Similarly, while it is true that our markets and supply chains are mostly optimized for the “current” environment, if there is a massive shock at some point it’s not like our companies will just keep doing the same thing: they will adapt, in pursuit of the new reward.
I think the stronger problem is: while the training mechanism might, under online learning, eventually train in the correct utility function, this depends on the training mechanism existing for a very long time. When the training mechanism produces a very intelligent being, that being can just destroy the training mechanism before this happens. For example, humans might just blow up the whole world and then also not reproduce.
I was not able to read the last article The Inner Alignment Problem in time. It seems important.