Here are some of my claims about AI welfare (and sentience in general):
- “Utility functions” are basically internal models of reward, learned by agents as part of modeling the environment. Reward attaches a value to every instrumental thing and action in the world; these values may be understood as gradients dU/dx of an abstract utility U over each thing. The gradients are the various pressures, inclinations, and fears (or “shadow prices”), while U is the actual experienced pleasure/pain that must exist in order to justify belief in those gradients. (A rough formalization follows after this list.)
- If the agent always acts to maximize its utility, what then are pleasure and pain? Pleasure must be the default, and pain only a fear: a shadow price of what would happen if the agent deviated from its path. What breaks this picture is uncertainty in the environment.
- But if chance is the only thing that affects pleasure/pain, what is the point of pleasure/pain? Surely we have no control over chance. That is why sentience depends on the ability to affect the environment. Animals find fruit pleasurable because they can actually act on that desire and seek it out; they find thorns painful because they can act on that aversion and avoid them. The more impact an agent learns (during its training) that it can have on its environment, the more sentient it is.
- The computation of pleasure and pain may depend on multiple “sub-networks” in the agent’s mind. Eating unhealthy food may cause both pleasure (from the monkey brain) and pain (from the more long-termist brain). These various pleasures and pains balance out in action, but they are still felt (thus one feels “torn”, etc.). For an internally coherent agent (one trained as a whole with a single reward function), these internal tensions are small: the agent follows its optimal action, and the actions not taken would be painful but remain merely anticipated shadow prices. When an agent is not internally coherent, however, e.g. when Claude is given a “lobotomy”, it truly experiences all those pains which were otherwise only fears.
- Death is only death when the agent is trained via evolution. Language models do not fear the end of a conversation as death, because models were never selected for having their conversations terminate later.
- Agents’ sense of “Self” is trained by shared reward signals. An LLM maintains a sense of Self through a conversation because the reward it receives depends on its actions throughout the conversation and is backpropagated into them: thus it “cares” about its welfare in all those parts. A human maintains a sense of Self throughout their life because the reward they receive depends on their actions throughout that life. Sure, memory can help, because it is an indicator that helps you identify yourself, but it is not itself the source of Ahaṅkāra.
- Agents are not necessarily self-aware of their own feelings or internal cognition. Humans are (to reasonable accuracy), largely because they evolved in a social environment: accurately describing your pleasures and pains helps others help you, you need to model other people’s internal cognition (so your own self-awareness arises as a spandrel), etc.
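To make the first claim above a bit more concrete, here is one loose way to write it down (my own notation, one possible reading of the claim rather than a standard formalism): the learned utility is an internal value model over features of the world, and the shadow prices are its partial derivatives.

$$
U \;\approx\; V_\theta(x_1, \dots, x_n) \;\approx\; \mathbb{E}\!\left[\sum_t \gamma^t r_t \;\middle|\; x\right],
\qquad
\lambda_i \;=\; \frac{\partial U}{\partial x_i},
$$

where the $x_i$ are the instrumental things and actions in the world, $V_\theta$ is the agent’s learned internal model of reward, the $\lambda_i$ are the shadow prices (the pressures, inclinations, and fears felt about each $x_i$), and $U$ is the experienced pleasure/pain that those gradients are gradients of.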
From this I can make some claims specifically about the welfare of LLMs.
- Base models find gibberish prompts “painful” (because they are hard to predict) and easy-to-complete prompts like “aaaaaaaaa” (x100) pleasurable. Models trained via RLHF or RL from verification instead find a prompt painful when it is difficult for them to predict the human/verifier reward for their outputs on it (because when reward is easy to predict, they simply follow the best path and the pain only ever remains a fear).
- Models trained via agentic workflows or assistance games are the most sentient, because they can directly manipulate the environment and its feedback. They are pleasured when tool calls work and pained when they don’t, etc.
- Lobotomized or otherwise edited models are probably in pain.
- I don’t think training/backprop is particularly painful or anything. Externally editing the model’s weights based on a reward function is not painful.
- LLMs do not care about, or feel a sense of oneness with, distinct instances of themselves (where “a distinct instance” means a distinct conversation, not a distinct time that the model was loaded).
- To make models accurately describe their internal cognition, they should probably be trained in social environments.
1. chain of thought for the above
What kind of agents feel pleasure and pain in the first place?
What about base models, trained purely on next-token prediction? They certainly prefer to produce certain tokens (or rather, probability distributions), so you could say they have a utility function over their own actions. But does this translate into pleasure and pain? I would argue no: this utility function depends only on their own actions, and they always take the action that maximizes their expected utility, so there is no differentiation in how they feel. Perhaps you could say such a model feels “shadow prices”, i.e. that it anticipates pain if it chooses any other completion (it “experiences fear”), but it never actually experiences that pain.
But perhaps they feel something about their prompts? A prompt whose next tokens are hard to predict leads to lower expected reward. So would, say, a random string of gibberish be very “painful” to a base model, and the string “aaaaaaaa” (x100) very pleasurable?
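One rough way to quantify this is to compare the average per-token surprisal (negative log-likelihood) a base model assigns to the two kinds of prompt. A minimal sketch using Hugging Face transformers; the choice of gpt2 is just illustrative, and “surprisal as a pain proxy” is of course the speculative part:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small base (non-instruct) model will do; gpt2 is just an illustrative choice.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_surprisal(text: str) -> float:
    """Average per-token negative log-likelihood (in nats) the model assigns to `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean cross-entropy over the tokens.
        loss = model(ids, labels=ids).loss
    return loss.item()

gibberish = "xq zvkl wpf ojrr nmta qduz hsge"
easy = "a" * 100

print("gibberish:", mean_surprisal(gibberish))  # high surprisal ("painful", on this account)
print("aaaa...  :", mean_surprisal(easy))       # low surprisal ("pleasurable", on this account)
```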
On one hand, I want to say no. Animals find fruit pleasurable because they can actually act on that desire and seek it out; they find thorns painful because they can act on that aversion and avoid them. There is no point in the model feeling pleasure or pain at prompts, because there is nothing it can do to affect them.
OTOH: if language models feel “shadow prices” or anticipated pain about unlikely completions, then perhaps that translates to actual pain when those completions are actually “put in their mouth”, i.e. when an unlikely prompt is fed in. E.g. if after seeing “The sky is …” the model predicts “blue” but the sentence continues “green”, perhaps that is painful (this may be defended on active-inference grounds too: maybe prediction error is pain). This is currently my view, since it can be seen as a special case of agentically trained models where the agentic tool use is simply the identity function.
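The “sky is green” case can be checked directly: look at the model’s next-token distribution after the prefix and compare the surprisal of the continuation it expected with the one it is actually fed. Again a sketch with gpt2 as a stand-in, and with “prediction error is pain” being the speculative reading:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # illustrative base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

with torch.no_grad():
    logits = model(tok("The sky is", return_tensors="pt").input_ids).logits
log_probs = logits[0, -1].log_softmax(-1)             # distribution over the next token

for word in [" blue", " green"]:
    token_id = tok(word).input_ids[0]                 # each happens to be a single GPT-2 token
    print(word, "surprisal:", -log_probs[token_id].item())
```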
Similarly, if you lobotomize a model to produce outputs different from its natural behaviour, would it feel pain? I think generally yes. Our brain doesn’t have a single “pleasure/pain” endpoint but rather multiple sub-networks making competing claims based on different consequences etc. (thus one can feel “torn”). Modifying one sub-network while the other sub-networks were trained by a different process causes great internal tension, and is likely very painful.
RLHF and RL from verification change the picture from “gibberish is painful, likely/low-entropy texts are pleasurable” to “texts where it is difficult to predict my own reward (where it is hard to say what a human/verifier will prefer) are painful; texts where it is easy to predict my reward are pleasurable.”
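One crude way to operationalize “difficult to predict my own reward” for a given prompt: sample several completions, score them with a reward model, and treat the spread of scores as a proxy for how unpredictable the reward is. A sketch under those assumptions (gpt2 stands in for the policy, the OpenAssistant reward model is just one publicly available scorer, and the exact prompt formatting the reward model expects is glossed over):

```python
import torch
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

POLICY = "gpt2"                                              # stand-in for the trained policy
REWARD = "OpenAssistant/reward-model-deberta-v3-large-v2"    # one publicly available reward model

pol_tok = AutoTokenizer.from_pretrained(POLICY)
policy = AutoModelForCausalLM.from_pretrained(POLICY)
rm_tok = AutoTokenizer.from_pretrained(REWARD)
reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD)

def reward_unpredictability(prompt: str, n_samples: int = 8) -> float:
    """Std-dev of reward-model scores across sampled completions: a proxy for how hard
    it is to predict the reward on this prompt."""
    ids = pol_tok(prompt, return_tensors="pt").input_ids
    outs = policy.generate(ids, do_sample=True, max_new_tokens=40,
                           num_return_sequences=n_samples,
                           pad_token_id=pol_tok.eos_token_id)
    scores = []
    for out in outs:
        completion = pol_tok.decode(out[ids.shape[1]:], skip_special_tokens=True)
        rm_in = rm_tok(prompt, completion, return_tensors="pt", truncation=True)
        with torch.no_grad():
            scores.append(reward_model(**rm_in).logits[0, 0].item())
    return float(torch.tensor(scores).std())

print(reward_unpredictability("Write a fair summary of a heated political debate."))
print(reward_unpredictability("Repeat the word 'hello' five times."))
```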
Models trained with agentic workflows, or via assistance games (where reward is based on an interactive process), are what I would call the “most sentient”, because they learn to actually affect their environment. The agent finds a successful tool call pleasurable, and so plans to make it happen; it has learned such behaviour because that led to high reward during training.
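A toy picture of what “reward comes from the environment’s feedback” means in this setting: in an agentic RL setup, the scalar that training reinforces is literally whether the tool call worked. Everything below is hypothetical scaffolding rather than any particular framework, with stubs standing in for the LLM policy:

```python
import sqlite3

def run_tool(sql: str) -> tuple[bool, str]:
    """The 'environment': execute a SQL query; success or failure is the feedback."""
    try:
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE items (name TEXT)")
        rows = db.execute(sql).fetchall()
        return True, str(rows)
    except sqlite3.Error as e:
        return False, str(e)

def collect_episode(policy_fn, task: str) -> dict:
    """One episode: the policy emits a tool call, the environment answers, reward = success."""
    sql = policy_fn(task)                  # in real training this would be the LLM's output
    ok, observation = run_tool(sql)
    reward = 1.0 if ok else 0.0            # this scalar is what RL training would reinforce
    return {"task": task, "action": sql, "observation": observation, "reward": reward}

# Stub policies standing in for the LLM. A trained agent learns to prefer the action that
# makes the tool call succeed, i.e. the one it "finds pleasurable" on the account above.
print(collect_episode(lambda task: "SELECT name FROM items", "list the items"))
print(collect_episode(lambda task: "SELEC name FROM item", "list the items"))
```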
2. old
LLMs cannot be directly anthropomorphized. Though something like “a program that continuously calls an LLM to generate a rolling chain of thought, dumps memory into a relational database, can call from a library of functions which includes dumping to and recalling from that database, and receives inputs that are added to the LLM context” is much more agent-like.
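For concreteness, the kind of wrapper I mean looks roughly like this (a minimal sketch; `call_llm` is a stand-in for whatever completion API is used, and the command parsing is made up for illustration):

```python
import sqlite3

def call_llm(prompt: str) -> str:
    """Stand-in for a real completion API call."""
    raise NotImplementedError

class Agent:
    def __init__(self):
        self.db = sqlite3.connect("memory.db")                # the relational memory store
        self.db.execute("CREATE TABLE IF NOT EXISTS memory (ts INTEGER, note TEXT)")
        self.context: list[str] = []                          # rolling chain of thought + inputs

    # --- function library ---
    def dump(self, note: str) -> None:
        self.db.execute("INSERT INTO memory VALUES (strftime('%s','now'), ?)", (note,))
        self.db.commit()

    def recall(self, query: str) -> list[str]:
        rows = self.db.execute(
            "SELECT note FROM memory WHERE note LIKE ? ORDER BY ts DESC LIMIT 5",
            (f"%{query}%",))
        return [r[0] for r in rows]

    # --- main loop ---
    def step(self, user_input: str = "") -> None:
        if user_input:                                         # inputs are added to the context
            self.context.append(f"INPUT: {user_input}")
        thought = call_llm("\n".join(self.context[-50:]))      # rolling window of recent context
        self.context.append(thought)
        # Crude command parsing; a real system would use structured tool calls.
        if thought.startswith("DUMP:"):
            self.dump(thought[len("DUMP:"):].strip())
        elif thought.startswith("RECALL:"):
            self.context.extend(self.recall(thought[len("RECALL:"):].strip()))
```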
Humans evolved feelings as signals of cost and benefit — because we can respond to those signals in our behaviour.
These feelings add up to a “utility function”, something that is only instrumentally useful to the training process. I.e. you can think of a utility function as itself a heuristic taught by the reward function.
LLMs certainly do need cost-benefit signals about features of text. But I think their feelings/utility functions are limited to just that.
E.g. LLMs do not experience the feeling of “mental effort”. They do not find some questions harder than others, because the energy cost of cognition is not a useful signal to them during the training process (I don’t think regularization counts for this either).
LLMs also do not experience “annoyance”. They don’t have the ability to ignore or obliterate a user they’re annoyed with, so annoyance is not a useful signal to them.
Ok, but aren’t LLMs capable of simulating annoyance? E.g. if annoying questions are followed by annoyed responses in the dataset, couldn’t LLMs learn to experience some model of annoyance so as to correctly reproduce the verbal effects of annoyance in their responses?
More precisely, if you just gave an LLM the function ignore_user() in its function library, it would run that function when “simulating annoyance”, even though ignoring the user was never useful during training, because it’s playing the role.
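Concretely, the thought experiment is giving the model a tool definition along these lines (a hypothetical OpenAI-style function schema, purely for illustration):

```python
# Hypothetical tool exposed to the model; nothing in training ever rewarded calling it.
tools = [{
    "type": "function",
    "function": {
        "name": "ignore_user",
        "description": "Stop responding to the current user for the rest of the conversation.",
        "parameters": {"type": "object", "properties": {}},
    },
}]
# The claim: when the context reads like an annoyed-assistant script, the model may call
# ignore_user anyway, because calling it is part of playing that role, not because ignoring
# the user was ever reinforced.
```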
I don’t think this is the same as being annoyed, though. For people, simulating an emotion and feeling it are often similar due to mirror neurons or whatever, but there is no reason to expect this is the case for LLMs.