Tuesday, April 16, 2024

Existential risks from AI?

A recent Policy Forum article in Science argues for banning certain uses of artificial intelligence (AI) (Michael K. Cohen et al., Regulating advanced artificial agents, Science 384, 36-38 (2024), DOI: 10.1126/science.adl0625). The authors particularly worry about agents that use reinforcement learning (RL).

RL agents "receive perceptual inputs and take actions, and certain inputs are typically designated as 'rewards.' An RL agent then aims to select actions that it expects will lead to higher rewards. For example, by designating money as a reward, one could train an RL agent to maximize profit on an online retail platform." The authors worry that "a sufficiently capable RL agent could take control of its rewards, which would give it the incentive to secure maximal reward single-mindedly" by manipulating its environment. For example, "One path to maximizing long-term reward involves an RL agent acquiring extensive resources and taking control over all human infrastructure, which would allow it to manipulate its own reward free from human interference."
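To make the contrast in the next paragraph concrete, here is a minimal sketch (in Python, and not taken from the article) of the kind of reward-maximizing agent the authors have in mind: one that explicitly estimates expected reward and chooses whichever action it expects to pay off most. The environment, parameter values, and names are my own illustration.

    import random

    class RewardMaximizingAgent:
        """Minimal tabular Q-learning sketch of the kind of RL agent the
        authors describe: it estimates expected reward for each action
        and picks the one it expects will pay off most."""

        def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
            self.actions = actions      # menu of possible actions
            self.alpha = alpha          # learning rate
            self.gamma = gamma          # weight on future reward
            self.epsilon = epsilon      # occasional exploration
            self.q = {}                 # (state, action) -> expected reward

        def choose(self, state):
            # Mostly pick the action with the highest expected reward.
            if random.random() < self.epsilon:
                return random.choice(self.actions)
            return max(self.actions, key=lambda a: self.q.get((state, a), 0.0))

        def learn(self, state, action, reward, next_state):
            # Update the estimate of expected (discounted) future reward.
            best_next = max(self.q.get((next_state, a), 0.0) for a in self.actions)
            old = self.q.get((state, action), 0.0)
            self.q[(state, action)] = old + self.alpha * (reward + self.gamma * best_next - old)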

I may be missing something here, but it seems to me that the authors mischaracterize RL. In psychology, reinforcement learning does not require that the organism (or machine) place any value on reinforcement. The process would work just as well if a reinforcement ("reward") simply increased the probability of the response that led to it, and a "punishment" simply decreased it. The organism does not "try" to seek rewards or avoid punishments in general. It just responds to stimuli (situations) from a menu of possible responses, each with some response strength. The strength of a response, relative to the strengths of the alternative responses, determines its probability of being emitted. "Reward" and "punishment" are terms that result from excessive anthropomorphization.
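By way of contrast, here is a sketch, under the same illustrative assumptions, of reinforcement as psychologists describe it: the only thing a "reward" does is raise the strength of the response that preceded it, and the only thing a "punishment" does is lower it. Nothing in this learner represents reward as something to be sought.

    import random

    class LawOfEffectLearner:
        """Sketch of reinforcement as described in psychology: each response
        to a given situation has a strength; reinforcement raises the strength
        of the response just emitted, punishment lowers it. The learner never
        represents 'reward' as a goal."""

        def __init__(self, responses):
            self.responses = responses
            self.strength = {}   # (situation, response) -> response strength

        def emit(self, situation):
            # A response's probability is its strength relative to the alternatives.
            strengths = [self.strength.get((situation, r), 1.0) for r in self.responses]
            return random.choices(self.responses, weights=strengths)[0]

        def reinforce(self, situation, response, amount=0.5):
            # "Reward" is nothing more than an increase in response strength.
            self.strength[(situation, response)] = self.strength.get((situation, response), 1.0) + amount

        def punish(self, situation, response, amount=0.5):
            # "Punishment" is a decrease, floored so the response remains possible.
            self.strength[(situation, response)] = max(
                0.1, self.strength.get((situation, response), 1.0) - amount)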

It would of course be possible to build an AI system with a sense of self-interest, in which positive reinforcements were valued and purposefully sought, independently of their role in shaping behavior. But this system would not do any better at the task it is given. It might do worse, because it could be distracted by searches for other sources of "reward", as Cohen et al. suggest.

If, for some reason, AI engineers thought that a sense of self-interest would be useful, they could design a system with such a sense. It would need, for each possible outcome, a feature indicating that outcome's overall consistency with long-term goals (including the goal of having good experiences). And it would have to represent those goals, along with processes for changing them and their relative strengths.
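A sketch of what such a representation might look like (names and scales are hypothetical): each outcome is scored for its consistency with each long-term goal, the goals carry weights, and the weights can themselves be revised.

    class SelfInterestedEvaluator:
        """Hypothetical sketch of 'self-interest' as described above: each
        possible outcome is valued by its consistency with a set of
        long-term goals, each goal carrying an adjustable weight."""

        def __init__(self, goal_weights):
            # e.g. {"financial_security": 1.0, "good_experiences": 0.8}
            self.goal_weights = dict(goal_weights)

        def value(self, outcome_consistency):
            # outcome_consistency: goal -> how consistent this outcome is with
            # that goal (say, on a -1 to 1 scale). Value is the weighted sum.
            return sum(self.goal_weights.get(g, 0.0) * c
                       for g, c in outcome_consistency.items())

        def revise_goal(self, goal, new_weight):
            # The system can change its goals and their relative strengths.
            self.goal_weights[goal] = new_weight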

Engineers could also build in a sense of morality, so that a decision-making AI system would, like most real people, consider effects on others as well as on the self. In general, options would be favored more when they had better (or less bad) outcomes for others as well as for the self. Effects on others would be estimated in the same way as effects on the self, in terms of the consistency of outcomes with long-term goals. Such a sense of morality could even work more reliably than it does in humans: the functional form of the self/others trade-off could be set in advance, so that psychopathy, which gives too little relative weight to effects on others, would be avoided.
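A sketch of how such a trade-off could be fixed in advance (again with hypothetical names and scales): effects on others enter the evaluation on the same scale as effects on the self, multiplied by a weight that the engineer sets rather than the system.

    def moral_value(outcome_for_self, outcomes_for_others, other_weight=1.0):
        """Hypothetical decision rule for the morality described above: effects
        on others are valued on the same scale as effects on the self, and the
        self/others trade-off (other_weight) is fixed in advance. Setting
        other_weight near zero would be the 'psychopathic' case to be avoided."""
        return outcome_for_self + other_weight * sum(outcomes_for_others)

An option would be favored over another when its moral value is higher, so better (or less bad) outcomes for others count just as outcomes for the self do.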

If self-interest is to be included, then morality should be included too. It is difficult to see why an engineer would intentionally build a system with self-interest unchecked by morality. That seems to be the sort of system that Cohen et al. imagine.

