Towards AGI Agent Safety by Iteratively Improving the Utility Function

Abstract

While it is still unclear if agents with Artificial General Intelligence (AGI) could ever be built, we can already use mathematical models to investigate potential safety systems for these agents. We present work on an AGI safety layer that creates a special dedicated input terminal to support the iterative improvement of an AGI agent’s utility function. The humans who switched on the agent can use this terminal to close any loopholes that are discovered in the utility function’s encoding of agent goals and constraints, to direct the agent towards new goals, or to force the agent to switch itself off. An AGI agent may develop the emergent incentive to manipulate the above utility function improvement process, for example by deceiving, restraining, or even attacking the humans involved. The safety layer will partially, and sometimes fully, suppress this dangerous incentive. This paper generalizes earlier work on AGI emergency stop buttons. We aim to make the mathematical methods used to construct the layer more accessible, by applying them to an MDP model. We discuss two provable properties of the safety layer, identify still-open issues, and present ongoing work to map the layer to a Causal Influence Diagram (CID).
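
As a rough illustration of the iterative-improvement idea described above, the toy Python snippet below models an MDP-style agent whose utility function can be swapped out at run time through a dedicated update call. This is a minimal sketch under simplifying assumptions, not the construction from the paper: names such as `Agent` and `update_utility` are invented for illustration, and the safety layer's machinery for suppressing the agent's incentive to manipulate the update process is omitted.

```python
# Illustrative sketch only (not the paper's actual construction): a toy MDP agent
# whose utility function can be replaced at run time through a dedicated
# "input terminal" call. The class and method names here are assumptions.

from typing import Callable, Dict

State = str
Action = str
Utility = Callable[[State], float]


class Agent:
    """Greedy one-step agent over deterministic toy dynamics, with a swappable utility."""

    def __init__(self, transitions: Dict[State, Dict[Action, State]],
                 utility: Utility) -> None:
        self.transitions = transitions   # state -> {action -> successor state}
        self.utility = utility           # current utility function U_i

    def act(self, state: State) -> Action:
        # Choose the action whose successor state scores highest under the
        # *current* utility function.
        options = self.transitions[state]
        return max(options, key=lambda a: self.utility(options[a]))

    def update_utility(self, new_utility: Utility) -> None:
        # The "input terminal": humans replace U_i with an improved U_{i+1}.
        # In the paper, the safety layer adds extra terms so the agent has no
        # incentive to block or force this call; that machinery is omitted here.
        self.utility = new_utility


if __name__ == "__main__":
    transitions = {"start": {"left": "a", "right": "b"}}
    u1 = lambda s: {"a": 1.0, "b": 0.0}.get(s, 0.0)
    u2 = lambda s: {"a": 0.0, "b": 1.0}.get(s, 0.0)   # patched utility closing a "loophole"

    agent = Agent(transitions, u1)
    print(agent.act("start"))      # "left" under u1
    agent.update_utility(u2)       # iterative improvement via the terminal
    print(agent.act("start"))      # "right" under u2
```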

Cite

Holtman, K. (2020). Towards AGI agent safety by iteratively improving the utility function. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12177 LNAI, pp. 205–215). Springer. https://doi.org/10.1007/978-3-030-52152-3_21
