
Reinforcement Learning: A Survey

by L P Kaelbling, M L Littman, A W Moore
Journal of Artificial Intelligence Research ()

Abstract

This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.


Available from arxiv.org

© 1996 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.

1. Introduction

Reinforcement learning dates back to the early days of cybernetics and work in statistics, psychology, neuroscience, and computer science. In the last five to ten years, it has attracted rapidly increasing interest in the machine learning and artificial intelligence communities. Its promise is beguiling: a way of programming agents by reward and punishment without needing to specify how the task is to be achieved. But there are formidable computational obstacles to fulfilling the promise.

This paper surveys the historical basis of reinforcement learning and some of the current work from a computer science perspective. We give a high-level overview of the field and a taste of some specific approaches. It is, of course, impossible to mention all of the important work in the field; this should not be taken to be an exhaustive account.

Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. The work described here has a strong family resemblance to eponymous work in psychology, but differs considerably in the details and in the use of the word "reinforcement." It is appropriately thought of as a class of problems, rather than as a set of techniques.

There are two main strategies for solving reinforcement-learning problems. The first is to search in the space of behaviors in order to find one that performs well in the environment. This approach has been taken by work in genetic algorithms and genetic programming,
as well as some more novel search techniques (Schmidhuber, 1996). The second is to use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world. This paper is devoted almost entirely to the second set of techniques because they take advantage of the special structure of reinforcement-learning problems that is not available in optimization problems in general. It is not yet clear which set of approaches is best in which circumstances.

The rest of this section is devoted to establishing notation and describing the basic reinforcement-learning model. Section 2 explains the trade-off between exploration and exploitation and presents some solutions to the most basic case of reinforcement-learning problems, in which we want to maximize the immediate reward. Section 3 considers the more general problem in which rewards can be delayed in time from the actions that were crucial to gaining them. Section 4 considers some classic model-free algorithms for reinforcement learning from delayed reward: adaptive heuristic critic, TD($\lambda$) and Q-learning. Section 5 demonstrates a continuum of algorithms that are sensitive to the amount of computation an agent can perform between actual steps of action in the environment. Generalization, the cornerstone of mainstream machine learning research, has the potential of considerably aiding reinforcement learning, as described in Section 6. Section 7 considers the problems that arise when the agent does not have complete perceptual access to the state of the environment. Section 8 catalogs some of reinforcement learning's successful applications. Finally, Section 9 concludes with some speculations about important open problems and the future of reinforcement learning.

1.1 Reinforcement-Learning Model

In the standard reinforcement-learning model, an agent is connected to its environment via perception and action, as depicted in Figure 1. On each step of interaction the agent receives as input, i, some indication of the current state, s, of the environment; the agent then chooses an action, a, to generate as output. The action changes the state of the environment, and the value of this state transition is communicated to the agent through a scalar reinforcement signal, r. The agent's behavior, B, should choose actions that tend to increase the long-run sum of values of the reinforcement signal. It can learn to do this over time by systematic trial and error, guided by a wide variety of algorithms that are the subject of later sections of this paper.

[Figure 1: The standard reinforcement-learning model.]
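The interaction loop of Section 1.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `run_interaction`, `toy_step`, `always_one`, and `always_zero` are hypothetical, the two-state toy world is invented, and the input i is assumed to reveal the state s exactly.

```python
from typing import Callable, Tuple

State = int
Action = int


def run_interaction(
    step: Callable[[State, Action], Tuple[State, float]],
    behavior: Callable[[State], Action],
    start_state: State,
    n_steps: int,
) -> float:
    """Run the perceive-act loop and return the summed reinforcement."""
    s = start_state
    total = 0.0
    for _ in range(n_steps):
        i = s              # assumption: the input i reveals the state s exactly
        a = behavior(i)    # the behavior B chooses an action from the input
        s, r = step(s, a)  # the action changes the state and yields reinforcement r
        total += r
    return total


def toy_step(s: State, a: Action) -> Tuple[State, float]:
    """Hypothetical two-state world: the action names the next state; state 1 pays 1."""
    next_s = a
    return next_s, (1.0 if next_s == 1 else 0.0)


def always_one(i: State) -> Action:
    """A fixed behavior that always moves to state 1."""
    return 1


def always_zero(i: State) -> Action:
    """A fixed behavior that always moves to state 0."""
    return 0


total_good = run_interaction(toy_step, always_one, start_state=0, n_steps=10)
total_bad = run_interaction(toy_step, always_zero, start_state=0, n_steps=10)
print(total_good, total_bad)
```

The two fixed behaviors collect different long-run sums of reinforcement in the same environment, which is the quantity a learning agent would try to increase.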
[...] with the following example dialogue.

Environment: You are in state 65. You have 4 possible actions.
Agent: I'll take action 2.
Environment: You received a reinforcement of 7 units. You are now in state 15. You have 2 possible actions.
Agent: I'll take action 1.
Environment: You received a reinforcement of -4 units. You are now in state 65. You have 4 possible actions.
Agent: I'll take action 2.
Environment: You received a reinforcement of 5 units. You are now in state 44. You have 5 possible actions.
...

The agent's job is to find a policy $\pi$, mapping states to actions, that maximizes some long-run measure of reinforcement. We expect, in general, that the environment will be non-deterministic; that is, that taking the same action in the same state on two different occasions may result in different next states and/or different reinforcement values. This happens in our example above: from state 65, applying action 2 produces differing reinforcements and differing states on two occasions. However, we assume the environment is stationary; that is, that the probabilities of making state transitions or receiving specific reinforcement signals do not change over time.[1]

Reinforcement learning differs from the more widely studied problem of supervised learning in several ways. The most important difference is that there is no presentation of input/output pairs. Instead, after choosing an action the agent is told the immediate reward and the subsequent state, but is not told which action would have been in its best long-term interests. It is necessary for the agent to gather useful experience about the possible system states, actions, transitions and rewards actively to act optimally. Another difference from supervised learning is that on-line performance is important: the evaluation of the system is often concurrent with learning.

[1] This assumption may be disappointing; after all, operation in non-stationary environments is one of the motivations for building learning systems. In fact, many of the algorithms described in later sections are effective in slowly-varying non-stationary environments, but there is very little theoretical analysis in this area.
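The distinction between a non-deterministic and a stationary environment can be made concrete with a small simulation. The sketch below is an illustration only: the next states and reinforcements echo the dialogue (state 65, action 2), but the 50/50 probabilities and the names `OUTCOMES` and `step` are invented for the example.

```python
import random
from collections import Counter

# Hypothetical outcome distribution for taking action 2 in state 65.
# Each entry is (probability, next_state, reinforcement). The states and
# reinforcements echo the dialogue; the probabilities are made up.
OUTCOMES = [(0.5, 15, 7.0), (0.5, 44, 5.0)]


def step(rng: random.Random):
    """Sample one outcome. The distribution never changes, so the
    environment is stationary even though individual outcomes differ."""
    u = rng.random()
    cumulative = 0.0
    for p, next_state, r in OUTCOMES:
        cumulative += p
        if u < cumulative:
            return next_state, r
    # Guard against floating-point round-off in the cumulative sum.
    return OUTCOMES[-1][1], OUTCOMES[-1][2]


rng = random.Random(0)
first_half = Counter(step(rng) for _ in range(5000))
second_half = Counter(step(rng) for _ in range(5000))

# Same state and action, different outcomes: non-determinism.
print(sorted(first_half))
# Similar empirical frequencies early and late: stationarity.
print(first_half[(15, 7.0)] / 5000, second_half[(15, 7.0)] / 5000)
```

Repeating the same action from the same state yields both outcomes, while the observed frequencies stay close to the fixed underlying probabilities in both halves of the run.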
1.2 Models of Optimal Behavior

Before we can start thinking about algorithms for learning to behave optimally, we have to decide what our model of optimality will be. In particular, we have to specify how the agent should take the future into account in making decisions about how to behave now. There are three models that have been the subject of the majority of work in this area.

The finite-horizon model is the easiest to think about; at a given moment in time, the agent should optimize its expected reward for the next h steps:

$$E\left(\sum_{t=0}^{h} r_t\right);$$

it need not worry about what will happen after that. In this and subsequent expressions, $r_t$ represents the scalar reward received t steps into the future. This model can be used in two ways. In the first, the agent will have a non-stationary policy; that is, one that changes over time. On its first step it will take what is termed an h-step optimal action. This is defined to be the best action available given that it has h steps remaining in which to act and gain reinforcement. On the next step it will take an (h-1)-step optimal action, and so on, until it finally takes a 1-step optimal action and terminates. In the second, the agent does receding-horizon control, in which it always takes the h-step optimal action. The agent always acts according to the same policy, but the value of h limits how far ahead it looks in choosing its actions. The finite-horizon model is not always appropriate. In many cases we may not know the precise length of the agent's life in advance.

The infinite-horizon discounted model takes the long-run reward of the agent into account, but rewards that are received in the future are geometrically discounted according to discount factor $\gamma$ (where $0 \le \gamma < 1$):

$$E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right).$$

We can interpret $\gamma$ in several ways. It can be seen as an interest rate, a probability of living another step, or as a mathematical trick to bound the infinite sum. The model is conceptually similar to receding-horizon control, but the discounted model is more mathematically tractable than the finite-horizon model. This is a dominant reason for the wide attention this model has received.
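The role of the discount factor as a way to bound the infinite sum can be illustrated numerically. The following is a minimal sketch, assuming a known reward sequence rather than an expectation; the function names `discounted_return` and `discounted_bound` are hypothetical. If every reward satisfies $|r_t| \le r_{max}$, the geometric series gives the bound $r_{max} / (1 - \gamma)$.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return the sum over t of gamma**t * rewards[t]."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    total = 0.0
    # Work backwards so each step applies total = r_t + gamma * total.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def discounted_bound(r_max: float, gamma: float) -> float:
    """Upper bound on the infinite discounted sum when |r_t| <= r_max."""
    return r_max / (1.0 - gamma)


# A long run of unit rewards approaches, but never exceeds, the bound.
value = discounted_return([1.0] * 200, gamma=0.9)
bound = discounted_bound(1.0, gamma=0.9)
print(value, bound)

# A short hand-checkable case: 1 + 0.5 * 2 + 0.25 * 3 = 2.75.
small = discounted_return([1.0, 2.0, 3.0], gamma=0.5)
print(small)
```

However long the reward sequence grows, the discounted sum stays below the bound, which is what makes the infinite-horizon criterion well defined.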
[...] long-run average performance. It is possible to generalize this model so that it takes into account both the long-run average and the amount of initial reward that can be gained. In the generalized, bias-optimal model, a policy is preferred if it maximizes the long-run average and ties are broken by the initial extra reward.

Figure 2 contrasts these models of optimality by providing an environment in which changing the model of optimality changes the optimal policy. In this example, circles represent the states of the environment and arrows are state transitions. There is only a single action choice from every state except the start state, which is in the upper left and marked with an incoming arrow. All rewards are zero except where marked. Under a finite-horizon model with h = 5, the three actions yield rewards of +6.0, +0.0, and +0.0, so the first action should be chosen; under an infinite-horizon discounted model with $\gamma = 0.9$, the three choices yield +16.2, +59.0, and +58.5, so the second action should be chosen; and under the average reward model, the third action should be chosen since it leads to an average reward of +11. If we change h to 1000 and $\gamma$ to 0.2, then the second action is optimal for the finite-horizon model and the first for the infinite-horizon discounted model; however, the average reward model will always prefer the best long-term average. Since the choice of optimality model and parameters matters so much, it is important to choose it carefully in any application.

The finite-horizon model is appropriate when the agent's lifetime is known; one important aspect of this model is that as the length of the remaining lifetime decreases, the agent's policy may change. A system with a hard deadline would be appropriately modeled this way. The relative usefulness of infinite-horizon discounted and bias-optimal models is still under debate. Bias-optimality has the advantage of not requiring a discount parameter; however, algorithms for finding bias-optimal policies are not yet as well-understood as those for finding optimal infinite-horizon discounted policies.

1.3 Measuring Learning Performance

The criteria given in the previous section can be used to assess the policies learned by a given algorithm. We would also like to be able to evaluate the quality of learning itself. There are several incompatible measures in use.

  • Eventual convergence to optimal. Many algorithms come with a provable guarantee of asymptotic convergence to optimal behavior (Watkins & Dayan, 1992). This is reassuring, but useless in practical terms. An agent that quickly reaches
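The Figure 2 values quoted in Section 1.2 for h = 5 and $\gamma = 0.9$ can be checked with a short calculation. The sketch below rests on assumptions that are not taken from the figure itself: each action is modeled as a number of zero-reward steps followed by a fixed reward on every later step, with (delay, reward) pairs of (2, +2), (5, +10), and (6, +11) inferred from the quoted numbers, and h = 5 is read as five rewards, $r_0$ through $r_4$. All names are hypothetical.

```python
# Assumed structure, inferred from the quoted values rather than the figure:
# after `delay` zero-reward steps, the agent receives `reward` on every step.
ACTIONS = {"first": (2, 2.0), "second": (5, 10.0), "third": (6, 11.0)}


def finite_horizon(delay: int, reward: float, h: int) -> float:
    """Sum of the rewards r_0 .. r_{h-1} (assumed reading of 'next h steps')."""
    return sum(reward if t >= delay else 0.0 for t in range(h))


def discounted(delay: int, reward: float, gamma: float) -> float:
    """Closed form of the sum over t >= delay of gamma**t * reward."""
    return reward * gamma ** delay / (1.0 - gamma)


def average_reward(delay: int, reward: float) -> float:
    """A finite prefix of zeros does not affect the long-run average."""
    return reward


finite = {k: finite_horizon(d, r, h=5) for k, (d, r) in ACTIONS.items()}
disc = {k: round(discounted(d, r, gamma=0.9), 1) for k, (d, r) in ACTIONS.items()}
avg = {k: average_reward(d, r) for k, (d, r) in ACTIONS.items()}

best_finite = max(finite, key=finite.get)
best_disc = max(disc, key=disc.get)
best_avg = max(avg, key=avg.get)
# With gamma = 0.2 the earliest reward dominates the discounted sum.
best_disc_low_gamma = max(ACTIONS, key=lambda k: discounted(*ACTIONS[k], gamma=0.2))

print(finite, disc, avg)
print(best_finite, best_disc, best_avg, best_disc_low_gamma)
```

Under these assumptions the three criteria pick the first, second, and third action respectively, and lowering $\gamma$ to 0.2 moves the discounted choice to the first action.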

Readership Statistics

410 Readers on Mendeley
by Academic Status

  • 44% Ph.D. Student
  • 15% Student (Master)
  • 8% Post Doc

by Country

  • 24% United States
  • 10% Germany
  • 10% United Kingdom
