Reinforcement Learning: A Survey
Abstract

This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.

1. Introduction

Reinforcement learning dates back to the early days of cybernetics and work in statistics, psychology, neuroscience, and computer science. In the last five to ten years, it has attracted rapidly increasing interest in the machine learning and artificial intelligence communities. Its promise is beguiling: a way of programming agents by reward and punishment without needing to specify how the task is to be achieved. But there are formidable computational obstacles to fulfilling the promise.

This paper surveys the historical basis of reinforcement learning and some of the current work from a computer science perspective. It gives a high-level overview of the field and a taste of some specific approaches. It is, of course, impossible to mention all of the important work in the field; the coverage here should not be taken as an exhaustive account.

Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. The work described here has a strong family resemblance to eponymous work in psychology, but differs considerably in the details and in the use of the word "reinforcement." It is appropriately thought of as a class of problems, rather than as a set of techniques.

There are two main strategies for solving reinforcement-learning problems. The first is to search in the space of behaviors in order to find one that performs well in the environment. This approach has been taken by work in genetic algorithms and genetic programming,
as well as some more novel search techniques (Schmidhuber, 1996). The second is to use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world. This paper is devoted almost entirely to the second set of techniques, because they take advantage of the special structure of reinforcement-learning problems that is not available in optimization problems in general. It is not yet clear which set of approaches is best in which circumstances.

The rest of this section is devoted to establishing notation and describing the basic reinforcement-learning model. Section 2 explains the trade-off between exploration and exploitation and presents some solutions to the most basic case of reinforcement-learning problems, in which we want to maximize the immediate reward. Section 3 considers the more general problem in which rewards can be delayed in time from the actions that were crucial to gaining them. Section 4 considers some classic model-free algorithms for reinforcement learning from delayed reward: adaptive heuristic critic, TD(lambda), and Q-learning. Section 5 demonstrates a continuum of algorithms that are sensitive to the amount of computation an agent can perform between actual steps of action in the environment. Generalization, the cornerstone of mainstream machine-learning research, has the potential of considerably aiding reinforcement learning, as described in Section 6. Section 7 considers the problems that arise when the agent does not have complete perceptual access to the state of the environment. Section 8 catalogs some of reinforcement learning's successful applications. Finally, Section 9 concludes with some speculations about important open problems and the future of reinforcement learning.

1.1 Reinforcement-Learning Model

In the standard reinforcement-learning model, an agent is connected to its environment via perception and action, as depicted in Figure 1. On each step of interaction the agent receives as input, i, some indication of the current state, s, of the environment; the agent then chooses an action, a, to generate as output. The action changes the state of the environment, and the value of this state transition is communicated to the agent through a scalar reinforcement signal, r. The agent's behavior, B, should choose actions that tend to increase the long-run sum of values of the reinforcement signal. It can learn to do this over time by systematic trial and error, guided by a wide variety of algorithms that are the subject of later sections of this paper.
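To make this interaction loop concrete, the following minimal Python sketch shows one way the agent-environment cycle of Figure 1 could be written. The environment and agent objects and their method names (reset, step, choose_action, update) are illustrative assumptions, not interfaces defined by this survey.

# A minimal sketch of the standard reinforcement-learning interaction loop.
# The environment and agent objects and their methods are hypothetical.
def run_trial(environment, agent, num_steps):
    total_reinforcement = 0.0
    state = environment.reset()                    # initial indication i of the state s
    for _ in range(num_steps):
        action = agent.choose_action(state)        # behavior B selects an action a
        next_state, r = environment.step(action)   # scalar reinforcement signal r
        agent.update(state, action, r, next_state) # one trial-and-error learning step
        total_reinforcement += r                   # the long-run sum the agent should increase
        state = next_state
    return total_reinforcement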
The interaction between the agent and its environment can be illustrated with the following example dialogue.

Environment: You are in state 65. You have 4 possible actions.
Agent: I'll take action 2.
Environment: You received a reinforcement of 7 units. You are now in state 15. You have 2 possible actions.
Agent: I'll take action 1.
Environment: You received a reinforcement of -4 units. You are now in state 65. You have 4 possible actions.
Agent: I'll take action 2.
Environment: You received a reinforcement of 5 units. You are now in state 44. You have 5 possible actions.
...

The agent's job is to find a policy, mapping states to actions, that maximizes some long-run measure of reinforcement. We expect, in general, that the environment will be non-deterministic; that is, that taking the same action in the same state on two different occasions may result in different next states and/or different reinforcement values. This happens in our example above: from state 65, applying action 2 produces differing reinforcements and differing next states on the two occasions. However, we assume the environment is stationary; that is, that the probabilities of making state transitions or receiving specific reinforcement signals do not change over time.¹

Reinforcement learning differs from the more widely studied problem of supervised learning in several ways. The most important difference is that there is no presentation of input/output pairs. Instead, after choosing an action the agent is told the immediate reward and the subsequent state, but is not told which action would have been in its best long-term interests. It is necessary for the agent to gather useful experience about the possible states, actions, and rewards actively in order to act optimally. Another difference from supervised learning is that on-line performance is important: the evaluation of the system is often concurrent with learning.

1. This assumption may be disappointing; after all, operation in non-stationary environments is one of the motivations for building learning systems. In fact, many of the algorithms described in later sections are effective in slowly-varying non-stationary environments, but there is very little theoretical analysis in this area.
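The dialogue and the stationarity assumption above can be simulated with a small tabular environment. The sketch below is purely illustrative: the transition probabilities, reinforcements, and policy table are invented for this example, but they show how an environment can be non-deterministic (the outcome of an action is sampled) while remaining stationary (the outcome distributions themselves never change).

import random

# Hypothetical stationary, non-deterministic dynamics: for each (state, action)
# pair, a fixed list of (probability, next_state, reinforcement) outcomes.
TRANSITIONS = {
    (65, 2): [(0.5, 15, 7), (0.5, 44, 5)],  # same state and action, two possible outcomes
    (15, 1): [(1.0, 65, -4)],
    (44, 2): [(1.0, 65, 0)],
}

# A policy maps states to actions; here it is a fixed lookup table.
POLICY = {65: 2, 15: 1, 44: 2}

def step(state, action):
    """Sample the next state and reinforcement from the fixed outcome distribution."""
    outcomes = TRANSITIONS[(state, action)]
    draw = random.random()
    cumulative = 0.0
    for probability, next_state, reinforcement in outcomes:
        cumulative += probability
        if draw < cumulative:
            return next_state, reinforcement
    return outcomes[-1][1], outcomes[-1][2]  # guard against floating-point round-off

state, total = 65, 0
for _ in range(10):
    state, r = step(state, POLICY[state])
    total += r
print("sum of reinforcement over 10 steps:", total)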
1.2 Models of Optimal Behavior

Before we can start thinking about algorithms for learning to behave optimally, we have to decide what our model of optimality will be. In particular, we have to specify how the agent should take the future into account in the decisions it makes about how to behave now. There are three models that have been the subject of the majority of work in this area.

The finite-horizon model is the easiest to think about; in a given moment in time, the agent should optimize its expected reward for the next h steps:

E\left(\sum_{t=0}^{h} r_t\right) ;

it need not worry about what will happen after that. In this and subsequent expressions, r_t represents the scalar reward received t steps into the future. This model can be used in two ways. In the first, the agent will have a non-stationary policy; that is, one that changes over time. On its first step it will take what is termed an h-step optimal action. This is defined to be the best action available given that it has h steps remaining in which to act and gain reinforcement. On the next step it will take a (h-1)-step optimal action, and so on, until it finally takes a 1-step optimal action and terminates. In the second, the agent does receding-horizon control, in which it always takes the h-step optimal action. The agent always acts according to the same policy, but the value of h limits how far ahead it looks in choosing its actions. The finite-horizon model is not always appropriate: in many cases we may not know the precise length of the agent's life in advance.

The infinite-horizon discounted model takes the long-run reward of the agent into account, but rewards that are received in the future are geometrically discounted according to discount factor \gamma (where 0 \le \gamma < 1):

E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right) .

We can interpret \gamma in several ways. It can be seen as an interest rate, a probability of living another step, or a mathematical trick to bound the infinite sum. The infinite-horizon discounted model is conceptually similar to receding-horizon control, but is mathematically more tractable than the finite-horizon model. This is a dominant reason for the wide attention this model has received.
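The difference between these first two optimality models can be seen numerically with the following short sketch. The reward sequence is invented purely for illustration; the two functions simply evaluate the criteria above on one realized stream of rewards.

def finite_horizon_return(rewards, h):
    # sum_{t=0}^{h} r_t, evaluated on one realized reward sequence
    return sum(rewards[:h + 1])

def discounted_return(rewards, gamma):
    # sum_t gamma^t * r_t on the same sequence, with 0 <= gamma < 1
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical reward stream: a small payoff now versus larger payoffs later.
rewards = [2, 0, 0, 0, 0, 10, 10, 10, 10, 10]
print(finite_horizon_return(rewards, h=3))    # only the early payoff is visible
print(discounted_return(rewards, gamma=0.9))  # later payoffs count, shrunk by 0.9^t
print(discounted_return(rewards, gamma=0.2))  # heavy discounting favors the early payoff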
The third criterion is the average-reward model, in which the agent is supposed to take actions that optimize its long-run average reward:

\lim_{h \to \infty} E\left(\frac{1}{h}\sum_{t=0}^{h} r_t\right) .

One problem with this criterion is that it cannot distinguish between a policy that gains a large amount of reward in the initial phases and one that does not: reward gained on any initial prefix of the agent's life is eventually overshadowed by the long-run average performance. It is possible to generalize this model so that it takes into account both the long-run average and the amount of initial reward that can be gained. In the generalized, bias optimal model, a policy is preferred if it maximizes the long-run average and ties are broken by the initial extra reward.

Figure 2 contrasts these models of optimality by providing an environment in which changing the model of optimality changes the optimal policy. In this example, circles represent the states of the environment and arrows are state transitions. There is only a single action choice from every state except the start state, which is in the upper left and marked with an incoming arrow. All rewards are zero except where marked. Under a finite-horizon model with h = 5, the three actions yield rewards of +6.0, +0.0, and +0.0, so the first action should be chosen; under an infinite-horizon discounted model with \gamma = 0.9, the three choices yield +16.2, +59.0, and +58.5, so the second action should be chosen; and under the average-reward model, the third action should be chosen, since it leads to an average reward of +11. If we change h to 1000 and \gamma to 0.2, then the second action is optimal for the finite-horizon model and the first for the infinite-horizon discounted model; however, the average-reward model will always prefer the best long-term average. Since the choice of optimality model and its parameters matters so much, it is important to choose it carefully in any application.

The finite-horizon model is appropriate when the agent's lifetime is known; one important aspect of this model is that, as the length of the remaining lifetime decreases, the agent's policy may change. A system with a hard deadline would be appropriately modeled this way. The relative usefulness of infinite-horizon discounted and bias-optimal models is still under debate. Bias-optimality has the advantage of not requiring a discount parameter; however, algorithms for finding bias-optimal policies are not yet as well understood as those for finding optimal infinite-horizon discounted policies.

1.3 Measuring Learning Performance

The criteria given in the previous section can be used to assess the quality of the policies learned by a given algorithm. We would also like to be able to evaluate the quality of learning itself. There are several incompatible measures in use.

Eventual convergence to optimal. Many algorithms come with a provable guarantee of asymptotic convergence to optimal behavior (Watkins & Dayan, 1992). This is reassuring, but useless in practical terms. An agent that quickly reaches a plateau at 99% of optimality may, in many applications, be preferable to an agent that has a guarantee of eventual optimality but a sluggish early learning rate.