Supplementary Methods

Subjects and behavioral task

14 right-handed human subjects participated in the task. The subjects were pre-assessed to exclude those with a prior history of neurological or psychiatric illness. All gave informed consent, and the study was approved by the local ethics committee.

The task consisted of two sessions of 150 trials each, separated by a short break. On each trial, subjects were presented with pictures of four different colored slot machines (visible on a screen reflected in a head coil mirror), and selected one using a button box with their right hand (see Fig. 1a). Subjects had a maximum of 1.5 seconds in which to make their choice; if no choice was entered during that interval, a large red X was displayed for 4.2 seconds to signal an invalid missed trial (after which a new trial was triggered). Subjects usually responded well before the timeout, with a mean response time of ~430 ms. Overall, there were very few missed trials (typically 1 or 2 per subject). On valid trials, the chosen slot machine was animated and, three seconds later, the number of points earned was displayed. These points were displayed for 1 second and then the screen was cleared. The trial sequence ended 6 seconds after trial onset, followed by a jittered intertrial interval drawn from a discrete approximation of a Poisson distribution with a mean of 2 seconds, before the next trial was triggered.

The payoff for choosing the ith slot machine on trial t was between 1 and 100 points, drawn from a Gaussian distribution (standard deviation σ_o = 4) around a mean μ_{i,t} and rounded to the nearest integer. At each timestep, the means diffused in a decaying Gaussian random walk, with

μ_{i,t+1} = λ μ_{i,t} + (1 − λ) θ + ν

for each i. The decay parameter λ was 0.9836, the decay center θ was 50, and the diffusion noise ν was zero-mean Gaussian (standard deviation σ_d = 2.8). Each subject was exposed to one of three instantiations of this process; one is illustrated in Figure 1B.
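The payoff process above can be simulated with a few lines of code. The following is a minimal sketch using the parameter values quoted in the text; all function and variable names are ours, and the initialization of the means at the decay center is an assumption for illustration only.

```python
import random

# Parameters of the payoff process described above.
LAMBDA = 0.9836   # decay parameter (lambda)
THETA = 50.0      # decay center (theta)
SIGMA_D = 2.8     # diffusion noise s.d. (sigma_d)
SIGMA_O = 4.0     # payoff noise s.d. (sigma_o)
N_ARMS = 4        # four slot machines
N_TRIALS = 300    # two sessions of 150 trials

def step_means(means):
    """One step of the decaying Gaussian random walk:
    mu_{i,t+1} = lambda * mu_{i,t} + (1 - lambda) * theta + nu."""
    return [LAMBDA * m + (1.0 - LAMBDA) * THETA + random.gauss(0.0, SIGMA_D)
            for m in means]

def payoff(mean):
    """Payoff: Gaussian around the arm's mean, rounded to the nearest
    integer and kept between 1 and 100 points."""
    return min(100, max(1, round(random.gauss(mean, SIGMA_O))))

# Illustrative run: start every arm at the decay center (an assumption).
means = [THETA] * N_ARMS
for t in range(N_TRIALS):
    r = payoff(means[0])   # e.g., payoff if arm 0 were chosen on this trial
    means = step_means(means)
```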
Subjects were instructed that they would be paid "according to how many points you have won in total over the experiment," and to expect average earnings of about 20 UK pounds. However, they were not advised of the actual exchange rate for points, nor of their cumulative point totals. At the completion of the task (due to behavioral protocol restrictions on differential treatment of subjects), each was paid 19 UK pounds.

Kalman filter model

The Kalman filter^1 is the Bayesian mean-tracking rule for the diffusion process described above. Assume the subject believes the process is governed by parameters σ̂_o, σ̂_d, λ̂, and θ̂ (corresponding to σ_o, σ_d, λ, and θ above). Given, on trial t, a prior distribution over the true mean payoffs μ_{i,t} as independent Gaussians, N(μ̂_{i,t}^pre, (σ̂_{i,t}^pre)^2), then if option c_t is chosen and payoff r_t received, the posterior mean for that option is:

μ̂_{c_t,t}^post = μ̂_{c_t,t}^pre + κ_t δ_t

with prediction error δ_t = r_t − μ̂_{c_t,t}^pre and learning rate ("gain") κ_t = (σ̂_{c_t,t}^pre)^2 / ((σ̂_{c_t,t}^pre)^2 + σ̂_o^2). The posterior variance for the chosen option is

(σ̂_{c_t,t}^post)^2 = (1 − κ_t) (σ̂_{c_t,t}^pre)^2.

The posterior mean and variance for the unchosen options are unchanged by the observation. Taking into account the diffusion process, the prior distributions on the subsequent trial are given by

μ̂_{i,t+1}^pre = λ̂ μ̂_{i,t}^post + (1 − λ̂) θ̂   and   (σ̂_{i,t+1}^pre)^2 = λ̂^2 (σ̂_{i,t}^post)^2 + σ̂_d^2

for all i. The recursive process is initialized with prior distribution N(μ̂_{i,0}^pre, (σ̂_{i,0}^pre)^2). Note that the heart of this procedure is an error-driven learning rule of the same form as TD or other delta-rule methods; the difference is the additional tracking of the uncertainties (σ̂_{i,t})^2, which determine
the trial-specific learning rates κ_t. In general, uncertainties decrease for sampled options and increase for unsampled ones.

Together with this tracking rule, we examined three choice rules, each of which determined the probability P_{i,t} of choosing option i on trial t as a function of the estimated payoffs. The ε-greedy rule is:

P_{i,t} = 1 − ε   if i = argmax_j μ̂_{j,t}^pre
P_{i,t} = ε/3     otherwise

with exploration parameter ε. (If there is a tie for the winning action, they are made equally probable.) The softmax rule is:

P_{i,t} = exp(β μ̂_{i,t}^pre) / Σ_j exp(β μ̂_{j,t}^pre)

with exploration parameter β. Finally, we tested a rule in which an exploration bonus^2 of φ standard deviations was added to the expected mean payoff, and choices were softmax in this adjusted value:

P_{i,t} = exp(β [μ̂_{i,t}^pre + φ σ̂_{i,t}^pre]) / Σ_j exp(β [μ̂_{j,t}^pre + φ σ̂_{j,t}^pre])

Note that this model nests uncertainty bonuses within a softmax scheme: it reduces to the simple softmax model for φ = 0 (as was nearly the case in our behavioral fits) and to classic deterministic uncertainty-bonus exploration as β approaches infinity with φ positive. Between these regimes, the model spans hybrids combining contributions of both approaches differentially according to the parameters.

Behavioral analysis
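For concreteness, the Kalman-filter update and the three choice rules described in the preceding section can be sketched as follows. This is our own minimal illustration, not the authors' code: the believed-parameter values reuse the true process parameters quoted earlier (an assumption), and all function names are ours. For brevity, the ε-greedy sketch picks the first maximizer rather than splitting ties equally.

```python
import math

# Believed process parameters (assumed equal to the true ones quoted above).
SIGMA_O = 4.0    # sigma_o-hat: observation-noise s.d.
SIGMA_D = 2.8    # sigma_d-hat: diffusion-noise s.d.
LAMBDA = 0.9836  # lambda-hat: decay parameter
THETA = 50.0     # theta-hat: decay center

def kalman_update(mu, var, chosen, reward):
    """Posterior update for the chosen arm, then the diffusion step that
    yields the next trial's priors for all arms."""
    mu, var = list(mu), list(var)
    delta = reward - mu[chosen]                       # prediction error delta_t
    kappa = var[chosen] / (var[chosen] + SIGMA_O**2)  # gain (learning rate) kappa_t
    mu[chosen] += kappa * delta
    var[chosen] *= (1.0 - kappa)                      # posterior variance shrinks
    # Diffusion: unsampled arms' uncertainties grow via the sigma_d^2 term.
    mu = [LAMBDA * m + (1.0 - LAMBDA) * THETA for m in mu]
    var = [LAMBDA**2 * v + SIGMA_D**2 for v in var]
    return mu, var

def epsilon_greedy_probs(mu, eps):
    """Epsilon-greedy: 1 - eps on the greedy arm, eps split over the rest."""
    best = max(range(len(mu)), key=lambda i: mu[i])
    return [1.0 - eps if i == best else eps / (len(mu) - 1)
            for i in range(len(mu))]

def softmax_probs(mu, beta):
    """Softmax: P_i proportional to exp(beta * mu_i)."""
    w = [math.exp(beta * m) for m in mu]
    z = sum(w)
    return [x / z for x in w]

def bonus_softmax_probs(mu, var, beta, phi):
    """Uncertainty bonus: softmax over mu_i + phi * sd_i."""
    return softmax_probs([m + phi * math.sqrt(v) for m, v in zip(mu, var)],
                         beta)
```

With φ = 0 the bonus rule returns exactly the softmax probabilities, matching the nesting noted above.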