

Cerebral Cortex Advance Access originally published online on June 4, 2007
Cerebral Cortex 2007 17(Supplement 1):i110-i117; doi:10.1093/cercor/bhm064

© The Author 2007. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Dynamic Signals Related to Choices and Outcomes in the Dorsolateral Prefrontal Cortex

Hyojung Seo1, Dominic J. Barraclough2 and Daeyeol Lee1

1 Department of Neurobiology, Yale University School of Medicine, New Haven, CT 06510, USA, 2 Department of Neurobiology and Anatomy, Center for Visual Science, University of Rochester, Rochester, NY 14627, USA

Address correspondence to Department of Neurobiology, Yale University School of Medicine, 333 Cedar Street, SHM C303, New Haven, CT 06510, USA. Email: daeyeol.lee@yale.edu.


    Abstract
 
Although economic theories based on utility maximization account for a range of choice behaviors, utilities must be estimated through experience. Dynamics of this learning process may account for certain discrepancies between the predictions of economic theories and real choice behaviors of humans and other animals. To understand the neural mechanisms responsible for such adaptive decision making, we trained rhesus monkeys to play a simulated matching pennies game. Small but systematic deviations of the animal's behavior from the optimal strategy were consistent with the predictions of reinforcement learning theory. In addition, individual neurons in the dorsolateral prefrontal cortex (DLPFC) encoded 3 different types of signals that can potentially influence the animal's future choices. First, activity modulated by the animal's previous choices might provide the eligibility trace that can be used to attribute a particular outcome to its causative action. Second, activity related to the animal's rewards in the previous trials might be used to compute an average reward rate. Finally, activity of some neurons was modulated by the computer's choices in the previous trials and may reflect the process of updating the value functions. These results suggest that the DLPFC might be an important node in the cortical network of decision making.

Key Words: eligibility trace • neuroeconomics • reinforcement learning • reward • state-space model


    Introduction
 
In order to make decisions optimally, animals must be able to predict the outcomes of their actions efficiently and choose the action that produces the most desirable outcomes. For optimal decision making, therefore, the animal needs to know the mapping between its actions and outcomes for various states of its environment. In reality, however, the properties of the animal's environment change almost constantly and therefore are seldom fully known. To improve their decision-making strategies adaptively, animals therefore need to update continually their estimates of the outcomes expected from their actions. Reinforcement learning theory provides a formal description of this process (Sutton and Barto 1998).

In reinforcement learning, the animal's estimates for the sum of all future rewards are referred to as value functions. Value functions can be used to predict the reward expected at each time step, and the discrepancy between the predicted reward and actual reward, referred to as the reward prediction error, can be used to update the value functions (Sutton and Barto 1998). Reward prediction errors are encoded by midbrain dopamine neurons (Schultz 2006). In general, however, how specific computations of reinforcement learning algorithms, such as temporal integration of reward prediction errors, are implemented in different brain areas is still not well known (Lee 2006; Daw and Doya 2006).

The prefrontal cortex has long been recognized for its contribution to working memory (Goldman-Rakic 1995), and much research has focused on how information held in working memory can be used flexibly to guide the animal's behavior (Miller and Cohen 2001). However, the prefrontal cortex may also play an important role in reinforcement learning. For example, signals related to the value functions and the choice outcomes have been identified in the prefrontal cortex (Daw and Doya 2006). Previously, we showed that during computer-simulated competitive games (von Neumann and Morgenstern 1944), monkeys might approximate optimal decision-making strategies using reinforcement learning algorithms (Lee et al. 2004, 2005). We have also found that during the same task, individual neurons in the dorsolateral prefrontal cortex (DLPFC) encode signals related to the animal's choice and its outcome in the previous trial (Barraclough et al. 2004). In the present study, we investigated whether and how the signals related to the animal's choice and its outcome are maintained across multiple trials in the DLPFC.


    Materials and Methods
 
Animal Preparations

Five rhesus monkeys (4 males and 1 female; body weight = 5–12 kg) were used. The animal's eye movements were monitored at a sampling rate of 250 or 500 Hz with either a scleral eye coil (Riverbend Instrument, Birmingham, AL) or a high-speed video-based eye tracker (ET 49, Thomas Recording, Giessen, Germany). Once the animal was fully trained on the behavioral tasks, a recording chamber was attached over the DLPFC. In 3 animals, a second recording chamber was implanted over the parietal cortex, and activity was often recorded simultaneously from the 2 chambers. All the procedures used in this study conformed to the National Institutes of Health guidelines and were approved by the University of Rochester Committee on Animal Research.

Behavioral Tasks

Monkeys were trained to perform an oculomotor free-choice task that simulated a 2-player zero-sum game, known as the matching pennies task (Fig. 1). A trial began when the animal fixated a small yellow square in the center of a computer screen. Following a 0.5-s fore period, 2 green disks were presented along the horizontal meridian, and the central fixation target was extinguished after a 0.5-s delay period. The animal was then required to shift its gaze toward one of the peripheral targets and maintain its fixation during a 0.5-s hold period. At the end of this hold period, a red feedback ring appeared around the target chosen by the computer. The animal was rewarded only when it chose the same target as the computer and successfully maintained its fixation on the chosen target during the 0.5-s (0.2 s for some neurons, n = 133) feedback period following the feedback onset.


 
Figure 1. Visual stimuli and payoff matrix (inset) for the matching pennies game. The duration of fore period and delay period was 0.5 s, and the animal was required to shift its gaze toward one of the peripheral targets within 1 s after the central target was extinguished and hold its fixation for 0.5 s (Sacc/Fix). The duration of the feedback ring was 0.2 or 0.5 s.

 
As described previously (Barraclough et al. 2004; Lee et al. 2004), the computer was programmed to exploit statistical biases in the animal's choice behavior. At the beginning of each trial, the computer made a prediction about the animal's choice by applying a set of statistical tests based on the animal's entire choice and reward history during a given recording session. First, the probability that the animal would choose a particular target as well as a set of conditional probabilities that the animal would choose a particular target given the animal's choices in the preceding n trials (n = 1–4) were estimated. In addition, the conditional probabilities that the animal would choose a particular target given its choices and rewards in the preceding n trials (n = 1–4) were also estimated. Second, for each of these 9 probabilities, the computer tested the null hypothesis that the animal had chosen the 2 targets randomly with equal probabilities and independently of its choices and their outcomes in the previous trials. When this null hypothesis was not rejected for any probability, the computer selected each target with a 0.5 probability. Otherwise, the computer biased its selection according to the conditional probability with the largest deviation from 0.5 that was statistically significant (binomial test, P < 0.05). For example, if the animal chose the rightward target significantly more frequently, with a 0.75 probability, following a rewarded trial in which the animal selected the leftward target, and if this conditional probability deviated more from 0.5 than any other conditional probabilities, then the computer chose the leftward target with a 0.75 probability.
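The opponent's procedure can be sketched in code. The following Python example is only an illustration of the logic described above, not the software used in the experiments: the function names are hypothetical, choices are coded 0 (left) and 1 (right), and the binomial test is assumed to be an exact two-sided test against 0.5, which the text does not specify.

```python
import random
from math import comb

def binomial_two_sided_p(k, n):
    """Exact two-sided binomial test of k successes in n trials against p = 0.5."""
    if n == 0:
        return 1.0
    # The null distribution is symmetric, so double the smaller tail.
    tail = min(k, n - k)
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(p, 1.0)

def history_key(choices, rewards, end, n, use_reward):
    """History of the n trials immediately preceding trial index `end`."""
    if use_reward:
        return tuple(zip(choices[end - n:end], rewards[end - n:end]))
    return tuple(choices[end - n:end])

def predicted_right_prob(choices, rewards, max_lag=4, alpha=0.05):
    """Estimated P(animal chooses right) with the largest significant bias, else 0.5."""
    t = len(choices)
    best_prob, best_dev = 0.5, 0.0
    # 9 tests: 1 unconditional, 4 conditioned on choices, 4 on choices and rewards.
    tests = [(0, False)] + [(n, r) for n in range(1, max_lag + 1) for r in (False, True)]
    for n, use_reward in tests:
        if t < n:
            continue
        current = history_key(choices, rewards, t, n, use_reward)
        # Choices made on past trials that followed the same n-trial history.
        following = [choices[i] for i in range(n, t)
                     if history_key(choices, rewards, i, n, use_reward) == current]
        total = len(following)
        if total == 0:
            continue
        right = sum(following)
        prob = right / total
        dev = abs(prob - 0.5)
        if binomial_two_sided_p(right, total) < alpha and dev > best_dev:
            best_prob, best_dev = prob, dev
    return best_prob

def computer_choice(choices, rewards, rng=random):
    """The animal wins by matching, so the computer favors the target the animal avoids."""
    p_right = predicted_right_prob(choices, rewards)
    return 1 if rng.random() < 1.0 - p_right else 0
```

With an animal that always alternates, for instance, the sketch predicts the next choice from the preceding one and the computer then selects the opposite target with certainty.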

At the beginning of each recording session, the animal performed 130 trials of a visual search task, which was identical to the matching pennies task described above, except that one of the 2 peripheral targets was red. The position of the red target was chosen pseudorandomly for each trial. The animal was never rewarded when it selected the red target. In addition, to match the overall reward probability for the 2 tasks, the animal was rewarded with a 0.5 probability when it selected the green target in the search task.

Neurophysiological Recording

Single-neuron activity was recorded extracellularly in the DLPFC, using a 5-channel multielectrode recording system (Thomas Recording, Giessen, Germany). The placement of the recording chamber was guided by magnetic resonance images, and this was confirmed in 2 animals by metal pins inserted in known anatomical locations at the end of the experiments. In addition, the frontal eye field (FEF) was localized in all animals as sites in which eye movements were evoked by electrical stimulation with currents <50 µA during active fixation of a visual target (Goldberg et al. 1986). All the neurons described in this study were anterior to the FEF.

Reinforcement Learning Model

In reinforcement learning models, the value function for choosing target x is updated according to the reward prediction error (Sutton and Barto 1998) as follows:

$$V_{t+1}(x) = V_t(x) + \beta \, [r_t - V_t(x)]$$
where V_t(x) denotes the value function for target x in trial t, r_t the reward received by the animal in trial t, and β the step-size parameter. This can be rearranged as

$$V_{t+1}(x) = (1 - \beta) \, V_t(x) + \beta \, r_t$$

In other words, the process of updating the value function can be described by a first-order autoregressive model with an exogenous input. To indicate this explicitly and be consistent with the notation in our previous studies (Barraclough et al. 2004; Lee et al. 2004), the above equation was reparameterized as

$$V_{t+1}(x) = \alpha \, V_t(x) + \Delta_t(x)$$
where the decay factor α = (1 – β) and the exogenous input Δ_t(x) = βr_t. A large decay factor indicates that the outcome of the animal's choice in a given trial would influence the animal's choices across a relatively large number of trials. We assumed that Δ_t(x) = Δ_rew if the animal was rewarded at trial t, and Δ_t(x) = Δ_unrew otherwise. Thus, Δ_rew and Δ_unrew reflect how the value function for the target chosen by the animal is influenced by the outcome of the animal's choice. The signs of these parameters indicate whether the animal would be more likely to choose the same target in the future trials. For example, positive Δ_rew and negative Δ_unrew correspond to the so-called win-stay and lose-switch strategies, respectively. The probability that the animal would choose the rightward target in trial t, P_t(R), was then determined by the softmax transformation as follows:

$$P_t(R) = \frac{\exp V_t(R)}{\exp V_t(L) + \exp V_t(R)}$$

As a result, the probability of choosing the rightward target increased gradually as its value function increased and as the value function for the leftward target decreased. It should be noted that the above equation does not include the inverse temperature parameter, so the magnitudes of Δ_rew and Δ_unrew determine how deterministically the animal's choice is influenced by the outcomes of its previous choices (Lee et al. 2004). All model parameters (α, Δ_rew, Δ_unrew) were estimated separately for each recording session using a maximum likelihood procedure (Pawitan 2001; Lee et al. 2004). The best parameters were taken from 5 independent searches, each started from initial parameters chosen randomly in the interval [0, 1] for α and Δ_rew and [–1, 0] for Δ_unrew. The maximum likelihood procedure was implemented using the fminsearch function in Matlab 7.0 (Mathworks Inc., Natick, MA). Because fminsearch performs an unconstrained search, the fitted parameters were not restricted to any particular interval.
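As an illustration of the quantity maximized in this procedure, the following Python sketch computes the log-likelihood of a choice sequence under the model. It is not the Matlab code used in the study. The function name is hypothetical, and two details are assumptions not stated in the text: both value functions start at zero, and the value function of the unchosen target decays by α with no exogenous input.

```python
import math

def choice_log_likelihood(choices, rewards, alpha, d_rew, d_unrew):
    """Log-likelihood of choices (0 = left, 1 = right) given rewards (0 or 1)."""
    v = [0.0, 0.0]  # value functions for left and right (zero start is an assumption)
    log_lik = 0.0
    for c, r in zip(choices, rewards):
        # Softmax without an inverse temperature:
        # P(R) = exp V(R) / (exp V(L) + exp V(R)) = 1 / (1 + exp(V(L) - V(R))).
        p_right = 1.0 / (1.0 + math.exp(v[0] - v[1]))
        log_lik += math.log(p_right if c == 1 else 1.0 - p_right)
        # Chosen target: decay plus the outcome-dependent increment.
        v[c] = alpha * v[c] + (d_rew if r == 1 else d_unrew)
        # Unchosen target: decay only (an assumption of this sketch).
        v[1 - c] = alpha * v[1 - c]
    return log_lik
```

A fit would search over (alpha, d_rew, d_unrew) for the values that maximize this function, for example by passing its negative to a general-purpose minimizer from several random starting points.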

Time-Series Analyses of Neural Data

A series of time-series models were applied to determine whether the activity of a given neuron was influenced by the choices of the animal and computer and by the rewards in the current and previous trials. For these analyses, spikes were counted for successive 0.5-s bins defined relative to the time of target onset or feedback onset. In the present study, we focused on 3 different time bins corresponding to the fore period, delay period, and feedback period. Variability in the activity of cortical neurons is often temporally correlated (Lee et al. 1998; Bair et al. 2001). Therefore, in order to distinguish the effects of different behavioral variables in the previous trials on neural activity from the temporal correlation in neural activity resulting from slow changes in the neuron's intrinsic excitability, the spike counts were detrended separately for each bin by taking the residuals from a linear regression model that includes the trial number as the independent variable. We also modeled the temporal correlation in neural activity using a first-order autoregressive moving-average model with exogenous inputs (ARMAX; Ljung 1999). Because this model included a first-order moving-average term and a first-order autoregressive term, it is commonly referred to as ARMAX(1,1). In this model, the detrended spike count in a particular bin of trial t, y(t), is given by the following:

$$y(t) = A \, y(t-1) + [u(t) \;\; u(t-1) \;\; u(t-2) \;\; u(t-3)] \, B^{T} + e(t) + C \, e(t-1)$$
where u(t) is a row vector consisting of 3 binary variables corresponding to the animal's choice (0 and 1 for leftward and rightward choices, respectively), the computer's choice (coded as for the animal's choice), and the reward (0 and 1 for unrewarded and rewarded trials, respectively) in trial t; A (1 × 1), B (1 × 12), and C (1 × 1) are the vectors of coefficients; and e(t) is the error term. As special cases of this ARMAX model, we also considered 1) a model without any autoregressive or moving-average terms (A = 0 and C = 0), ARMAX(0,0); 2) a model with only the first-order autoregressive term (C = 0), ARMAX(1,0); and 3) a model with only the first-order moving-average term (A = 0), ARMAX(0,1), in addition to 4) the full ARMAX(1,1) model described above. For ARMAX(0,0), which is equivalent to the standard multiple linear regression model, the statistical significance of each coefficient in B was determined with a t-test (P < 0.05).
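The two preprocessing steps described above, detrending the spike counts and assembling the 12 exogenous inputs, can be sketched as follows. This Python example is a minimal illustration with hypothetical function names. The ordering of the 12 regressors (trial lags 0 to 3, each contributing the animal's choice, the computer's choice, and the reward) is an assumption inferred from the stated 1 × 12 size of B.

```python
def detrend(counts):
    """Residuals from a least-squares line fit of spike count against trial number.

    Assumes a non-empty sequence of spike counts for one time bin.
    """
    n = len(counts)
    trials = range(n)
    mean_t = sum(trials) / n
    mean_y = sum(counts) / n
    sxx = sum((t - mean_t) ** 2 for t in trials)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(trials, counts))
    slope = sxy / sxx if sxx else 0.0
    return [y - (mean_y + slope * (t - mean_t)) for t, y in zip(trials, counts)]

def lagged_row(animal, computer, reward, t, max_lag=3):
    """Regressors for trial index t: the 3 binary variables at trial lags 0..max_lag."""
    if t < max_lag:
        raise ValueError("trial index must be at least max_lag")
    row = []
    for lag in range(max_lag + 1):
        row += [animal[t - lag], computer[t - lag], reward[t - lag]]
    return row
```

Regressing the detrended counts on these rows by ordinary least squares corresponds to the ARMAX(0,0) case; the other models add the autoregressive and moving-average terms to the same inputs.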

We also applied the state-space model to the same data. The state-space model consists of the following transition and observation equations:

$$x(t+1) = F \, x(t) + G \, u(t)^{T} + K \, e(t)$$
$$y(t) = H \, x(t) + J \, u(t)^{T} + e(t)$$
where F (1 × 1), G (1 × 3), K (1 × 1), H (1 × 1), and J (1 × 3) are the vectors of coefficients. In the present study, we only considered a one-dimensional state space. Therefore, this model assumes that the effects of the behavioral events in the previous trials on the neural activity are mediated by the state-space variable that follows a first-order autoregressive process. If this assumption is true, then the above state-space model would account for the data more parsimoniously, namely, with fewer parameters than other time-series models. The performance of each model was evaluated with Akaike's information criterion (AIC), given by

$$\mathrm{AIC} = -2 \ln L + 2N$$
where L denotes the likelihood of the model computed under the assumption of Gaussian distribution for the error and N the number of model parameters.
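To illustrate how this criterion trades goodness of fit against model size, the following Python sketch computes the AIC from a model's residuals. It assumes the standard definition, AIC = –2 ln L + 2N, and a Gaussian likelihood evaluated at the maximum likelihood estimate of the error variance; the function name is hypothetical and the study's own implementation may differ in such details.

```python
import math

def gaussian_aic(residuals, n_params):
    """AIC = -2 ln L + 2N with a Gaussian likelihood for the residuals."""
    n = len(residuals)
    # Maximum likelihood estimate of the error variance.
    sigma2 = sum(e * e for e in residuals) / n
    # Gaussian log-likelihood evaluated at that variance estimate.
    log_lik = -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)
    return -2.0 * log_lik + 2.0 * n_params
```

Counting only the coefficients listed in the text, the state-space model has 9 parameters (F, G, K, H, and J) and ARMAX(1,1) has 14 (A, B, and C). Under that counting, which is an assumption about how N was tallied, the state-space model is preferred whenever its fit is not worse by more than 10 AIC units' worth of likelihood.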

To test whether signals related to the animal's choices and rewards in the previous trial are influenced by the type of decisions made by the animal, the neural activity during the search trials (n = 130) and the first 260 trials of the matching pennies were analyzed using the following regression analysis. The neurons examined for <260 trials in the matching pennies task were excluded from this analysis.

Formula
where y(t) is the detrended spike count at trial t, u(t) is a vector consisting of the animal's choice, the computer's choice, and the reward in trial t, v(t) = u(t) for the trials in the matching pennies task (i.e., t = 131–390) and 0 otherwise, and w(t) = u(t) for the second block of 130 trials in the matching pennies task (i.e., t = 261–390) and 0 otherwise. B_All, B_Task, and B_Block are the vectors of regression coefficients. Thus, B_All reflects the overall strength of signals related to various behavioral events, whereas B_Task reflects the extent to which the neural activity related to the same behavioral events differs for the 2 tasks. By comparing the activity during the second block of 130 trials during the matching pennies task to the activity in the preceding trials, B_Block provides an estimate of nonstationarity in neural activity related to choices and rewards. For the search task, the computer's choice was defined as the correct (green) and incorrect (red) targets in rewarded and unrewarded trials, respectively.
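The construction of the task and block regressors can be sketched as follows. This Python example is illustrative only: the function name is hypothetical, and it assumes that the 130 search trials come first, followed by 260 matching pennies trials, with trials numbered from 1 as in the text.

```python
def task_block_regressors(u, n_search=130, n_block=130):
    """Build v(t) and w(t) from u, a list of per-trial (animal, computer, reward) tuples."""
    zero = (0, 0, 0)
    v, w = [], []
    for i, u_t in enumerate(u):
        t = i + 1  # 1-based trial number, as in the text
        # v(t) = u(t) during the matching pennies task (t = 131-390), 0 otherwise.
        v.append(u_t if t > n_search else zero)
        # w(t) = u(t) during its second block of 130 trials (t = 261-390), 0 otherwise.
        w.append(u_t if t > n_search + n_block else zero)
    return v, w
```

Because v(t) is nonzero for every matching pennies trial and w(t) only for the second block, a coefficient on w(t) captures changes within the matching pennies task that a coefficient on v(t) alone would misattribute to the change of task.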


    Results
 
Choice Behavior during the Matching Pennies Task

The behavioral data were collected from a total of 81 742 trials in 140 recording sessions (Table 1). These data were analyzed by fitting a reinforcement learning model (see Materials and Methods). The parameters of the model were estimated separately for each recording session. Overall, the decay factors were relatively large and skewed toward one, indicating that the effects of the previous choice outcomes were integrated over multiple trials (Fig. 2A). In addition, in approximately two-thirds of the sessions (94/140 sessions), the value functions increased and decreased for the target chosen by the animal, when it was rewarded and unrewarded, respectively (Fig. 2B).


 
Figure 2. Parameters of the reinforcement learning model applied to choice behaviors in the matching pennies task. (A) Distribution of decay factors. (B) Scatter plot for the incremental changes in the value function applied after rewarded (abscissa) and unrewarded (ordinate) trials.

 



 
Table 1 The number of recording sessions, trials, and neurons in each animal

 
Signals Related to Previous Choices and Outcomes

Single-unit activity was recorded from 322 neurons in the DLPFC during the matching pennies task (Table 1). Each neuron was tested for 130 trials during the search task, and at least 128 trials during the matching pennies task. The average number of trials tested during the matching pennies task was 584 (Table 1). Many of these neurons modulated their activity according to the animal's previous choices and their outcomes. In some neurons, the previous choices of the computer opponent also influenced their activity. However, the time course of the activity related to these different behavioral events varied substantially across different neurons (Figs 3 and 4). For example, for the neuron illustrated in Figure 3, the time courses and strengths of signals related to the animal's choice, the computer's choice, and reward were relatively similar. This neuron increased its activity around the time of eye movements, when the animal chose the rightward target (Fig. 3, top, trial lag = 0). In addition, the activity of the same neuron increased during the fore period and delay period when the animal had selected the rightward target in the previous trial (Fig. 3, top, trial lag = 1) and when the animal had been rewarded in the previous trial (Fig. 3, bottom, trial lag = 1). During the delay period, this neuron also increased its activity when the computer opponent had selected the rightward target in the previous trial (Fig. 3, middle, trial lag = 1). On the other hand, the neuron shown in Figure 4 modulated its activity mostly according to the recent reward history of the animal. When the animal was rewarded in a particular trial, its immediate effect was to increase the neuron's activity (Fig. 4, bottom, trial lag = 0). However, the activity of this neuron was reduced when the animal was rewarded in the previous 2 trials (Fig. 4, bottom, trial lag = 1 and 2). 
Thus, the activity of this neuron during the feedback period was enhanced when the animal was rewarded after one or more unrewarded trials.


 
Figure 3. An example neuron in the DLPFC that modulated its activity according to the animal's choice, computer's choice, and reward in the previous trial (trial lag = 1). Each panel shows a pair of spike density functions estimated separately for trials sorted by the animal's choice, the computer's choice, or reward in the current trial (trial lag = 0) or previous trials (trial lag = 1–3). Neural activity is aligned according to the target onset (left plots) or feedback onset (right plots). In the top 2 rows, the black and blue lines correspond to the leftward and rightward choices, whereas in the bottom row, they correspond to the unrewarded and rewarded trials, respectively. Dotted vertical lines correspond to the time when the animal fixated the central target or the onset time of feedback ring. Circles are the standardized regression coefficients from a linear regression model, and filled circles indicate that the effect was statistically significant (t-test, P < 0.05).

 


 
Figure 4. Another example neuron in the DLPFC that modulated its activity according to the animal's choice and reward in the previous trial. In this neuron, the effect of reward was maintained in multiple trials. Same format as in Figure 3.

 
How the activity of neurons in the DLPFC was influenced by different behavioral events was quantified using a regression model, which is referred to as ARMAX(0,0), in the present study. The results from this analysis showed that the animal's choice, the computer's choice, and the reward in the previous trial significantly influenced the activity in 36.0%, 19.3%, and 37.9% of the DLPFC neurons during the fore period, respectively, and in 40.1%, 18.3%, and 33.2% of the neurons during the delay period (Fig. 5). The fractions of DLPFC neurons that significantly modulated their activity during the fore period in a given trial according to the choice made by the animal and reward 2 trials before were both 11.2%. The fraction of neurons that displayed significant modulations in their activity according to the choice made by the computer opponent 2 trials before was 7.8% during the fore period. During the feedback period, the activity of DLPFC neurons was frequently affected by the animal's choice, the computer's choice and reward in the same trial but was also affected by the animal's choice and reward in the 3 previous trials (Fig. 5).


 
Figure 5. Time course of neural signals related to choice and reward. Histograms show the fractions of neurons that displayed significant modulations in their activity according to the animal's choice (top), the computer's choice (middle), and the choice outcome (reward, bottom) during various time bins in the current (trial lag = 0) and previous (trial lag = 1, 2, and 3) trials. The asterisks indicate that the proportion of neurons is significantly higher than the significance level (0.05) used in the regression analysis according to a binomial test (P < 0.05).

 
Task-Specific Choice Signals

To test whether activity changes related to the animal's previous choices and their outcomes were specific to the matching pennies task, we analyzed the activity of 284 neurons in which the data were collected from at least 260 trials during the matching pennies task in addition to 130 trials during the search task. In order to exclude the possibility that seemingly task-specific activity might be due to random nonstationary changes in neural activity, we included a set of control variables to determine whether significant changes also occurred during the 2 successive blocks of trials in the matching pennies task (see Materials and Methods). We found that the overall percentages of neurons that modulated their activity according to the previous choices of the animal and the computer opponent or the outcomes of the animal's choices were similar for the search task and the matching pennies task (data not shown). Nevertheless, many neurons in the DLPFC encoded the signals related to the animal's choices in the current and previous trials differently for the 2 tasks (Fig. 6). For example, during the delay period, 23.9% of the neurons modulated their activity according to the animal's choice differently for the search task and the matching pennies task (Fig. 6, top, trial lag = 0). This is not surprising because the animal received explicit instructions about its eye movement only in the search task. By contrast, only 3.5% of the neurons displayed similar changes during the delay period between the 2 successive blocks of trials in the matching pennies task. This difference was statistically significant (χ² test, P < 10^–10). During the delay period, 15.5% of the neurons also modulated their activity according to the animal's choice in the previous trial differently for the 2 tasks, whereas only 5.3% of the neurons displayed similar changes between the 2 blocks of trials in the matching pennies task (χ² test, P < 10^–4). By contrast, large task-specific changes in neural activity related to the choice of the computer opponent or reward were seen only during the feedback period of the same trial (Fig. 6, middle and bottom), suggesting that the outcomes of the animal's previous choices similarly influenced the neural activity in the DLPFC for the 2 tasks.


 
Figure 6. Task-specific modulation of neural signals related to choice and reward. Histograms labeled "All" show the fractions of neurons that displayed overall modulations in their activity according to the animal's choice, the choice of the computer, and reward in the current trial (trial lag = 0) or in the previous trials (trial lag = 1 or 2) regardless of the task. Histograms labeled "Task" show the fractions of neurons in which the effects of behavioral events on neural activity differed significantly for the trials in the search task and matching pennies task. Finally, histograms labeled "Block" show the fractions of neurons that displayed nonstationary changes in the signals related to choices and rewards between the 2 successive blocks of 130 trials in the matching pennies task.

 
Comparison of ARMAX and State-Space Models

The regression model described above included each of 3 different behavioral events in 3 previous trials. This postulates that information about each of these distinct events is stored in the brain separately, and their effects are combined to determine the activity of individual neurons in a given trial. Alternatively, neural activity in a given trial might be determined by the state of the brain that undergoes certain dynamic changes on a trial-by-trial basis under the influence of certain behavioral events. To test this possibility, we applied a state-space model, commonly known as the Kalman filter model, to estimate the state in each trial and used this state information to predict the activity of each neuron (see Materials and Methods). For comparison, 3 other time-series models were fit to the data, namely, a first-order autoregressive model, a first-order moving-average model, and a first-order autoregressive moving-average model. All these models included the same exogenous input variables used in the state-space model. According to the AIC, the state-space model was selected as the best model most frequently regardless of the epochs examined (Fig. 7). For some neurons, the autoregressive model or the autoregressive moving-average model performed better than the state-space model. The moving-average model was never chosen as the best model.


 
Figure 7. Fraction of neurons in the DLPFC for which a particular time-series model was chosen as the best model. SS(1), first-order state-space model; (0, 0), regression model without autoregressive or moving-average terms; (1, 0), first-order autoregressive model; (0, 1), first-order moving-average model; and (1, 1), first-order autoregressive moving-average model.

 

    Discussion
 
Using a decision-making task that simulated a simple competitive interaction with another decision maker, we found that monkeys tend to seek an optimal decision-making strategy according to a reinforcement learning algorithm (Lee et al. 2004; Corrado et al. 2005; Lau and Glimcher 2005; Lee et al. 2005; Samejima et al. 2005). The decay factors in the reinforcement learning model were relatively large, suggesting that the outcomes of multiple trials in the past were temporally integrated and influenced the animal's choice in a given trial. Consistent with this behavioral finding, a significant number of neurons in the DLPFC also modulated their activity according to the animal's choices and their outcomes in multiple trials. The fact that the model based on a state space accounted for the neural data more parsimoniously compared with other time-series models suggests that the signals related to the animal's choices and their outcomes might be temporally integrated in the form of a state variable in the DLPFC. Therefore, the DLPFC might be an important node in the cortical network that is responsible for monitoring the outcomes of previous choices and using that information to update the animal's decision-making strategies dynamically. However, the exact mechanism by which these signals are used to update the value functions or decision-making strategies is not known. Different types of signals identified in the present study, such as the animal's choices and rewards, might contribute to the following aspects of adaptive decision making.

First, in reinforcement learning theory, signals related to the decision maker's previous actions are referred to as the eligibility trace. Such signals can link a reward delivered at a particular time step to the action that caused it when these 2 events are temporally separated (Sutton and Barto 1998). An eligibility trace was not incorporated into the reinforcement learning algorithm we applied to model the animal's choice behavior because, during the matching pennies task, the outcome of a particular action was revealed immediately. Nevertheless, the neural signals related to the eligibility trace might be utilized in more complex tasks involving multistage decision making (Saito et al. 2005; Averbeck et al. 2006; Sohn and Lee 2006). Second, signals related to the rewards in the previous trials might be used to compute an average rate of reward. It has been reported that neurons in the orbitofrontal cortex also encode signals related to rewards in the previous trials (Sugrue et al. 2004). During decision making, information about the average reward rate might be utilized in several ways. For example, in a class of reinforcement learning algorithms referred to as average reward reinforcement learning, the average reward rate is used as a criterion for optimal decision making (Mahadevan 1996). In addition, the same outcome may influence the choices of humans and other animals differently, depending on whether it is considered a gain or a loss (Tinklepaugh 1928; Crespi 1942; Zeaman 1949; Kahneman and Tversky 1979; Flaherty 1982). Therefore, signals related to reward rate may influence the process of decision making by providing a frame of reference (Helson 1948). Information about the average reward rate may also play a role in setting the optimal threshold used to terminate evidence accumulation during perceptual decision making (Simen et al. 2006) or in switching between exploitation and exploration (Aston-Jones and Cohen 2005). Finally, neurons in the DLPFC encoded signals related to the previous choices of the computer opponent, although less often than those related to the animal's previous choices and rewards. During the matching pennies game, the animal was rewarded only when it chose the same target as the computer opponent, so signals related to the computer's previous choices might directly contribute to the process of computing the value functions for alternative choices.
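The first two kinds of signals discussed above can be sketched in a few lines. This is an illustration only: it assumes an accumulating eligibility trace in the style of Sutton and Barto (1998) and an exponentially weighted estimate of the reward rate, and every function name and parameter value is hypothetical rather than taken from the analyses in this study.

```python
def update_eligibility(trace, choice, trace_decay=0.8):
    """Decay the trace of every action, then mark the action just taken."""
    new_trace = [trace_decay * e for e in trace]
    new_trace[choice] += 1.0
    return new_trace

def assign_credit(values, trace, error, learning_rate=0.1):
    """Distribute a (possibly delayed) outcome signal across actions
    in proportion to how recently each action was taken."""
    return [v + learning_rate * error * e for v, e in zip(values, trace)]

def update_average_reward(avg_reward, reward, rate=0.1):
    """Exponentially weighted running estimate of the reward rate."""
    return avg_reward + rate * (reward - avg_reward)

def relative_outcome(reward, avg_reward):
    """Outcome relative to the reference point set by the average reward:
    positive values read as gains, negative values as losses."""
    return reward - avg_reward

if __name__ == "__main__":
    trace = [0.0, 0.0]
    for choice in [0, 1, 1]:          # illustrative action sequence
        trace = update_eligibility(trace, choice)
    values = assign_credit([0.0, 0.0], trace, error=1.0)
    print(trace, values)

    avg = 0.0
    for reward in [1, 0, 1, 1]:       # illustrative reward sequence
        avg = update_average_reward(avg, reward)
    print(avg, relative_outcome(1, avg), relative_outcome(0, avg))
```

In this sketch the same reward is scored as a gain or a loss depending on the running average, which is one simple way a reward-rate signal could act as a frame of reference.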

Signals related to the animal's choices, their outcomes, and the previous choices of the opponent were sometimes multiplexed in a single neuron in the DLPFC. In addition, when these same variables were used as exogenous inputs, the one-dimensional state-space model often provided a parsimonious description of activity in the DLPFC. This raises the possibility that in some neurons, the process of integration might be applied after signals related to multiple variables are combined. Whether these different types of signals are then demultiplexed and utilized for different purposes by separate groups of downstream neurons is not known. In addition, single-cell and network mechanisms for integrating these signals in the prefrontal cortex are not well understood. It has been shown that a recurrent network combined with a reward-dependent stochastic Hebbian learning rule can reproduce the choice behavior observed in monkeys during the matching pennies game (Soltani and Wang 2006; Soltani et al. 2006). However, mechanisms for temporally integrating signals related to these multiple events need to be further investigated in future studies.
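The idea that integration is applied after multiple signals are combined can be made concrete with a minimal sketch of a one-dimensional state-space model driven by three exogenous inputs. The assumed form is x(t+1) = a·x(t) + b·u(t) with the firing rate read out as baseline + c·x(t); the coefficients, the input coding, and the function name are hypothetical and do not reproduce the model fitted in this study.

```python
def simulate_state_space(inputs, a=0.8, b=(0.5, 0.3, 0.2), c=1.0, baseline=5.0):
    """Simulate a first-order (one-dimensional) state-space model.

    inputs: sequence of (animal_choice, reward, computer_choice) per trial,
            an assumed coding of the exogenous inputs.
    The three inputs are first combined into a single drive (b . u) and only
    then integrated by the scalar state x, so a single time constant (set by a)
    governs how long all three signals persist.
    """
    state = 0.0
    rates = []
    for u in inputs:
        rates.append(baseline + c * state)                         # readout for this trial
        state = a * state + sum(w * ui for w, ui in zip(b, u))     # state update
    return rates

if __name__ == "__main__":
    # A single trial with a choice, a reward, and a computer choice coded as 1,
    # followed by trials with no input: the combined signal decays geometrically.
    trials = [(1, 1, 1)] + [(0, 0, 0)] * 4
    print(simulate_state_space(trials))
```

Because the inputs share one state variable, the model needs only one integration parameter for all three signals, which is the sense in which such a description can be more parsimonious than fitting separate lagged terms for each variable.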


    Acknowledgments
 
We are grateful to Lindsay Carr and John Swan-Stone for their technical assistance. This study was supported by a grant from the National Institute of Mental Health (MH073246).

Conflict of Interest: None declared.


    References
Aston-Jones G, Cohen JD. An integrative theory of locus coeruleus-norepinephrine function: adaptive gain and optimal performance. Annu Rev Neurosci (2005) 28:403–450.

Averbeck BB, Sohn J-W, Lee D. Activity in prefrontal cortex during dynamic selection of action sequences. Nat Neurosci (2006) 9:276–282.

Bair W, Zohary E, Newsome WT. Correlated firing in macaque visual area MT: time scales and relationship to behavior. J Neurosci (2001) 21:1676–1697.

Barraclough DJ, Conroy ML, Lee D. Prefrontal cortex and decision making in a mixed-strategy game. Nat Neurosci (2004) 7:404–410.

Corrado GS, Sugrue LP, Seung HS, Newsome WT. Linear-nonlinear-poisson models of primate choice dynamics. J Exp Anal Behav (2005) 84:581–617.

Crespi LP. Quantitative variation of incentive and performance in the white rat. Am J Psychol (1942) 55:467–517.

Daw ND, Doya K. The computational neurobiology of learning and reward. Curr Opin Neurobiol (2006) 16:199–204.

Flaherty CF. Incentive contrast: a review of behavioral changes following shifts in reward. Anim Learn Behav (1982) 10:409–440.

Goldberg ME, Bushnell MC, Bruce CJ. The effect of attentive fixation on eye movements evoked by electrical stimulation of the frontal eye fields. Exp Brain Res (1986) 61:579–584.

Goldman-Rakic PS. Cellular basis of working memory. Neuron (1995) 14:477–485.

Helson H. Adaptation-level as a basis for a quantitative theory of frames of reference. Psychol Rev (1948) 55:297–313.

Kahneman D, Tversky A. Prospect theory: an analysis of decision under risk. Econometrica (1979) 47:263–291.

Lau B, Glimcher PW. Dynamic response-by-response models of matching behavior in rhesus monkeys. J Exp Anal Behav (2005) 84:555–579.

Lee D. Neural basis of quasi-rational decision making. Curr Opin Neurobiol (2006) 16:191–198.

Lee D, Conroy ML, McGreevy BP, Barraclough DJ. Reinforcement learning and decision making in monkeys during a competitive game. Cogn Brain Res (2004) 22:45–58.

Lee D, McGreevy BP, Barraclough DJ. Learning and decision making in monkeys during a rock-paper-scissors game. Cogn Brain Res (2005) 25:416–430.

Lee D, Port NL, Kruse W, Georgopoulos AP. Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. J Neurosci (1998) 18:1161–1170.

Ljung L. System identification: theory for the user. (1999) Upper Saddle River (NJ): Prentice-Hall Inc.

Mahadevan S. Average reward reinforcement learning: foundations, algorithms, and empirical results. Mach Learn (1996) 22:159–195.

Miller EK, Cohen JD. An integrative theory of prefrontal cortex function. Annu Rev Neurosci (2001) 24:167–202.

Pawitan Y. In all likelihood: statistical modelling and inference using likelihood. (2001) Oxford: Oxford University Press.

Saito N, Mushiake H, Sakamoto K, Itoyama Y, Tanji J. Representation of immediate and final behavioral goals in the monkey prefrontal cortex during an instructed delay period. Cereb Cortex (2005) 15:1535–1546.

Samejima K, Ueda Y, Doya K, Kimura M. Representation of action-specific reward values in the striatum. Science (2005) 310:1337–1340.

Schultz W. Behavioral theories and the neurophysiology of reward. Annu Rev Psychol (2006) 57:87–115.

Simen P, Cohen JD, Holmes P. Rapid decision threshold modulation by reward rate in a neural network. Neural Netw (2006) 19:1013–1026.

Sohn J-W, Lee D. Effects of reward expectancy on sequential eye movements in monkeys. Neural Netw (2006) 19:1181–1191.

Soltani A, Lee D, Wang X-J. Neural mechanism for stochastic behaviour during a competitive game. Neural Netw (2006) 19:1075–1090.

Soltani A, Wang X-J. A biophysically based neural model of matching law behavior: melioration by stochastic synapses. J Neurosci (2006) 26:3731–3744.

Sugrue LP, Corrado GS, Newsome WT. Neural correlates of value in orbitofrontal cortex of the rhesus monkey. Program No. 671.8. 2004 Abstract Viewer/Itinerary Planner [Internet]. (2004) Washington (DC): Society for Neuroscience. Available from: http://sfn.scholarone.com/itin2004/.

Sutton RS, Barto AG. Reinforcement learning: an introduction. (1998) Cambridge (MA): MIT Press.

Tinklepaugh OL. An experimental study of representative factors in monkeys. J Comp Psychol (1928) 8:197–236.

von Neumann J, Morgenstern O. Theory of games and economic behavior. (1944) Princeton (NJ): Princeton University Press.

Zeaman D. Response latency as a function of the amount of reinforcement. J Exp Psychol (1949) 39:466–483.




This article has been cited by other articles:

J. L. Pardo-Vazquez, V. Leboran, and C. Acuna
A role for the ventral premotor cortex beyond performance monitoring
PNAS, November 3, 2009; 106(44): 18815 - 18819.

K. Wunderlich, A. Rangel, and J. P. O'Doherty
Neural computations underlying action-based decision making in the human brain
PNAS, October 6, 2009; 106(40): 17199 - 17204.

C.-H. Luk and J. D. Wallis
Dynamic Encoding of Responses and Outcomes by Neurons in Medial Prefrontal Cortex
J. Neurosci., June 10, 2009; 29(23): 7526 - 7539.

H. Seo, D. J. Barraclough, and D. Lee
Lateral Intraparietal Cortex and Reinforcement Learning during a Mixed-Strategy Game
J. Neurosci., June 3, 2009; 29(22): 7278 - 7289.

H. Seo and D. Lee
Behavioral and Neural Changes after Gains and Losses of Conditioned Reinforcers
J. Neurosci., March 18, 2009; 29(11): 3627 - 3641.

S. Tsujimoto, A. Genovesio, and S. P. Wise
Monkey Orbitofrontal Cortex Encodes Response Choices Near Feedback Time
J. Neurosci., February 25, 2009; 29(8): 2569 - 2574.

H. Seo and D. Lee
Cortical mechanisms for reinforcement learning in competitive games
Phil Trans R Soc B, December 12, 2008; 363(1511): 3845 - 3857.

J.-W. Sohn and D. Lee
Order-Dependent Modulation of Directional Signals in the Supplementary and Presupplementary Motor Areas
J. Neurosci., December 12, 2007; 27(50): 13655 - 13666.

Y. B. Kim, N. Huh, H. Lee, E. H. Baeg, D. Lee, and M. W. Jung
Encoding of Action History in the Rat Ventral Striatum
J Neurophysiol, December 1, 2007; 98(6): 3548 - 3556.
