New study showing evidence in the brain for temporal difference (TD) learning, an online reinforcement learning algorithm that combines Monte Carlo-style sampling of experience with the bootstrapping used in dynamic programming.
Introductory article (READ THIS ONE FIRST)
TD algorithm, & electrophysiology in awake behaving monkeys
Schultz W, Dayan P, Montague PR. A neural substrate of prediction and reward. Science 275: 1593-1599 (1997).
fMRI data for rewarding stimuli
O’Doherty J, Dayan P, Friston K, Critchley H, Dolan R. Temporal difference models and reward-related learning in the human brain. Neuron 38: 329-337 (2003).
fMRI data for aversive stimuli
A BOLD fMRI signal resembling a TD prediction error is seen in the ventral putamen. The study uses a neat “higher-order” learning task in which the first of two cues is a good, but not perfect, predictor of how painful the stimulus will be; the second cue removes the remaining uncertainty.
Seymour B, O’Doherty JP, Dayan P, Koltzenburg M, Jones AK, Dolan RJ, Friston KJ, Frackowiak RS. Temporal difference models describe higher-order learning in humans. Nature 429: 664-667 (10 June 2004).
ABSTRACT: The ability to use environmental stimuli to predict impending harm is critical for survival. Such predictions should be available as early as they are reliable. In pavlovian conditioning, chains of successively earlier predictors are studied in terms of higher-order relationships, and have inspired computational theories such as temporal difference learning [1]. However, there is at present no adequate neurobiological account of how this learning occurs. Here, in a functional magnetic resonance imaging (fMRI) study of higher-order aversive conditioning, we describe a key computational strategy that humans use to learn predictions about pain. We show that neural activity in the ventral striatum and the anterior insula displays a marked correspondence to the signals for sequential learning predicted by temporal difference models. This result reveals a flexible aversive learning process ideally suited to the changing and uncertain nature of real-world environments. Taken with existing data on reward learning [2], our results suggest a critical role for the ventral striatum in integrating complex appetitive and aversive predictions to coordinate behaviour.
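For readers who want to see what a TD error looks like in a two-cue task of this kind, here is a minimal Python sketch. It is not the authors' model or analysis code. The cue names (A, B, C, D), the 80% cue-to-cue consistency, the learning rate, and the pain values of 1 and 0 are all assumptions chosen for illustration. Each trial shows a first cue, then a second cue that fully determines the outcome, and a TD(0) update is applied at each step.

```python
import random


def simulate_td(n_trials=4000, alpha=0.05, gamma=1.0, p_consistent=0.8, seed=0):
    """Tabular TD(0) on a hypothetical two-cue aversive conditioning task.

    All parameter values are illustrative assumptions, not taken from the paper.
    Returns the learned value (predicted pain) of each cue.
    """
    rng = random.Random(seed)
    V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0}  # predicted pain per cue
    pain = {"B": 1.0, "D": 0.0}  # the second cue fully determines the outcome

    for _ in range(n_trials):
        # First cue: A usually leads to B (high pain), C usually leads to D (low pain).
        first = rng.choice(["A", "C"])
        likely, unlikely = ("B", "D") if first == "A" else ("D", "B")
        second = likely if rng.random() < p_consistent else unlikely

        # TD error when the second cue appears: no outcome yet, so the
        # immediate term is 0 and the target is the second cue's value.
        delta_cue = 0.0 + gamma * V[second] - V[first]
        V[first] += alpha * delta_cue

        # TD error at the outcome: the trial ends, so the next value is 0.
        delta_outcome = pain[second] + gamma * 0.0 - V[second]
        V[second] += alpha * delta_outcome

    return V


if __name__ == "__main__":
    for cue, value in simulate_td().items():
        print(cue, round(value, 3))
```

Under these assumptions the first cue never sees the pain directly. Its value is learned only by bootstrapping from the value of the second cue, which is the higher-order relationship the task is built to probe. The value of A settles near the probability that high pain follows it, and an unexpected second cue (B after C, say) produces a large TD error at the moment that cue appears.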