Reinforcement Learning
Prof. Rafal Bogacz
Psychologically, habits are defined as reward-independent stimulus-response associations that form when identical actions are repeated often. These behaviours were originally studied in animals, and Hardwick et al. (2019) recently developed a paradigm to identify habits in humans. They hypothesised that human habits may be detected when participants are forced to act too quickly for conscious (goal-directed) control to be applied. They trained participants extensively on a stimulus-response mapping, and then reversed the mapping. When participants were tested post-reversal, their behaviour depended on how rapidly they needed to react: participants made more ‘habitual’ errors, i.e., chose the original response, when forced to respond within 300–600 ms. Hardwick et al. proposed that parallel accumulators were responsible, wherein the goal-directed system is initiated after a delay. However, no formal mathematical model exists that instantiates this proposal and allows for multiple drift rates which change both across trials (via reinforcement learning) and within trials (parallel accumulators). In this paper, we present a novel 2-drift race model and calculate the probability of reaction times and choices so that it can be efficiently fitted to data from the paradigm of Hardwick et al. To test their proposal, we compare the quality of fit of a single-drift Q-learning race model with that of our model, in which habitual and goal-directed actions accumulate independently. Furthermore, the best-fitting parameters of the 2-drift model provide several key insights into, and quantifiable measures of, the mechanistic structure underlying differences in individuals’ reliance on habits that are undetectable in behaviour alone.
This is an in-person presentation on July 21, 2024 (11:40 ~ 12:00 CEST).
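The core mechanism of the abstract above, a habitual accumulator that starts immediately while a goal-directed accumulator only starts after a delay, can be sketched as a simple stochastic simulation. This is an illustrative sketch only; all parameter names and values are assumptions, not the fitted model from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def race_trial(v_habit, v_goal, goal_delay, threshold=1.0,
               noise=1.0, dt=0.001, max_t=2.0):
    """One trial of a 2-drift race: the habitual accumulator starts at t = 0,
    the goal-directed accumulator only after goal_delay seconds. The first
    accumulator to reach the threshold determines choice and reaction time."""
    x_habit = x_goal = 0.0
    t = 0.0
    while t < max_t:
        x_habit += v_habit * dt + noise * np.sqrt(dt) * rng.standard_normal()
        if t >= goal_delay:
            x_goal += v_goal * dt + noise * np.sqrt(dt) * rng.standard_normal()
        t += dt
        if x_habit >= threshold:
            return "habit", t
        if x_goal >= threshold:
            return "goal", t
    return "none", max_t   # no boundary crossed before the deadline
```

Under a tight response deadline (e.g. `max_t` near 0.4 s) the delayed goal-directed accumulator rarely wins, reproducing habitual errors at 300–600 ms; in the actual model the drift rates would additionally be updated across trials by reinforcement learning.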
Niek Stevenson
Prof. Birte Forstmann
Prof. Andrew Heathcote
Humans evolved in non-stationary environments, which require behavior to adapt continuously to change. Decision making, on the other hand, is often studied in highly controlled experimental paradigms where the environment is mostly stationary. We propose that adaptive mechanisms continue to act in stationary environments and cause systematic fluctuations in performance from trial to trial. We develop and test a set of formal decision-making models that embrace these adaptive mechanisms. Specifically, participants use reinforcement learning to estimate both their own performance and the statistical structure of the stimuli on which decisions are based. These estimates subsequently influence evidence-accumulation-model parameters. In four datasets, we show that these mechanisms can explain post-error slowing and stimulus-sequence-related choice biases, respectively. We argue that including adaptive mechanisms in evidence-accumulation models is a promising way forward to understanding not only how choice behavior changes across time, but also why it changes.
This is an in-person presentation on July 21, 2024 (12:00 ~ 12:20 CEST).
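As an illustration of the kind of mechanism proposed above, the sketch below lets a learner track its own error rate with a delta rule and raise an evidence-accumulation threshold when recent errors are frequent, which produces post-error slowing. All names and parameter values here are hypothetical, not the authors' fitted models:

```python
def adapt_threshold(outcomes, b0=1.0, alpha=0.2, scale=0.5):
    """Track the running error rate with a delta rule and return the
    accumulation threshold used on each trial; the threshold rises on
    the trials following errors (post-error slowing)."""
    err_est = 0.0
    thresholds = []
    for correct in outcomes:
        thresholds.append(b0 + scale * err_est)   # threshold this trial
        # delta-rule update of the estimated error rate
        err_est += alpha * ((0.0 if correct else 1.0) - err_est)
    return thresholds
```

On the sequence correct, error, correct the threshold is at baseline for the first two trials and elevated on the trial after the error, i.e. responding slows precisely where post-error slowing is observed.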
Prof. Bettina von Helversen
Mrs. Ann Katrin Hosch
Prof. Lars Hornuf
In situations demanding loss avoidance or gain maximization, individuals must possess a profound understanding of the rules and regularities of their environment. However, exploration behavior varies across such scenarios, and past research has been inconclusive regarding the impact of a loss domain compared to a gain domain, particularly when exploration involves potential costs. The current project centers on scenarios where subjects receive positive or negative rewards while exploring the environment or exploiting their knowledge. Participants engaged in a Multi-Armed Bandit task under three conditions: only gains in the environment, only losses in the environment, and a mixed condition involving both gains and losses. Notably, participants exhibited reduced exploration in the gain domain compared to the loss domain, with the mixed domain falling in between. Interestingly, participants performed best in the mixed domain. Computational modeling of participants' choice behavior revealed that individuals tend to underestimate the outcomes of unchosen options in the gain domain and overestimate them in the loss domain. This pattern of findings could be attributed to the effects of absolute gains and losses, or to outcomes being relatively better or worse than initial expectations.
This is an in-person presentation on July 21, 2024 (12:20 ~ 12:40 CEST).
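One way to capture the asymmetry reported above (underestimating unchosen options in the gain domain, overestimating them in the loss domain) is a delta-rule bandit learner whose unchosen options drift toward a prior value `q0` that sits below the true outcomes in the gain condition and above them in the loss condition. A minimal sketch, with all parameter values assumed for illustration rather than taken from the authors' fits:

```python
import numpy as np

rng = np.random.default_rng(1)

def run_bandit(means, n_trials=200, alpha=0.3, beta=3.0, decay=0.1, q0=0.0):
    """Delta-rule learner on a multi-armed bandit. Unchosen options decay
    toward q0: a q0 below the true means mimics underestimation (gain
    domain), a q0 above them mimics overestimation (loss domain)."""
    q = np.full(len(means), q0, dtype=float)
    choices = []
    for _ in range(n_trials):
        p = np.exp(beta * q)
        p /= p.sum()                                # softmax choice rule
        a = rng.choice(len(means), p=p)
        r = rng.normal(means[a], 1.0)               # noisy reward
        q[a] += alpha * (r - q[a])                  # delta-rule update
        unchosen = np.arange(len(means)) != a
        q[unchosen] += decay * (q0 - q[unchosen])   # drift toward prior
        choices.append(int(a))
    return q, choices
```

Because unchosen options are pulled toward a pessimistic prior in the gain domain, their values stay low and the learner explores less, matching the reduced exploration observed for gains.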
Kristin Witte
Dr. Eric Schulz
The explore-exploit dilemma is ubiquitous in everyday life: Should you go to the cafeteria again or try out the new restaurant around the corner? Researchers have proposed three strategies for how humans solve this dilemma: value-guided exploration, directed exploration, and Thompson sampling, all of which can be inferred using computational modeling. Behavioral research conducted over the last two decades suggests that people use a mixture of all three strategies to solve the explore-exploit dilemma. We collected responses from 200 participants (after exclusions) to a set of three commonly used few-armed bandit tasks before and after a six-week period to examine the reliability and validity of the three strategies. The currently accumulating results can be summarized as follows: First, identifying all three exploration strategies is not possible in these relatively simple bandit tasks, because the strategies are too highly correlated with each other. Second, not every task motivates exploration to the same extent, introducing potential task-based variability into the measurement of the remaining two strategies. To remove this task-based variability from the measurement, we present an attempt to extract higher-order factors of value-guided and directed exploration using the responses from all three few-armed bandit tasks. Third, we contrast the retest reliability, convergent validity, and external validity of these latent factors with those of task-based performance measures. While behavior can be measured reliably in all three tasks and is correlated across tasks, the reliability of the model parameters is lower. The implications for the importance of exploration as an explanatory cognitive construct are discussed.
This is an in-person presentation on July 21, 2024 (12:40 ~ 13:00 CEST).
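The three strategies named in the abstract above can be made concrete on a Gaussian bandit with conjugate posterior updates (a common formalisation, assumed here for illustration rather than taken from the authors' models): value-guided exploration picks the highest posterior mean, directed exploration adds an uncertainty bonus, and Thompson sampling draws one sample from each arm's posterior.

```python
import numpy as np

rng = np.random.default_rng(2)

# Posterior over each arm's mean reward, assuming Gaussian rewards with
# known noise variance.
mu = np.zeros(3)          # posterior means
var = np.full(3, 4.0)     # posterior variances
noise_var = 1.0

def choose(strategy, bonus=1.0):
    if strategy == "value":      # value-guided: exploit the posterior mean
        return int(np.argmax(mu))
    if strategy == "directed":   # directed: add an uncertainty bonus (UCB-style)
        return int(np.argmax(mu + bonus * np.sqrt(var)))
    if strategy == "thompson":   # Thompson: sample a mean from each posterior
        return int(np.argmax(rng.normal(mu, np.sqrt(var))))
    raise ValueError(strategy)

def update(arm, reward):
    """Conjugate Gaussian update of the chosen arm's posterior."""
    k = var[arm] / (var[arm] + noise_var)
    mu[arm] += k * (reward - mu[arm])
    var[arm] *= 1.0 - k
```

The identifiability problem the abstract describes is visible even in this toy setup: when posterior variances are similar across arms, all three rules tend to rank the arms the same way, so their choices, and hence their fitted parameters, are highly correlated.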