Society for Mathematical Psychology

MathPsych/ICCM 2021 Recinormal Learning

...

University College London ~ Experimental Psychology


Multi-armed bandits are a useful paradigm for studying how people balance exploration (learning about the value of options) and exploitation (choosing options with known high value). When options are distinguished by features predictive of reward, exploration aids generalization of experience to unknown options. The present study builds on our earlier work on human exploration and generalization in a feature-based bandit task (Stojic et al., 2020). Here, I present results from a new experiment in which novel options are introduced regularly in three different environments: options provide only rewards (gain), only punishments (loss), or either rewards or punishments (mixed). Options were represented by randomly generated tree-like shapes, with features determining the angle and width of branches. The value of each option was a nonlinear function of its features. Regardless of the environment, people were quite good at choosing the best option. When a novel option was first encountered, whether it was chosen depended on its relative value, indicative of successful function generalization. Exploration of novel options was generally greater in the loss environment than in the other environments. Computational modelling provides further insights into these results. We contrast a model that employs function learning through Gaussian Process regression with a new model that learns the value of options through a hierarchical Bayesian filter. Both models can employ a Bayesian mechanism to allow for asymmetric learning rates for positive vs negative reward prediction errors. Some evidence for such asymmetric learning is found.
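To make the idea of asymmetric learning rates concrete, the following is a minimal sketch using a simple delta rule with separate learning rates for positive and negative reward prediction errors. It is a deliberate simplification for illustration, not the Bayesian mechanism used in the models described above; the function names and the parameters `lr_pos` and `lr_neg` are hypothetical.

```python
def asymmetric_update(value, reward, lr_pos=0.3, lr_neg=0.1):
    """One delta-rule update of an option's estimated value.

    lr_pos and lr_neg are hypothetical learning rates applied to
    positive and negative reward prediction errors, respectively.
    """
    error = reward - value  # reward prediction error
    lr = lr_pos if error >= 0 else lr_neg
    return value + lr * error


def learn(rewards, initial=0.0, lr_pos=0.3, lr_neg=0.1):
    """Apply the asymmetric update across a sequence of rewards."""
    value = initial
    for reward in rewards:
        value = asymmetric_update(value, reward, lr_pos, lr_neg)
    return value


if __name__ == "__main__":
    # A learner that weights good news more than bad news ends up
    # with a higher estimate than one with the reverse asymmetry.
    outcomes = [1.0, -1.0, 1.0, -1.0]
    print(learn(outcomes, lr_pos=0.5, lr_neg=0.1))
    print(learn(outcomes, lr_pos=0.1, lr_neg=0.5))
```

With equal learning rates the rule reduces to a standard delta rule, so comparing fits with and without the constraint `lr_pos == lr_neg` is one simple way to ask whether learning is asymmetric.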

Functional generalization and asymmetric learning in a feature-based bandit task
