Observational Learning of Exploration-Exploitation Strategies in Bandit Tasks
Situations that require balancing exploration and exploitation are ubiquitous, and in such situations humans frequently have the opportunity to observe others. Participants performed restless nine-armed bandit tasks, either on their own or while observing the choices of fictitious agents that performed equally well but differed in their tendency to explore. We fitted participants' data with variants of a Bayesian Mean Tracker model. In these models, individual choice probabilities are computed from the expected values of all options using a softmax function, in which random exploration is implemented as a temperature parameter, while directed exploration biases the expected values towards especially uncertain, informative options. We implemented copying in two different ways: the unconditional copying model assumes that participants copy the observed agent with a fixed probability, independent of their subjective value estimates, whereas in the copy-when-uncertain model the probability of copying depends on the entropy of the value estimates across all options. Our results indicate that the copy-when-uncertain model accounts for participants' data better than the unconditional copying model. Participants use observational learning directly, i.e., they imitate specific choices, but they also adjust their individual exploration strategy towards that of the observed agents.
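The abstract specifies the choice rule and the two copying mechanisms only verbally. Below is a minimal sketch of how they could be implemented, assuming a Bayesian Mean Tracker that supplies posterior means and variances per arm, a softmax temperature `tau`, a directed-exploration bonus weighted by `beta`, and hypothetical copying parameters `rho` and `gamma` (none of these names come from the paper); the entropy of the softmax choice distribution is used here as a proxy for the entropy of the value estimates.

```python
import numpy as np

def choice_probabilities(means, variances, beta, tau):
    """Softmax over uncertainty-biased values.

    means, variances: posterior estimates per arm from a Bayesian
    Mean Tracker. beta scales the directed-exploration bonus; tau is
    the softmax temperature (random exploration). Illustrative only.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    values = means + beta * np.sqrt(variances)  # uncertainty bonus
    logits = values / tau
    logits = logits - logits.max()              # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def copy_prob_unconditional(rho):
    # Unconditional copying: imitate the observed agent with a fixed
    # probability rho, independent of the value estimates.
    return rho

def copy_prob_when_uncertain(p_choice, gamma):
    # Copy-when-uncertain: copying probability grows with the entropy
    # of the model's own choice distribution (hypothetical scaling
    # parameter gamma; capped at 1).
    entropy = -np.sum(p_choice * np.log(p_choice + 1e-12))
    max_entropy = np.log(len(p_choice))
    return min(1.0, gamma * entropy / max_entropy)

# Example: with equal means, the more uncertain arm gets a higher
# choice probability, and high entropy raises the copying probability.
p = choice_probabilities(means=[0.0, 0.0, 0.0],
                         variances=[1.0, 1.0, 4.0],
                         beta=0.5, tau=0.2)
print(p, copy_prob_when_uncertain(p, gamma=0.8))
```

In this sketch, copying becomes more likely exactly when the learner's value estimates discriminate poorly between options, which is the qualitative behaviour the copy-when-uncertain model describes.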