Institutional Scholarship

Delayed Rewards and Dynamic States in the Multi-Armed Bandit Problem


dc.contributor.advisor Parameswaran, Giri
dc.contributor.author Newman, David
dc.date.accessioned 2019-09-01T21:48:17Z
dc.date.available 2019-09-01T21:48:17Z
dc.date.issued 2019
dc.description.abstract This paper explores a variation of the multi-armed bandit problem and proposes a new strategy, the Pure Unknown strategy, to maximize payoffs. In this game, the player chooses between two arms, one with known probability distributions and the other with unknown probability distributions, and does not realize the payoff of the arm she chooses until the next time period. The probabilities in each time period are state dependent, and the states are determined by a stochastic process. Additionally, elements of ambiguity aversion are incorporated into the model to reflect individuals’ preferences for choices with known probabilities over those with unknown probabilities. Four strategies, including the Pure Unknown strategy, play this game to see which produces the highest average payoff; the other three strategies are inspired by previous multi-armed bandit literature. The results show that the Pure Unknown strategy yields the highest average payoff when the Normality assumption, which ultimately represents ambiguity-averse preferences, is not present.
dc.description.sponsorship Haverford College. Department of Economics
dc.language.iso eng
dc.subject.lcsh Stochastic processes
dc.subject.lcsh Mathematical optimization
dc.subject.lcsh Resource allocation -- Mathematical models
dc.title Delayed Rewards and Dynamic States in the Multi-Armed Bandit Problem
dc.type Thesis
dc.rights.access Open Access
