TY - JOUR
T1 - Empirical evaluation on discounted Thompson sampling for multi-armed bandit problem with piecewise-stationary Bernoulli arms
AU - Asyuraa, F. C.
AU - Abdullah, S.
AU - Sutanto, T. E.
N1 - Funding Information:
This research was funded by the Directorate of Research and Development of Universitas Indonesia (DRPM UI) under the Publikasi Terindeks Internasional (PUTI) Prosiding 2020 grant No. NKB-977/UN2.RST/HKP.05.00/2020. The authors acknowledge assistance and encouragement from colleagues, the work of technical staff, and financial support from the Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia.
Publisher Copyright:
© 2021 Institute of Physics Publishing. All rights reserved.
PY - 2021/1/7
Y1 - 2021/1/7
N2 - The Multi-Armed Bandit problem is a problem in reinforcement learning that focuses on how to solve the exploration-exploitation dilemma. The dilemma is, given a set of options (known as arms) that can be tried many times, how to balance gathering information by experimenting with the available arms (exploration) against maximizing profit by choosing the seemingly best arm at the time (exploitation). The Multi-Armed Bandit problem is centered around determining which arm to choose at every round. It has gained popularity as a more dynamic approach to randomized trials, with the goal of experimenting with each available arm while still maximizing the profit gained. A real-life example of a Multi-Armed Bandit is determining which film artwork should be shown to a visitor so as to attract the visitor to watch that particular film. A Bernoulli distribution with parameter θ is chosen to model the visitor's response after seeing the artwork. A non-stationary condition on θ can be imposed to accommodate changing trends in film artworks: some artworks might be preferred in a certain month but not in the next. The non-stationary condition in this study is modeled as piecewise-stationary. We implemented a discounted Thompson sampling policy that uses a Bayesian method to determine which arm to choose at each round. Multiple simulations were conducted to empirically test the policy's performance under various conditions. Evaluation was based on cumulative regret. Based on these simulations, the discounted Thompson sampling policy achieved relatively lower cumulative regret under both stationary and piecewise-stationary conditions, compared to well-known policies such as Epsilon Greedy, SoftMax, Upper Confidence Bound, and Thompson Sampling.
AB - The Multi-Armed Bandit problem is a problem in reinforcement learning that focuses on how to solve the exploration-exploitation dilemma. The dilemma is, given a set of options (known as arms) that can be tried many times, how to balance gathering information by experimenting with the available arms (exploration) against maximizing profit by choosing the seemingly best arm at the time (exploitation). The Multi-Armed Bandit problem is centered around determining which arm to choose at every round. It has gained popularity as a more dynamic approach to randomized trials, with the goal of experimenting with each available arm while still maximizing the profit gained. A real-life example of a Multi-Armed Bandit is determining which film artwork should be shown to a visitor so as to attract the visitor to watch that particular film. A Bernoulli distribution with parameter θ is chosen to model the visitor's response after seeing the artwork. A non-stationary condition on θ can be imposed to accommodate changing trends in film artworks: some artworks might be preferred in a certain month but not in the next. The non-stationary condition in this study is modeled as piecewise-stationary. We implemented a discounted Thompson sampling policy that uses a Bayesian method to determine which arm to choose at each round. Multiple simulations were conducted to empirically test the policy's performance under various conditions. Evaluation was based on cumulative regret. Based on these simulations, the discounted Thompson sampling policy achieved relatively lower cumulative regret under both stationary and piecewise-stationary conditions, compared to well-known policies such as Epsilon Greedy, SoftMax, Upper Confidence Bound, and Thompson Sampling.
UR - http://www.scopus.com/inward/record.url?scp=85100778197&partnerID=8YFLogxK
U2 - 10.1088/1742-6596/1722/1/012096
DO - 10.1088/1742-6596/1722/1/012096
M3 - Conference article
AN - SCOPUS:85100778197
SN - 1742-6588
VL - 1722
JO - Journal of Physics: Conference Series
JF - Journal of Physics: Conference Series
IS - 1
M1 - 012096
T2 - 10th International Conference and Workshop on High Dimensional Data Analysis, ICW-HDDA 2020
Y2 - 12 October 2020 through 15 October 2020
ER -
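
Note on the method described in the abstract: the record does not spell out the update rule of the discounted Thompson sampling policy. The following is a minimal sketch, assuming the common variant in which each arm's Beta-posterior counts are decayed by a discount factor gamma every round; the function name, parameter values, and change-point setup below are illustrative assumptions, not the authors' exact experimental configuration.

# Hypothetical sketch: discounted Thompson sampling on piecewise-stationary
# Bernoulli arms, evaluated by cumulative regret (assumed setup, not the
# authors' exact simulation design).
import numpy as np

def simulate_dts(theta_segments, horizon, gamma=0.95, alpha0=1.0, beta0=1.0, seed=0):
    """Run discounted Thompson sampling and return cumulative regret per round.

    theta_segments: list of (start_round, [theta per arm]) tuples defining the
        piecewise-stationary Bernoulli success probabilities.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(theta_segments[0][1])
    successes = np.zeros(n_arms)   # discounted success counts S_i
    failures = np.zeros(n_arms)    # discounted failure counts F_i
    cumulative_regret = np.zeros(horizon)
    regret = 0.0

    for t in range(horizon):
        # True arm means for the current segment (piecewise-stationary theta).
        theta = next(th for start, th in reversed(theta_segments) if t >= start)

        # Sample an index for each arm from its discounted Beta posterior.
        samples = rng.beta(successes + alpha0, failures + beta0)
        arm = int(np.argmax(samples))

        # Observe a Bernoulli reward from the chosen arm.
        reward = rng.binomial(1, theta[arm])

        # Discount all counts, then update the played arm's counts.
        successes *= gamma
        failures *= gamma
        successes[arm] += reward
        failures[arm] += 1 - reward

        # Cumulative regret: sum over rounds of (best arm mean - chosen arm mean).
        regret += max(theta) - theta[arm]
        cumulative_regret[t] = regret

    return cumulative_regret

# Example usage: three arms whose best arm switches at round 500 (one change point).
if __name__ == "__main__":
    segments = [(0, [0.7, 0.5, 0.3]), (500, [0.3, 0.5, 0.7])]
    print(simulate_dts(segments, horizon=1000)[-1])

The discount factor gamma < 1 gradually forgets old observations, which is what lets the policy re-explore and track a new best arm after a change point; with gamma = 1 the sketch reduces to standard Thompson sampling.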