Finite-time analysis of the multiarmed bandit problem

Citations: 5.0k
Mendeley readers: 1.5k


Abstract

Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss incurred because the globally optimal policy is not followed at all times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies that asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
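Among the simple policies analyzed in the paper is UCB1, a deterministic index rule that plays the arm maximizing its empirical mean plus a confidence term growing with the logarithm of the total number of plays. The sketch below is an illustrative implementation of this kind of index policy, assuming rewards bounded in [0, 1]; the `pull` callback and the Bernoulli arms in the usage example are hypothetical and not taken from the article.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """UCB1-style index policy: at each step play the arm maximizing
    empirical mean + sqrt(2 ln t / n_j), where n_j is the number of
    times arm j has been played.  `pull(j)` is assumed to return a
    reward in [0, 1] for arm j (hypothetical interface)."""
    counts = [0] * n_arms      # plays of each arm
    means = [0.0] * n_arms     # empirical mean reward of each arm
    total_reward = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1        # play each arm once to initialize the indices
        else:
            arm = max(range(n_arms),
                      key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
        total_reward += r
    return total_reward

# Usage example: three Bernoulli arms with success probabilities 0.3, 0.5, 0.7.
if __name__ == "__main__":
    probs = [0.3, 0.5, 0.7]
    horizon = 10_000
    reward = ucb1(lambda j: float(random.random() < probs[j]),
                  n_arms=len(probs), horizon=horizon)
    print(f"average reward over {horizon} plays: {reward / horizon:.3f}")
```

With rewards bounded in [0, 1], the paper shows that index policies of this type keep the expected regret logarithmic in the number of plays uniformly over time, not only asymptotically.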


Citation (APA)

Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3), 235–256. https://doi.org/10.1023/A:1013689704352

Readers' Seniority

PhD / Post grad / Masters / Doc: 804 (77%)
Researcher: 154 (15%)
Professor / Associate Prof.: 73 (7%)
Lecturer / Post doc: 11 (1%)

Readers' Discipline

Computer Science: 692 (69%)
Engineering: 202 (20%)
Mathematics: 75 (7%)
Business, Management and Accounting: 38 (4%)

Article Metrics

Blog Mentions: 1
News Mentions: 2
References: 11
