Speaker: Tian Xu

Reference: Ménard, Pierre, et al. “UCB Momentum Q-learning: Correcting the bias without forgetting.” ICML, 2021.

Sparkling points: This paper uses a momentum term to correct the bias that Q-learning suffers from while keeping the variance low. The resulting algorithm, UCB Momentum Q-learning (UCB-MQ), achieves the minimax optimal regret bound. As a by-product, UCB-MQ is the first method whose second-order regret term scales only linearly with the size of the state space.
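
To make the two regret claims concrete, the bound reported in the paper's abstract has roughly the following form. The notation is assumed from the paper rather than from this announcement: H is the episode length, S the number of states, A the number of actions, T the number of episodes, and the tilde hides polylogarithmic factors. Treat this as a summary sketch and check the paper for the exact statement and conditions.

```latex
% Regret of UCB-MQ after T episodes (logarithmic factors hidden).
% First-order term: matches the lower bound \Omega(\sqrt{H^3 S A T}) for large enough T.
% Second-order term: independent of T and linear in S.
\[
  \mathrm{Regret}(T)
  \;=\;
  \widetilde{O}\!\Bigl(
    \underbrace{\sqrt{H^{3} S A T}}_{\text{first-order term}}
    \;+\;
    \underbrace{H^{4} S A}_{\text{second-order term}}
  \Bigr)
\]
```

The first term dominates once T is large, which is the sense in which the bound is minimax optimal; the second term is the one that grows only linearly in S.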