Slides

Short Abstract: We study risk-sensitive reinforcement learning in episodic Markov decision processes with unknown transition kernels, where the goal is to optimize the total reward under the risk measure of exponential utility. We propose two provably efficient model-free algorithms: Risk-Sensitive Value Iteration (RSVI) and Risk-Sensitive Q-learning (RSQ). We also establish a regret lower bound showing that the exponential dependence on the risk parameter and the horizon is unavoidable for any algorithm with sublinear regret, thereby certifying the near-optimality of the proposed algorithms.
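To make the objective concrete, below is a minimal sketch (not the authors' implementation) of the exponential-utility risk measure and the risk-sensitive Bellman backup that RSVI-style algorithms build on, here run on a known tabular MDP for illustration. The variable names (beta, P, r, H) and the random test MDP are assumptions for this sketch; the actual RSVI algorithm additionally injects exploration bonuses into this backup to handle unknown transitions.

```python
import numpy as np

def entropic_risk(returns, beta):
    """Exponential-utility value (1/beta) * log E[exp(beta * R)].

    beta > 0 is risk-seeking, beta < 0 is risk-averse, and as beta -> 0
    the measure recovers the risk-neutral expectation E[R]. beta != 0 assumed.
    """
    x = beta * np.asarray(returns, dtype=float)
    m = x.max()  # shift for a numerically stable log-mean-exp
    return (m + np.log(np.mean(np.exp(x - m)))) / beta

def risk_sensitive_vi(P, r, H, beta):
    """Backward induction with the exponential Bellman equation
    Q_h(s,a) = r(s,a) + (1/beta) log sum_{s'} P(s'|s,a) exp(beta * V_{h+1}(s')).

    P: (S, A, S) transition kernel, r: (S, A) rewards, H: horizon.
    Returns the greedy values V_h(s) for every stage h.
    """
    S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        # expectation of exp(beta * next value) under P, shape (S, A)
        Q = r + np.log(P @ np.exp(beta * V[h + 1])) / beta
        V[h] = Q.max(axis=1)  # greedy over actions maximizes the utility value
    return V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, H, beta = 4, 2, 5, -0.5  # beta < 0: risk-averse
    P = rng.dirichlet(np.ones(S), size=(S, A))  # random transition kernel
    r = rng.uniform(size=(S, A))
    print(risk_sensitive_vi(P, r, H, beta)[0])
```

Note that because the reward is folded inside a log-sum-exp over next states, errors in the estimated transitions are amplified by a factor of roughly exp(|beta| H), which is the intuition behind the exponential lower-bound dependence mentioned in the abstract.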

References:
1) Fei, Yang, Chen, Wang, Xie. Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret. https://arxiv.org/abs/2006.13827
2) Jin, Yang, Wang, Jordan. Provably Efficient Reinforcement Learning with Linear Function Approximation. http://proceedings.mlr.press/v125/jin20a/jin20a.pdf