Short Abstract: We study the exploration problem in episodic reinforcement learning with approximate linear action-value functions, under the notion of low inherent Bellman error, a condition normally employed to show convergence of approximate value iteration. This framework is strictly more general than the low-rank (or linear) MDP assumption of prior work, and we provide an algorithm with a near-optimal regret bound.
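
For readers unfamiliar with the term, the following is a sketch of how the inherent Bellman error is usually defined for a finite-horizon problem with linear action-value functions. The notation is an assumption made for illustration and is not taken from the abstract: \phi_t is the feature map at step t, \mathcal{B}_t is the set of admissible parameters, \mathcal{T}_t is the Bellman optimality operator, r_t and p_t are the reward and transition kernel, and H is the horizon.

```latex
% Linear action-value functions at step t (assumed notation):
%   Q_t(\theta)(s,a) = \phi_t(s,a)^\top \theta, \qquad \theta \in \mathcal{B}_t .
%
% Bellman optimality operator applied to a step-(t+1) function:
\[
  (\mathcal{T}_t Q_{t+1})(s,a)
    = r_t(s,a) + \mathbb{E}_{s' \sim p_t(\cdot \mid s,a)}
      \Big[ \max_{a'} Q_{t+1}(s',a') \Big].
\]
% Inherent Bellman error: the worst-case error of the best linear fit
% to the Bellman backup of any admissible linear function.
\[
  \mathcal{I}
    = \max_{1 \le t \le H}\;
      \sup_{\theta_{t+1} \in \mathcal{B}_{t+1}}\;
      \inf_{\theta_t \in \mathcal{B}_t}\;
      \sup_{(s,a)}\;
      \Big| \phi_t(s,a)^\top \theta_t
            - \big(\mathcal{T}_t Q_{t+1}(\theta_{t+1})\big)(s,a) \Big| .
\]
```

Under this definition, a low-rank (linear) MDP has \mathcal{I} = 0, because the Bellman backup of any linear function is again linear in the features. The converse need not hold, which is the sense in which the low inherent Bellman error condition is more general.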

Reference: 1) https://arxiv.org/pdf/2003.00153.pdf