Short Abstract: We study the exploration problem in reinforcement learning when reward feedback is absent (reward-free exploration). We present an algorithm, RF-Express, that achieves nearly minimax-optimal sample complexity in episodic non-stationary MDPs.

Reference: Fast active learning for pure exploration in reinforcement learning