Provably Efficient Reinforcement Learning

While deep reinforcement learning achieves tremendous success in practice, its efficiency is rarely understood in theory. This dissertation contains three parts, each corresponding to the study of an independent theoretical reinforcement learning problem. Together, the three parts discuss how (deep) reinforcement learning efficiently solves decision-making problems in (partially observable) Markov decision processes (MDPs) with respect to computational and sample complexity.

The first part, Neural Temporal-Difference and Q-Learning Provably Converge to Global Optima, studies how gradient-based algorithms make reinforcement learning computationally tractable in the presence of overparametrized neural network approximations. In particular, it shows that the temporal-difference algorithm converges to the global optimum of the mean-squared Bellman error in polynomial time, even though the objective lacks an unbiased gradient estimator.

The second part, Provably Efficient Exploration in Policy Optimization, studies how policy-based online reinforcement learning algorithms can provably explore the environment and find the optimal policy. Specifically, the results are established under a finite-dimensional linear combination assumption on the transition kernel of the MDP. It shows that the regret of the algorithm is upper bounded by a quantity scaling with the square root of the number of transition samples, which matches the theoretical lower bound.

The third part, Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency, studies online reinforcement learning in partially observable Markov decision processes (POMDPs). It assumes that both the transition and emission kernels of the POMDP admit a low-rank linear representation and proves that an optimistic algorithm can efficiently find the optimal policy of the POMDP even with infinite state and observation spaces.
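
As a rough sketch of the central objects in the three parts, the LaTeX display below paraphrases them in illustrative notation rather than the dissertation's own (the symbols Q_\theta, \phi, \theta^\star, \psi, \mu, \varphi, \nu are placeholder names): the mean-squared Bellman error minimized by neural TD/Q-learning, a transition kernel written as a finite-dimensional linear combination of features together with a square-root regret bound, and a low-rank factorization of the POMDP transition and emission kernels.

% Illustrative sketch only; feature maps and symbols are placeholder notation,
% not necessarily the dissertation's. Compiling these displays in a document
% requires the amsmath/amssymb packages.

% Part 1: mean-squared Bellman error for a neural estimate Q_theta.
% The inner expectation over s' sits inside the square, which is why the
% objective lacks an unbiased stochastic gradient (the double-sampling issue).
% For TD (policy evaluation), the max over a' is replaced by an expectation
% under the evaluated policy.
\[
  \mathrm{MSBE}(\theta)
  = \mathbb{E}_{(s,a)\sim\mu}\Big[\big(Q_\theta(s,a) - r(s,a)
    - \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[\max_{a'} Q_\theta(s',a')\big]\big)^{2}\Big]
\]

% Part 2: transition kernel as a finite-dimensional linear combination of
% known features, and a regret upper bound scaling with the square root of
% the number of transition samples T.
\[
  P(s'\mid s,a) = \big\langle \phi(s,a,s'),\, \theta^{\star} \big\rangle,
  \qquad \mathrm{Regret}(T) = \widetilde{O}\big(\sqrt{T}\big)
\]

% Part 3: low-rank linear representation of the POMDP transition and
% emission kernels, which allows infinite state and observation spaces.
\[
  P(s'\mid s,a) = \big\langle \psi(s,a),\, \mu(s') \big\rangle,
  \qquad O(o\mid s) = \big\langle \varphi(s),\, \nu(o) \big\rangle
\]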
