On the Optimality and Complexity of Reinforcement Learning

In this dissertation, we aim to develop reinforcement learning (RL) algorithms that achieve optimality with provable complexity guarantees in various settings. Specifically, in Markov decision processes (MDPs), we study single-agent online RL, multi-agent online RL, and offline RL in the presence of unobserved confounders.

Single-agent online RL. We design a single-timescale actor-critic method for single-agent RL, in which the actor and critic are updated simultaneously. Specifically, in each iteration, the critic update is obtained by applying the Bellman evaluation operator only once, while the actor is updated in the policy gradient direction computed using the critic (see the illustrative sketch following this abstract). We prove that the actor sequence converges to a globally optimal policy at a sublinear $O(K^{-1/2})$ rate, where $K$ is the number of iterations.

Multi-agent online RL. We study discrete-time mean-field Markov games with an infinite number of agents, where each agent aims to minimize its ergodic cost. Specifically, we consider the linear-quadratic case, in which the agents share identical linear state transitions and quadratic cost functions (a generic form of this setting is sketched after this abstract). For such a game, we provide sufficient conditions for the existence and uniqueness of its Nash equilibrium, and we propose a model-free mean-field actor-critic algorithm. In particular, we prove that our algorithm converges to the Nash equilibrium at a linear rate.

Offline RL with unobserved confounders. We study offline RL in the face of unobserved confounders. Offline RL typically faces two significant challenges: (i) the agent may be confounded by unobserved state variables, and (ii) the offline data, collected a priori, may not provide sufficient coverage of the environment. To tackle these challenges, we study policy learning in confounded MDPs with the aid of instrumental variables (IVs). Specifically, we establish value- and ratio-based identification results for the expected total reward. Then, by leveraging pessimism together with these identification results, we propose various policy learning methods with finite-sample suboptimality guarantees for finding the optimal in-class policy under minimal data coverage and modeling assumptions.
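For illustration, the following is a minimal sketch of a single-timescale actor-critic update on a small tabular MDP. It is a hypothetical example, not the dissertation's algorithm or analysis setting: the transition kernel P and reward table R are assumed known so that the Bellman evaluation operator can be applied exactly once per iteration, followed by a single policy-gradient step using the resulting critic.

# Minimal single-timescale actor-critic sketch on a small tabular MDP.
# Illustrative only; the MDP (P, R) is assumed known so the Bellman
# evaluation operator can be applied exactly once per iteration.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition kernel P[s, a, s']
R = rng.uniform(size=(nS, nA))                  # reward table R[s, a]

theta = np.zeros((nS, nA))   # actor: softmax policy parameters
V = np.zeros(nS)             # critic: tabular value estimate
alpha = 0.1                  # actor step size

def policy(theta):
    # Softmax policy pi[s, a] from tabular parameters.
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for _ in range(200):
    pi = policy(theta)

    # Critic update: apply the Bellman evaluation operator to V exactly once.
    r_pi = (pi * R).sum(axis=1)              # expected one-step reward under pi
    P_pi = np.einsum("sa,sap->sp", pi, P)    # state-to-state kernel under pi
    V = r_pi + gamma * P_pi @ V

    # Actor update: one policy-gradient step computed from the current critic.
    Q = R + gamma * P @ V                             # Q[s, a] implied by the critic
    A = Q - (pi * Q).sum(axis=1, keepdims=True)       # advantage estimate
    theta += alpha * pi * A   # softmax policy gradient (uniform state weighting for simplicity)

print(policy(theta).round(2))

The point being illustrated is that the actor and critic are updated simultaneously within the same iteration, rather than running the critic to convergence between actor updates.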
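For context, one common way to write a discrete-time linear-quadratic mean-field setup with ergodic cost is shown below; this is a generic formulation from the literature and not necessarily the exact model analyzed in the dissertation. Here $x_t$ is an agent's state, $u_t$ its control, $\bar{x}_t$ the population-average (mean-field) state, and $A$, $B$, $Q$, $R$ are the shared system and cost matrices.
\[
  x_{t+1} = A x_t + B u_t + \varepsilon_{t+1}, \qquad
  J(\pi) = \limsup_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}\!\left[ \sum_{t=0}^{T-1}
  \big( (x_t - \bar{x}_t)^\top Q\, (x_t - \bar{x}_t) + u_t^\top R\, u_t \big) \right].
\]
In such a game, a Nash equilibrium requires each agent's policy to be optimal for the cost induced by the mean-field flow $\{\bar{x}_t\}$, while that flow is in turn generated by all agents following the same policy.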
