Methods for Deterministic and Stochastic Optimization

The goal of this thesis is to design practical algorithms for nonlinear optimization when the objective function is deterministic or stochastic. Problems of this nature arise in many applications, including machine learning and image processing. The thesis is divided into four main chapters: Chapters 2, 3, and 4 deal with stochastic or finite-sum optimization problems, and Chapter 5 deals with deterministic optimization problems.

Chapter 2 is concerned with solving stochastic optimization problems in which approximations to the gradient and Hessian are obtained through subsampling. We first consider Newton-like methods that employ these approximations and discuss how to coordinate the accuracy of the gradient and Hessian approximations to yield a superlinear rate of convergence in expectation. The second part of the chapter analyzes an inexact Newton method that solves linear systems approximately using the conjugate gradient (CG) method and that subsamples the Hessian but not the gradient (the gradient is assumed to be exact). We provide a complexity analysis for this method based on the properties of the CG iteration and the quality of the Hessian approximation, and compare it with a method that employs a stochastic gradient iteration in place of the CG method. We report numerical results that illustrate the performance of inexact subsampled Newton methods on machine learning applications based on logistic regression.

The focus of Chapter 3 is a stochastic optimization method that adaptively controls the sample size used to compute gradient approximations. Unlike other variance reduction techniques, which require either additional storage or the regular computation of full gradients, the proposed method reduces variance by increasing the sample size as needed. The decision to increase the sample size is triggered by an inner product test that ensures that search directions are descent directions with high probability. We show that the inner product test improves upon the well-known norm test and can serve as the basis for an algorithm that is globally convergent on nonconvex functions and enjoys a global linear rate of convergence on strongly convex functions. Numerical experiments on logistic regression and nonlinear least squares problems illustrate the performance of the algorithm.

The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function. All of this appears to call for a full-batch approach, but since small batch sizes give rise to faster algorithms with better generalization properties, L-BFGS is currently not considered an algorithm of choice for large-scale machine learning applications. One need not, however, choose between the two extremes represented by the full-batch and highly stochastic regimes; one may instead follow a progressive batching approach in which the sample size increases during the course of the optimization. In Chapter 4, we present a new version of the L-BFGS algorithm that combines three basic components (progressive batching, a stochastic line search, and stable quasi-Newton updating) and that performs well on training logistic regression models and deep neural networks. We provide supporting convergence theory for the method.
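
To make the Chapter 2 setting concrete, the following is a minimal sketch of one inexact subsampled Newton-CG step for regularized logistic regression: the gradient is exact, the Hessian is estimated on a random subsample, and the linear system is solved only approximately by capping the number of CG iterations. The function name, batch size, iteration cap, and regularization constant are illustrative assumptions, not the exact algorithm analyzed in the thesis.

# Sketch of one inexact subsampled Newton-CG step for regularized logistic
# regression: exact gradient, Hessian estimated on a random subsample S, and
# an approximate linear solve obtained by capping the number of CG iterations.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def subsampled_newton_cg_step(w, X, y, batch_size=256, cg_iters=10, reg=1e-4):
    """One step: solve H_S d = -grad approximately and move to w + d."""
    n, d = X.shape
    p = 1.0 / (1.0 + np.exp(-X @ w))          # model probabilities
    grad = X.T @ (p - y) / n + reg * w        # exact (full) gradient

    # Hessian approximation built from a random subsample S.
    S = np.random.choice(n, size=min(batch_size, n), replace=False)
    Xs, Ds = X[S], p[S] * (1.0 - p[S])        # diagonal weights on the sample

    def hessvec(v):                           # Hessian-vector product over S only
        return Xs.T @ (Ds * (Xs @ v)) / len(S) + reg * v

    H_S = LinearOperator((d, d), matvec=hessvec)
    direction, _ = cg(H_S, -grad, maxiter=cg_iters)   # inexact solve
    return w + direction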
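
The inner product test of Chapter 3 can be sketched as follows: using the individual gradients in the current sample, the test checks whether the variance of their inner products with the batch gradient is small relative to the norm of the batch gradient, and suggests a larger sample when it is not. The parameter theta and the rule for sizing the new sample are illustrative assumptions for this sketch.

# Sketch of an inner-product-style test for adaptive sampling: grow the batch
# when the sampled gradients do not guarantee a descent direction with high
# enough probability. theta and the growth rule are illustrative assumptions.
import numpy as np

def inner_product_test(per_sample_grads, theta=0.9):
    """per_sample_grads: array of shape (|S|, d) holding g_i for each i in the sample S."""
    S = per_sample_grads.shape[0]
    g_S = per_sample_grads.mean(axis=0)              # batch gradient
    inner = per_sample_grads @ g_S                   # g_i^T g_S for each i in S
    var_term = inner.var(ddof=1) / S                 # sample variance, scaled by 1/|S|
    bound = theta**2 * np.dot(g_S, g_S)**2           # theta^2 * ||g_S||^4
    passed = var_term <= bound
    # If the test fails, suggest a sample size large enough for the bound to hold.
    suggested = S if passed else int(np.ceil(S * var_term / bound))
    return passed, suggested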
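
A structural sketch of the progressive batching L-BFGS idea of Chapter 4 appears below: the batch grows over the course of the run, the line search is performed on the sampled objective, and curvature pairs are stored only when they are safely positive. The geometric growth rule, Armijo constants, and curvature-skipping threshold are illustrative assumptions rather than the tests used in the thesis.

# Structural sketch of a progressive-batching L-BFGS loop. f(w, S) and g(w, S)
# are assumed to return the sampled objective and gradient on index set S.
import numpy as np
from collections import deque

def two_loop(grad, pairs):
    """Standard L-BFGS two-loop recursion; pairs holds (s, y) curvature pairs."""
    q, alphas = grad.copy(), []
    for s, y in reversed(pairs):
        a = (s @ q) / (y @ s)
        q -= a * y
        alphas.append(a)
    if pairs:
        s, y = pairs[-1]
        q *= (s @ y) / (y @ y)                   # initial Hessian scaling
    for (s, y), a in zip(pairs, reversed(alphas)):
        b = (y @ q) / (y @ s)
        q += (a - b) * s
    return -q                                    # search direction

def progressive_batch_lbfgs(f, g, w, n, iters=100, batch0=64, growth=1.1, m=10):
    pairs, batch = deque(maxlen=m), batch0
    for _ in range(iters):
        S = np.random.choice(n, size=min(int(batch), n), replace=False)
        grad = g(w, S)
        d = two_loop(grad, list(pairs))
        # Backtracking (Armijo) line search on the sampled objective.
        alpha, f0 = 1.0, f(w, S)
        while f(w + alpha * d, S) > f0 + 1e-4 * alpha * (grad @ d) and alpha > 1e-6:
            alpha *= 0.5
        w_new = w + alpha * d
        s, y = w_new - w, g(w_new, S) - grad
        if y @ s > 1e-8 * (s @ s):               # keep only stable curvature pairs
            pairs.append((s, y))
        w, batch = w_new, batch * growth         # progressively larger batches
    return w
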
Chapter 5 describes an acceleration scheme for multistep deterministic optimization algorithms. The extrapolated solution is written as a nonlinear average of the iterates produced by the original optimization algorithm. Our scheme does not need the underlying fixed-point operator to be symmetric; hence it handles primal-dual methods such as Chambolle-Pock. The weights are computed via a simple linear system, and we analyze performance in both online and offline modes. We use Crouzeix’s conjecture to show that acceleration is controlled by the solution of a Chebyshev problem on the numerical range of a nonsymmetric operator modeling the behavior of iterates near the optimum. We report numerical results that illustrate the performance of the acceleration scheme on image processing and logistic regression problems.
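
As a simple illustration of the extrapolation step described in Chapter 5, the sketch below computes weights from one small regularized linear system built on successive differences of the iterates and returns their weighted average. The use of successive differences as residuals and the regularization constant lam are assumptions made for this sketch; the online/offline variants and the Crouzeix-based analysis are not reproduced here.

# Minimal sketch of an offline extrapolation step: the accelerated point is a
# weighted (nonlinear) average of stored iterates, with weights obtained from
# one small regularized linear system built on successive differences.
import numpy as np

def extrapolate(iterates, lam=1e-8):
    """iterates: list of points x_0, ..., x_k produced by the base method."""
    X = np.stack(iterates, axis=1)            # shape (d, k+1)
    R = X[:, 1:] - X[:, :-1]                  # residual matrix of differences
    k = R.shape[1]
    # Solve (R^T R + lam * I) z = 1, then normalize so the weights sum to one.
    M = R.T @ R + lam * np.eye(k)
    z = np.linalg.solve(M, np.ones(k))
    c = z / z.sum()
    return X[:, :k] @ c                       # weighted average of the iterates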
