Methods for Large Scale Nonlinear and Stochastic Optimization

Public Deposited

The goal of this thesis is to develop practical algorithms with sound theoretical properties for three different subfields of nonlinear optimization: (i) Optimization Algorithms for Supervised Machine Learning; (ii) Derivative-Free Optimization; and (iii) Distributed Optimization. As such, the thesis is divided into three main chapters. The focus of Chapter 2 is on optimization methods for supervised machine learning. Section 2.1 describes a batch method that uses a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employs second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This can cause difficulties because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the process can be unstable. We show how to perform stable quasi-Newton updating in the multi-batch setting, study the convergence properties for both convex and nonconvex functions, and illustrate the behavior of the algorithm on a distributed computing platform on binary classification logistic regression and neural network training tasks that arise in machine learning. In Section 2.2 we investigate the concepts of sketching and subsampling in conjunction with optimization. Specifically, we study Newton-Sketch and Subsampled Newton methods for the finite-sum optimization problem. We consider practical versions of the two methods in which the Newton equations are solved approximately using the conjugate gradient (CG) method or a stochastic gradient iteration. We establish new complexity results for the Subsampled Newton-CG method that exploit the spectral properties of CG. Controlled numerical experiments compare the relative strengths of Newton-Sketch and Subsampled Newton methods, and show that for many finite-sum problems they are far more efficient than popular first-order methods. Minimizing a noisy black-box function with many variables remains an open problem in optimization. Current methodology is based on derivative-free methods designed for small scale deterministic problems. In Chapter 3, we advance the state-of-the-art by proposing a finite difference quasi-Newton algorithm that can deal with stochastic noise and that scales into the thousands of variables. The method takes advantage of the scalability and power of BFGS updating, and employs an adaptive procedure for choosing the differencing interval h based on an estimate of the noise level in the functions. This noise estimation procedure and the selection of h are inexpensive, but not always accurate, and to prevent failures the algorithm incorporates a recovery procedure that takes appropriate action in the case when the line search procedure is unable to produce an acceptable point. A novel convergence analysis is presented that considers the effect of a noisy line search procedure. Numerical experiments comparing the method to a model based trust region method are presented. Methods for distributed optimization have received significant attention in recent years owing to their wide applicability. A distributed optimization method typically consists of two key components: communication and computation. The standard way of judging an algorithm via only the number of iterations overlooks the complexity associated with each iteration. Motivated by this discrepancy, in Chapter 4 we propose an adaptive cost framework which adjusts the cost measure depending on the features of various applications. We present a flexible algorithmic framework, where communication and computation steps are explicitly decomposed to enable algorithm customization for various applications. We apply this framework to the well-known distributed gradient descent (DGD) method, and propose a class of customized algorithms that compare favorably to their base algorithms, both theoretically and empirically. Numerical experiments illustrate the performance and flexibility of the methods, as well as practical variants, on quadratic functions and classification problems that arise in machine learning, in terms of iterations, gradient evaluations, communications and the proposed cost framework

Last modified
  • 02/22/2019
Date created
Resource type
Rights statement