Methods for Stochastic, Noisy, and Derivative-Free Optimization

Existing nonlinear optimization methods have proven reliable over the past few decades for a wide range of applications, but they rely critically on accurate function and gradient evaluations. Modern nonlinear optimization problems arising from machine learning and scientific computing are increasingly complex and large scale, making accurate function evaluations intractable; at the same time, these applications may require only an approximate solution within some large tolerance or neighborhood of the true solution. One natural way to reduce the cost of each iteration is to use inaccurate function and gradient information, obtained by sampling or by tolerating computational noise. In this thesis, we develop scalable second-order methods for unconstrained optimization that can tolerate noise in the function and gradient. We consider settings where derivatives are available or unavailable (derivative-free optimization) under two frameworks of noise: (1) stochastic optimization, where the objective function is defined as the expected value of some function and only sampled function and gradient values can be obtained; and (2) noisy optimization, where function and gradient evaluations are contaminated with computational noise.

In the first project, we develop a new version of the L-BFGS algorithm for stochastic optimization. The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function. All of this appears to call for a full-batch approach, but since small batch sizes give rise to faster algorithms with better generalization properties, L-BFGS is currently not considered an algorithm of choice for large-scale machine learning applications.
One need not, however, choose between the two extremes represented by the full-batch and highly stochastic regimes, and may instead follow a progressive batching approach in which the sample size increases during the course of the optimization. Our method combines three basic components --- progressive batching, a stochastic line search, and stable quasi-Newton updating --- and performs well on training logistic regression models and deep neural networks. We provide supporting convergence theory for the method.

In the second project, we describe an extension of the BFGS and L-BFGS methods for the minimization of a nonlinear function subject to errors. This work is motivated by applications that contain computational noise, employ low-precision arithmetic, or are subject to statistical noise. The classical BFGS and L-BFGS methods can fail in such circumstances because the updating procedure can be corrupted and the line search can behave erratically. The proposed method addresses these difficulties and ensures that the BFGS update is stable by employing a lengthening procedure that spaces out the points at which gradient differences are collected. A new line search, designed to tolerate errors, works in conjunction with the lengthening procedure and guarantees that the Armijo-Wolfe conditions are satisfied under mild assumptions. The proposed methods are shown to enjoy convergence guarantees for strongly convex functions. Detailed implementations of the methods are presented, together with encouraging numerical results.

In the third project, we investigate an approach for derivative-free optimization that has not received sufficient attention in the literature and yet is one of the simplest to implement and parallelize. It consists of computing gradients of a smoothed approximation of the objective function (and constraints) and employing them within established codes.
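As a rough illustration of this approach, the following Python sketch builds a forward-difference gradient of a noisy quadratic, with the differencing interval h = 2*sqrt(eps_f/M) chosen from the noise level eps_f and a bound M on the second derivative, and feeds it to a plain gradient-descent loop standing in for an established gradient-based code. All names and constants here are illustrative assumptions, not the thesis's actual implementation.

```python
import math
import random

def fd_gradient(f, x, h):
    """Forward-difference gradient estimate with differencing interval h."""
    fx = f(x)
    g = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += h
        g.append((f(xp) - fx) / h)
    return g

# Noisy objective: a smooth quadratic plus bounded evaluation noise eps_f.
eps_f = 1e-6                    # noise level (assumed known or estimated)
M = 2.0                         # bound on the second derivative of the smooth part
h = 2.0 * math.sqrt(eps_f / M)  # interval balancing truncation and measurement error

random.seed(0)
def f(x):
    return sum(t * t for t in x) + eps_f * random.uniform(-1.0, 1.0)

# Feed the finite-difference gradients to a plain gradient-descent loop
# (standing in for an established gradient-based optimizer).
x = [1.0, -0.5]
for _ in range(100):
    g = fd_gradient(f, x, h)
    x = [xi - 0.1 * gi for xi, gi in zip(x, g)]

print(max(abs(t) for t in x))   # near zero; accuracy is limited by the noise level
```

The noise-aware choice of h is the crux: with a naive tiny interval, the difference f(x+h) - f(x) is dominated by noise of size eps_f and the quotient is useless, while too large an interval incurs a large truncation error.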
In this approach, it is essential that the noise level in the functions be known or estimable by means of difference tables or sampling. Gradient approximations are calculated by finite differences, with a differencing interval determined by the noise level and a bound on the second or third derivatives. The use of finite differences has been largely dismissed in the derivative-free optimization literature as too expensive in terms of function evaluations and/or as impractical when the objective function contains noise. The test results presented here suggest that such views should be re-examined and that the finite-difference approach has much to recommend it. The tests compared {\sc newuoa}, {\sc dfo-ls}, and {\sc cobyla} against the finite-difference approach on three classes of problems: general unconstrained problems, nonlinear least squares, and general nonlinear programs with equality constraints.

In the fourth project, we consider the adaptive estimation of the finite-difference interval for noisy derivative-free optimization. When the function is noisy, the optimal interval depends on the noise level and on higher-order derivatives of the function, which are often unavailable. Given the noise level of the function, we propose a bisection search for finding a finite-difference interval, for any finite-difference scheme, that balances the \textit{truncation error}, which arises from the error in the Taylor series approximation, and the \textit{measurement error}, which results from noise in the function evaluation. Our procedure produces near-optimal estimates of the finite-difference interval at low cost without knowledge of the higher-order derivatives. We show its numerical reliability and accuracy on a set of test problems. When combined with L-BFGS, we obtain a robust method for minimizing noisy black-box functions, as illustrated on a subset of synthetically noisy unconstrained CUTEst problems.
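The bisection idea can be sketched as follows: given a known noise level eps_f, a testing ratio compares a second-order difference of f against the noise; when the ratio is too small, the interval is noise-dominated and must grow, and when it is too large, truncation error dominates and the interval must shrink. This is only an illustrative sketch under those assumptions --- the target band, helper names, and constants are hypothetical, not the procedure's actual values.

```python
import math

def dd_ratio(f, x, h, eps_f):
    """Magnitude of a second-order central difference relative to the noise level.
    For smooth f this is roughly h^2 |f''(x)| / eps_f once h is large enough
    for the difference to rise above the noise floor."""
    return abs(f(x + h) - 2.0 * f(x) + f(x - h)) / eps_f

def bisect_interval(f, x, eps_f, r_lo=2.0, r_hi=10.0, h0=1e-8, max_iter=60):
    """Find an interval h whose testing ratio lies in [r_lo, r_hi] by geometric
    expansion followed by bisection on a log scale (an illustrative sketch)."""
    h = h0
    # Expand until the second-order difference rises above the noise floor.
    while dd_ratio(f, x, h, eps_f) < r_lo and h < 1e8:
        h *= 4.0
    lo, hi = h / 4.0, h
    for _ in range(max_iter):
        mid = math.sqrt(lo * hi)          # bisect on a logarithmic scale
        r = dd_ratio(f, x, mid, eps_f)
        if r < r_lo:
            lo = mid                      # noise-dominated: grow the interval
        elif r > r_hi:
            hi = mid                      # truncation-dominated: shrink it
        else:
            return mid
    return math.sqrt(lo * hi)

# Example: f(x) = exp(x) at x = 0, with an assumed noise level of 1e-6.
eps_f = 1e-6
f = math.exp                      # deterministic stand-in; eps_f bounds the noise
h = bisect_interval(f, 0.0, eps_f)
h_opt = 2.0 * math.sqrt(eps_f / 1.0)   # 2*sqrt(eps_f/|f''|) with f''(0) = 1
print(h, h_opt)                   # h lands within a small factor of h_opt
```

A ratio test of this kind needs no derivative information beyond the differences it already computes, which is what makes the near-optimal interval available at low cost.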
