
Topics in Machine Learning Optimization


Machine learning and deep learning have produced many theoretical and empirical breakthroughs, are widely applied in fields such as machine translation, speech recognition, image recognition, and recommendation systems, and attract a great number of researchers and practitioners. Optimization is one of their core components: the essence of most machine learning and deep learning algorithms is to build an optimization model and learn the parameters of the objective function from the given data. With the exponential growth of data and the increasing complexity of models, optimization methods in machine learning face ever greater challenges, and in the era of immense data the effectiveness and efficiency of numerical optimization algorithms dramatically influence how widely machine learning and deep learning models are adopted. In this study, we propose several effective optimization algorithms for different optimization problems, which improve the performance and efficiency of machine learning and deep learning methods. This dissertation consists of four chapters: 1) Stochastic Large-scale Machine Learning Algorithms with Distributed Features and Observations, 2) Convergence Analyses of Online ADAM, 3) Topic Analysis for Text with Side Data, and 4) Tricks and Plugins to GBM on Images and Sequences.

In the first chapter, we propose a general stochastic offline algorithm in which observations, features, and gradient components can all be sampled in a doubly distributed setting, i.e., with both features and observations distributed. Rigorous analyses establish convergence properties of the algorithm under different conditions on the learning rate (diminishing to zero or constant). Computational experiments in Spark demonstrate superior performance of our algorithm versus a benchmark in early iterations, which is due to the stochastic components of the algorithm.

In the second chapter, we explore how to apply optimization algorithms with a fixed learning rate to online learning. Online learning is an appealing paradigm of great practical interest due to the recent emergence of large-scale applications. Standard online learning assumes a finite number of samples, while in practice data is streamed indefinitely, and in such a setting gradient descent with a diminishing learning rate does not work. We first introduce regret with rolling window, a performance metric that measures the performance of an algorithm on every fixed number of contiguous samples. We then propose a family of algorithms with a constant or adaptive learning rate and provide rigorous analyses establishing regret bounds. In the convex setting we show regret of the order of the square root of the window size under both the constant and dynamic learning rate scenarios. Our proof also applies to the standard online setting, where we establish the same regret order (previous proofs contain flaws). We also study a two-layer neural network with ReLU activation and establish that, if the initial weights are close to a stationary point, the same regret bound is attainable. Computational experiments demonstrate superior performance of the proposed algorithms.
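
As a rough illustration of the rolling-window metric from the second chapter, the following Python snippet is a minimal sketch, not the dissertation's algorithms or code: it runs online gradient descent with a constant learning rate on a stream and compares its cumulative loss on every window of W contiguous samples against the best fixed linear predictor for that window. The squared loss, the function name rolling_window_regret, and the parameter values are illustrative assumptions.

    import numpy as np

    def rolling_window_regret(X, y, lr=0.05, W=100):
        # Online gradient descent with a constant learning rate (assumed squared loss).
        n, d = X.shape
        w = np.zeros(d)
        online_loss = np.empty(n)
        for t in range(n):
            err = X[t] @ w - y[t]
            online_loss[t] = 0.5 * err ** 2
            w -= lr * err * X[t]  # constant learning rate update
        # Regret over every window of W contiguous samples: online loss minus the
        # loss of the best fixed linear predictor for that window, found offline.
        regrets = []
        for s in range(n - W + 1):
            Xw, yw = X[s:s + W], y[s:s + W]
            w_star, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
            best_loss = 0.5 * np.sum((Xw @ w_star - yw) ** 2)
            regrets.append(online_loss[s:s + W].sum() - best_loss)
        return max(regrets) if regrets else 0.0

Taking the maximum over windows reflects that the metric evaluates the algorithm on every stretch of W contiguous samples rather than only on the full stream.
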
In the third chapter, we employ text with side data to tackle limitations of latent factor models (e.g., matrix factorization) such as cold start and non-transparency. We introduce a hybrid generative probabilistic model that combines a neural network with a latent topic model in a four-level hierarchical Bayesian model. In this model, each document is modeled as a finite mixture over an underlying set of topics, each topic is modeled as an infinite mixture over an underlying set of topic probabilities, and each topic probability is in turn modeled as a finite mixture over the side data. In the context of text, the neural network provides an overall distribution over the side data for the corresponding text, which serves as the prior distribution in LDA to help perform topic grouping. The approach is evaluated on several datasets, where the model is shown to outperform standard LDA and Dirichlet-multinomial regression (DMR) in terms of topic grouping, model perplexity, classification, and comment generation.

In the fourth chapter, we propose a new algorithm for boosting deep convolutional neural networks (BoostCNN) that combines the merits of dynamic feature selection and BoostCNN, as well as another new family of algorithms that combine boosting and transformers. To learn these models, we introduce subgrid selection and importance sampling strategies and propose a set of algorithms that incorporate boosting weights into a deep learning architecture based on a least squares objective function. These algorithms not only reduce the manual effort required to find an appropriate network architecture but also yield superior performance and lower running time. Experiments show that the proposed methods outperform benchmarks on several fine-grained classification tasks.

A systematic retrospective and summary of optimization methods from the perspective of machine learning is of great significance and can offer guidance for further developments in both optimization and machine learning research.
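
To make the least squares boosting idea from the fourth chapter more concrete, here is a hypothetical Python sketch in which each boosting round fits a weak learner to the current residuals with a least squares objective. The linear weak learner, the names fit_weak_learner and least_squares_boost, and the shrinkage value are assumptions for illustration only; the actual BoostCNN-style algorithms use deep convolutional networks or transformers together with subgrid selection and importance sampling, which are omitted here.

    import numpy as np

    def fit_weak_learner(X, residual, ridge=1e-3):
        # Assumed weak learner: ridge-regularized linear least squares fit to the residuals.
        A = X.T @ X + ridge * np.eye(X.shape[1])
        return np.linalg.solve(A, X.T @ residual)

    def least_squares_boost(X, y, rounds=10, shrinkage=0.1):
        pred = np.zeros(len(y))
        learners = []
        for _ in range(rounds):
            residual = y - pred                # residuals used as this round's targets
            w = fit_weak_learner(X, residual)  # least squares fit to the residuals
            pred += shrinkage * (X @ w)        # add the weak learner's scaled contribution
            learners.append(w)
        return learners, pred

Replacing the linear fit with a network trained on the same least squares targets gives the flavor of how boosting weights can be folded into a deep learning architecture.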
