Towards Understanding Deep Learning via Statistical Physics

Deep neural networks have achieved remarkable success in the past decade on tasks that were out of reach prior to the era of deep learning. Among the myriad reasons for these successes are powerful computational resources, large datasets, new optimization algorithms, and modern architecture designs. Yet most of these factors remain poorly understood, owing to the lack of direct theory and analysis techniques. Statistical physics, on the other hand, has developed over many years a rich body of theories and models that apply to real-world problems. The main goal of this dissertation is to generalize physical theories to the regime of deep learning, in order to deepen our understanding of the phenomena observed in the development and application of deep learning and, ultimately, to improve it. At the same time, these generalized theories and techniques extend the reach of physics beyond its traditional domains.

The main part of this dissertation analyzes stochastic gradient descent (SGD), an optimization algorithm widely used in deep learning, by generalizing the fluctuation-dissipation theorem and the theory of thermophoresis. SGD is the core optimization method for deep neural networks, yet despite some theoretical progress it remains unclear why SGD leads the learning dynamics of overparameterized networks to solutions that generalize well. We study the effects of SGD from two perspectives: one concerns the later part of training, while the other is most significant during the early phase of training.

For the later part of training, we use a generalized fluctuation-dissipation theorem to show that, for overparameterized networks with a degenerate valley in their loss landscape, SGD on average decreases the trace of the Hessian of the loss. We also generalize this result to other noise structures and show that isotropic noise in the non-degenerate subspace of the Hessian decreases its determinant. Beyond explaining SGD's role in sculpting the Hessian spectrum, this opens the door to new optimization approaches that may confer better generalization performance. We test our results with experiments on toy models and on deep neural networks.

For the early phase of training, we prove that SGD biases the model toward regions of lower gradient variance through an effective thermal force. Specifically, SGD reduces the activation rate of hidden nodes in a deep neural network while also restricting the weight norm. Regions of lower gradient variance can be interpreted as cold regions, so this effect is a form of thermophoresis in deep learning. We generalize the physical theory, establish the existence of this effect in SGD, and show that its strength is proportional to the squared learning rate and inversely proportional to the batch size. Together with results from learning theory, this sheds light on why the early phase of SGD is essential for optimization.
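To make these two effects concrete, the following schematic summary uses notation introduced here for illustration rather than taken from the dissertation. Writing one step of minibatch SGD as
$$\theta_{t+1} = \theta_t - \eta\,\nabla L(\theta_t) + \frac{\eta}{\sqrt{B}}\,\xi_t, \qquad \mathbb{E}[\xi_t]=0, \quad \mathrm{Cov}[\xi_t]=\Sigma(\theta_t),$$
with learning rate $\eta$, batch size $B$, and per-example gradient covariance $\Sigma$, the late-training result can be read as the statement that the noise-induced drift along a degenerate valley satisfies $\mathbb{E}[\Delta\,\mathrm{tr}\,H]\le 0$, where $H$ is the Hessian of the loss. The early-training result corresponds, schematically, to an effective thermophoretic force directed down the gradient-variance landscape,
$$F_{\mathrm{thermo}} \propto -\frac{\eta^{2}}{B}\,\nabla\,\mathrm{tr}\,\Sigma(\theta),$$
whose $\eta^{2}/B$ scaling matches the stated proportionality to the squared learning rate and inverse batch size.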
Furthermore, this dissertation applies mean-field theory to a well-known technique in deep learning, batch normalization (BatchNorm). BatchNorm is an extremely useful component of modern neural network architectures, enabling optimization with higher learning rates and achieving faster convergence. We use mean-field theory to analytically quantify the impact of BatchNorm on the geometry of the loss landscape for multi-layer networks consisting of fully-connected and convolutional layers. We show that BatchNorm has a flattening effect on the loss landscape, as quantified by the maximum eigenvalue of the Fisher Information Matrix. These findings justify the use of larger learning rates for networks that use BatchNorm, and we provide a quantitative characterization of the maximal allowable learning rate that ensures convergence. Experiments support the theoretically predicted maximum learning rate and further suggest that networks with smaller values of the BatchNorm parameter $\gamma$ achieve lower loss after the same number of epochs of training.
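As a rough guide to how such a learning-rate characterization arises, note that under a quadratic approximation of the loss, plain gradient descent is stable only for learning rates below the familiar bound
$$\eta < \eta^{*} \sim \frac{2}{\lambda_{\max}},$$
where $\lambda_{\max}$ is here taken to be the maximum eigenvalue of the Fisher Information Matrix used above as the flatness measure; the precise mean-field expression derived in the dissertation may differ. Because BatchNorm lowers this eigenvalue, it raises the maximal allowable learning rate.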
