Scalable Parallelization Strategy for Large-Scale Deep Learning

Recently, a myriad of applications have taken advantage of deep learning methods to solve regression and classification problems. Although deep neural networks have shown powerful learning capability, many deep learning applications suffer from the extremely time-consuming training of the networks. To reduce the training time, researchers usually turn to parallel training on distributed-memory systems. Synchronous Stochastic Gradient Descent (SGD) with data parallelism is the most popular parallelization strategy for neural network training: it guarantees the same convergence rate as sequential training, at the cost of expensive inter-process communication. Despite its poor scalability, many real-world deep learning applications adopt the synchronous parallel approach because of this convergence guarantee. In this thesis, we discuss how to improve the scalability of synchronous parallel training from several different angles. First, we propose a parallel training algorithm that overlaps communication with computation across model layers. Our overlapping strategy hides a large portion of the communication time behind the backpropagation computation, thereby improving the scaling efficiency. Second, we re-design the gradient computation method in data-parallel training. The proposed gradient computation algorithm not only reduces the communication cost but also enables overlapping the communication with the forward computation of the next iteration. Finally, we propose an adaptive hyper-parameter adjustment method that increases the degree of parallelism while maintaining good model accuracy. The proposed method gradually increases the batch size at run-time to balance the degree of parallelism against the generalization performance. These three research contributions address different performance issues in synchronous parallel training. Our performance evaluation results demonstrate that, by combining these contributions, synchronous parallel training can scale effectively on High-Performance Computing (HPC) platforms while achieving the same classification/regression performance as sequential training.
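
As a rough illustration of the first idea (hiding gradient communication behind backpropagation), the sketch below uses mpi4py's non-blocking Iallreduce to start each layer's gradient reduction as soon as that layer's gradient is available, while the backward computation for the remaining layers continues. This is a minimal sketch under an assumed MPI-based data-parallel setup; the layer shapes and gradient buffers are placeholders, not the thesis's actual implementation.

    # Minimal sketch (not the thesis's code): overlap layer-wise gradient
    # allreduce with the rest of backpropagation using non-blocking MPI.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    nprocs = comm.Get_size()

    # Hypothetical per-layer gradients, listed in the order backpropagation
    # produces them (last layer first). Real training would compute these.
    layer_shapes = [(256, 128), (128, 64), (64, 10)]
    local_grads = [np.random.rand(*s) for s in layer_shapes]
    summed_grads = [np.empty_like(g) for g in local_grads]

    requests = []
    for grad, out in zip(local_grads, summed_grads):
        # Launch the allreduce for this layer's gradient immediately, then
        # keep computing earlier layers' gradients; the communication
        # proceeds in the background and is hidden behind that work.
        requests.append(comm.Iallreduce(grad, out, op=MPI.SUM))
        # ... backpropagation for the next (earlier) layer would run here ...

    # Before the weight update, wait for all outstanding reductions and
    # average the summed gradients across processes.
    MPI.Request.Waitall(requests)
    averaged_grads = [g / nprocs for g in summed_grads]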
