Comparative study of distributed deep learning tools on supercomputers

Abstract

With the growth in the scale of datasets and neural networks, training time is increasing rapidly. Distributed parallel training has been proposed to accelerate deep neural network training, and most efforts target GPU clusters. This paper focuses on the performance of distributed parallel training on the CPU clusters of supercomputer systems. Using resources of the Tianhe-2 supercomputer, we conduct an extensive evaluation of popular deep learning tools, including Caffe, TensorFlow, and BigDL, on several deep neural network models, including AutoEncoder, LeNet, AlexNet, and ResNet. The experimental results show that Caffe performs best in communication efficiency and scalability. BigDL is the fastest in computing speed, benefiting from its CPU optimizations, but it suffers from long communication delays due to its dependency on the MapReduce framework. The insights and conclusions from our evaluation provide a significant reference for improving the resource utilization of supercomputers in distributed deep learning.
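To make the kind of workload evaluated here concrete, the sketch below shows data-parallel training of a LeNet-style model on CPU workers using TensorFlow's Keras API. It is a minimal illustration only, assuming a recent TensorFlow 2.x release and hypothetical host names; it does not reproduce the paper's actual cluster configuration, launch scripts, or the TensorFlow version deployed on Tianhe-2.

```python
# Minimal sketch (not from the paper): data-parallel training on CPU workers
# with TensorFlow's MultiWorkerMirroredStrategy. Host names are hypothetical.
# The same script is launched once per node, each with its own "index".
import os
import json
import tensorflow as tf

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node01:12345", "node02:12345"]},  # hypothetical hosts
    "task": {"type": "worker", "index": 0},  # set 0, 1, ... per worker
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # LeNet-style CNN, one of the model families evaluated in the paper.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(6, 5, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(16, 5, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120, activation="relu"),
        tf.keras.layers.Dense(84, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# MNIST as a stand-in dataset; gradients are synchronized across workers
# after each step, so the global batch is split among the CPU nodes.
(x, y), _ = tf.keras.datasets.mnist.load_data()
x = (x / 255.0)[..., None].astype("float32")
model.fit(x, y, batch_size=256, epochs=1)
```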

Citation (APA)

Du, X., Kuang, D., Ye, Y., Li, X., Chen, M., Du, Y., & Wu, W. (2018). Comparative study of distributed deep learning tools on supercomputers. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11334 LNCS, pp. 122–137). Springer Verlag. https://doi.org/10.1007/978-3-030-05051-1_9
