Distributed Newton Methods for Deep Neural Networks
Abstract: Deep learning involves a difficult non-convex optimization problem with a large number of weights between any two adjacent layers of a deep structure. To handle large data sets or complicated networks, distributed training is needed, but the calculation of the function, gradient, and Hessian is expensive. In particular, the communication and synchronization costs may become a bottleneck. In this paper, we focus on situations where the model is stored in a distributed manner, and propose a novel distributed Newton method for training deep neural networks. Through variable-wise and feature-wise data partitions, together with some careful design choices, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Several techniques are incorporated to reduce the running time as well as the memory consumption. First, to reduce the communication cost, we propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, we consider subsampled Gauss-Newton matrices for reducing the running time as well as the communication cost. Third, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even though some nodes have not finished their tasks. Implementation issues in distributed environments are investigated in detail. Experiments demonstrate that the proposed method is effective for the distributed training of deep neural networks. Compared with stochastic gradient methods, it is more robust and may give better test accuracy.
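The matrix-vector product idea mentioned in the abstract can be illustrated with a minimal single-machine sketch. It is not the paper's distributed implementation. It assumes a linear model with squared loss, so the per-sample Jacobian is simply the feature row and the Gauss-Newton matrix is lam*I + (1/m) J^T J. The function names (gauss_newton_vec, conjugate_gradient, newton_step) and the parameters lam, sample_size, and seed are hypothetical and chosen only for illustration. The gradient uses all data, while the Gauss-Newton matrix may be built from a random subsample and is never formed explicitly.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gauss_newton_vec(jac_rows, v, lam):
    """Return (lam*I + (1/m) J^T J) v using only products with J and J^T."""
    m = len(jac_rows)
    out = [lam * vi for vi in v]
    for row in jac_rows:
        jv = dot(row, v)              # J_i v (one scalar output per sample)
        for k, r in enumerate(row):
            out[k] += r * jv / m      # accumulate J_i^T (J_i v)
    return out


def conjugate_gradient(matvec, b, max_iter=50, tol=1e-10):
    """Approximately solve A x = b for a positive definite A given as matvec."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs <= tol:                 # squared residual norm is small enough
            break
        Ap = matvec(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def newton_step(w, X, y, lam=1e-3, sample_size=None, seed=0):
    """One Newton-type step for (1/2n) sum (w.x - y)^2 + (lam/2) ||w||^2."""
    n = len(X)
    # Full gradient over all n samples.
    grad = [lam * wi for wi in w]
    for xi, yi in zip(X, y):
        res = dot(w, xi) - yi
        for k, xk in enumerate(xi):
            grad[k] += res * xk / n
    # Optional subsample used only for the Gauss-Newton matrix.
    if sample_size is None:
        idx = range(n)
    else:
        idx = random.Random(seed).sample(range(n), sample_size)
    rows = [X[i] for i in idx]
    # Solve G d = -grad with CG, touching G only through matrix-vector products.
    d = conjugate_gradient(lambda v: gauss_newton_vec(rows, v, lam),
                           [-g for g in grad])
    return [wi + di for wi, di in zip(w, d)]


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
w_full = newton_step([0.0, 0.0], X, y)                 # all samples
w_sub = newton_step([0.0, 0.0], X, y, sample_size=2)   # subsampled matrix
```

In this toy problem the objective is quadratic, so the step that uses all samples lands close to the least-squares solution [1, 2]. The subsampled step is only an approximation of that direction. For a deep network the rows of J would come from the network's Jacobian, and the sum over samples would be split across machines.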
Recommendations
- Distributed Newton Method for Large-Scale Consensus Optimization
- Network Newton Distributed Optimization Methods
- Distributed Newton Optimization With Maximized Convergence Rate
- Accelerated Distributed Nesterov Gradient Descent
- Distributed Newton's Method for Network Cost Minimization
- Distributed approximate Newton algorithms and weight design for constrained optimization
- A Fast Distributed Asynchronous Newton-Based Optimization Algorithm
- Newton-like method with diagonal correction for distributed optimization
- Distributed adaptive Newton methods with global superlinear convergence
- Fast Distributed Gradient Methods
Cites work
- scientific article; zbMATH DE number 784362
- A distributed block coordinate descent method for training l₁ regularized linear classifiers
- Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent
- Large-scale machine learning with stochastic gradient descent
- On the use of stochastic Hessian information in optimization methods for machine learning
- Some methods of speeding up the convergence of iteration methods
- Subsampled Hessian Newton Methods for Supervised Learning
Cited in (9)
- An inertial Newton algorithm for deep learning
- Block layer decomposition schemes for training deep neural networks
- A framework for parallel and distributed training of neural networks
- Newton-like method with diagonal correction for distributed optimization
- Distributed Deep Learning on Heterogeneous Computing Resources Using Gossip Communication
- Regulation cooperative control for heterogeneous uncertain chaotic systems with time delay: a synchronization errors estimation framework
- A distributed optimisation framework combining natural gradient with Hessian-free for discriminative sequence training
- Neural nets with a Newton conjugate gradient method on multiple GPUs
- Layer-wise adaptive gradient sparsification for distributed deep learning with convergence guarantees