Effects of depth, width, and initialization: a convergence analysis of layer-wise training for deep linear neural networks
Publication:5037872
Abstract: Deep neural networks have been used in various machine learning applications and have achieved tremendous empirical success. However, training deep neural networks is a challenging task. Many alternatives to end-to-end back-propagation have been proposed. Layer-wise training is one of them: it trains a single layer at a time rather than training all layers simultaneously. In this paper, we study layer-wise training using block coordinate gradient descent (BCGD) for deep linear networks. We establish a general convergence analysis of BCGD and identify the optimal learning rate, which yields the fastest decrease in the loss. Importantly, the optimal learning rate can be applied directly in practice, as it does not require any prior knowledge; no tuning of the learning rate is needed. We also identify the effects of depth, width, and initialization on the training process. We show that when an orthogonal-like initialization is employed, the width of the intermediate layers plays no role in gradient-based training, as long as the width is at least the input and output dimensions. We show that, under some conditions, the deeper the network is, the faster convergence is guaranteed; in an extreme case, the global optimum is reached after updating each weight matrix only once. Moreover, we find that the use of deep networks can drastically accelerate convergence compared to a depth-one network, even when the computational cost is taken into account. Numerical examples are provided to justify our theoretical findings and to demonstrate the performance of layer-wise training by BCGD.
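The abstract describes layer-wise training of a deep linear network by block coordinate gradient descent, with a block step size that requires no prior knowledge. The following Python sketch illustrates that idea under the standard least-squares loss (1/2)||W_L ... W_1 X - Y||_F^2; the function name bcgd_deep_linear and the spectral-norm-based step size are illustrative assumptions, not the paper's exact learning-rate formula.

# Minimal sketch (not the authors' code): layer-wise BCGD for a deep linear network.
import numpy as np

def bcgd_deep_linear(X, Y, hidden_widths, sweeps=50, seed=0):
    """Train factors so that W_depth @ ... @ W_1 @ X approximates Y."""
    rng = np.random.default_rng(seed)
    dims = [X.shape[0]] + list(hidden_widths) + [Y.shape[0]]
    depth = len(dims) - 1
    # Orthogonal-like initialization: each factor has orthonormal rows or columns.
    Ws = []
    for j in range(depth):
        A = rng.standard_normal((dims[j + 1], dims[j]))
        U, _, Vt = np.linalg.svd(A, full_matrices=False)
        Ws.append(U @ Vt)
    for _ in range(sweeps):
        for j in range(depth):                 # update one layer at a time
            R = X.copy()
            for k in range(j):                 # right factor: W_{j-1} ... W_1 X
                R = Ws[k] @ R
            L = np.eye(dims[-1])
            for k in range(depth - 1, j, -1):  # left factor: W_depth ... W_{j+1}
                L = L @ Ws[k]
            resid = L @ Ws[j] @ R - Y
            grad = L.T @ resid @ R.T
            # Block Lipschitz constant ||L||_2^2 * ||R||_2^2 sets the step size
            # (an illustrative stand-in for the paper's optimal learning rate).
            lip = (np.linalg.norm(L, 2) ** 2) * (np.linalg.norm(R, 2) ** 2)
            Ws[j] -= grad / lip
    return Ws

# Usage: a small regression problem with depth 3 and hidden width 8.
X = np.random.default_rng(1).standard_normal((4, 100))
Y = np.random.default_rng(2).standard_normal((3, 100))
Ws = bcgd_deep_linear(X, Y, hidden_widths=[8, 8])
prod = Ws[2] @ Ws[1] @ Ws[0]
print(np.linalg.norm(prod @ X - Y))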
Recommendations
- Non-convergence of stochastic gradient descent in the training of deep neural networks
- Effect of depth and width on local minima in deep learning
- Mean Field Analysis of Deep Neural Networks
- Mean field analysis of neural networks: a law of large numbers
- Convergence of stochastic gradient descent in deep neural network
Cites work
- Scientific article, zbMATH DE number 6378127 (title not available)
- A coordinate gradient descent method for nonsmooth separable minimization
- A randomized Kaczmarz algorithm with exponential convergence
- Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization
- Gradient descent optimizes over-parameterized deep ReLU networks
- Gradient descent with identity initialization efficiently learns positive-definite linear transformations by deep residual networks
- Randomized Kaczmarz solver for noisy linear systems
- Randomized extended Kaczmarz for solving least squares
- Randomized methods for linear constraints: convergence rates and conditioning
- Reducing the Dimensionality of Data with Neural Networks
- Theory of deep convolutional neural networks: downsampling
- Universality of deep convolutional neural networks
Cited in (4)
- Wide neural networks of any depth evolve as linear models under gradient descent
- Convergence Rates of Training Deep Neural Networks via Alternating Minimization Methods
- Research on the effect of batch normalization on VGG-like neural networks
- A convergence analysis of Nesterov's accelerated gradient method in training deep linear neural networks