Gradient descent optimizes over-parameterized deep ReLU networks
From MaRDI portal
Publication: 2183586
DOI: 10.1007/s10994-019-05839-6
zbMath: 1494.68245
OpenAlex: W2981407587
Wikidata: Q126992055
Scholia: Q126992055
MaRDI QID: Q2183586
Dongruo Zhou, Yuan Cao, Difan Zou, Quanquan Gu
Publication date: 27 May 2020
Published in: Machine Learning
Full work available at URL: https://doi.org/10.1007/s10994-019-05839-6
Mathematics Subject Classification: Analysis of algorithms and problem complexity (68Q25); Artificial neural networks and deep learning (68T07)
Related Items (43)
Memory Capacity of Neural Networks with Threshold and Rectified Linear Unit Activations
Effects of depth, width, and initialization: A convergence analysis of layer-wise training for deep linear neural networks
Deep learning: a statistical viewpoint
Surprises in high-dimensional ridgeless least squares interpolation
Generalization error of random feature and kernel methods: hypercontractivity and kernel matrix concentration
Loss landscapes and optimization in over-parameterized non-linear systems and neural networks
Revisiting Landscape Analysis in Deep Neural Networks: Eliminating Decreasing Paths to Infinity
Particle dual averaging: optimization of mean field neural network with global convergence rate analysis*
A proof of convergence for gradient descent in the training of artificial neural networks for constant target functions
Benign overfitting in linear regression
A proof of convergence for stochastic gradient descent in the training of artificial neural networks with ReLU activation for constant target functions
Full error analysis for the training of deep neural networks
Gradient descent optimizes over-parameterized deep ReLU networks
On the Benefit of Width for Neural Networks: Disappearance of Basins
Towards interpreting deep neural networks via layer behavior understanding
Deep learning in random neural fields: numerical experiments via neural tangent kernel
Non-differentiable saddle points and sub-optimal local minima exist for deep ReLU networks
Black holes and the loss landscape in machine learning
A rigorous framework for the mean field limit of multilayer neural networks
Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation
Convergence rates for shallow neural networks learned by gradient descent
On stochastic roundoff errors in gradient descent with low-precision computation
A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics
FedHD: communication-efficient federated learning from hybrid data
Growing axons: greedy learning of neural networks with application to function approximation
Normalization effects on deep neural networks
Greedy training algorithms for neural networks and applications to PDEs
Optimization for deep learning: an overview
Plateau Phenomenon in Gradient Descent Training of RELU Networks: Explanation, Quantification, and Avoidance
Unnamed Item
Non-convergence of stochastic gradient descent in the training of deep neural networks
Linearized two-layers neural networks in high dimension
Gradient convergence of deep learning-based numerical methods for BSDEs
Every Local Minimum Value Is the Global Minimum Value of Induced Model in Nonconvex Machine Learning
On the Effect of the Activation Function on the Distribution of Hidden Nodes in a Deep Network
Normalization effects on shallow neural networks and related asymptotic expansions
Wide neural networks of any depth evolve as linear models under gradient descent*
Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup*
Unnamed Item
Stabilize deep ResNet with a sharp scaling factor \(\tau\)
The interpolation phase transition in neural networks: memorization and generalization under lazy training
Provably training overparameterized neural network classifiers with non-convex constraints
Suboptimal Local Minima Exist for Wide Neural Networks with Smooth Activations