Gradient descent optimizes over-parameterized deep ReLU networks
From MaRDI portal
Publication: 2183586
DOI: 10.1007/s10994-019-05839-6
zbMath: 1494.68245
OpenAlex: W2981407587
Wikidata: Q126992055
Scholia: Q126992055
MaRDI QID: Q2183586
Dongruo Zhou, Yuan Cao, Difan Zou, Quanquan Gu
Publication date: 27 May 2020
Published in: Machine Learning
Full work available at URL: https://doi.org/10.1007/s10994-019-05839-6
Mathematics Subject Classification: Analysis of algorithms and problem complexity (68Q25); Artificial neural networks and deep learning (68T07)
Related Items (43)
Memory Capacity of Neural Networks with Threshold and Rectified Linear Unit Activations
Effects of depth, width, and initialization: A convergence analysis of layer-wise training for deep linear neural networks
Deep learning: a statistical viewpoint
Surprises in high-dimensional ridgeless least squares interpolation
Generalization error of random feature and kernel methods: hypercontractivity and kernel matrix concentration
Loss landscapes and optimization in over-parameterized non-linear systems and neural networks
Revisiting Landscape Analysis in Deep Neural Networks: Eliminating Decreasing Paths to Infinity
Particle dual averaging: optimization of mean field neural network with global convergence rate analysis*
A proof of convergence for gradient descent in the training of artificial neural networks for constant target functions
Benign overfitting in linear regression
A proof of convergence for stochastic gradient descent in the training of artificial neural networks with ReLU activation for constant target functions
Full error analysis for the training of deep neural networks
Gradient descent optimizes over-parameterized deep ReLU networks
On the Benefit of Width for Neural Networks: Disappearance of Basins
Towards interpreting deep neural networks via layer behavior understanding
Deep learning in random neural fields: numerical experiments via neural tangent kernel
Non-differentiable saddle points and sub-optimal local minima exist for deep ReLU networks
Black holes and the loss landscape in machine learning
A rigorous framework for the mean field limit of multilayer neural networks
Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation
Convergence rates for shallow neural networks learned by gradient descent
On stochastic roundoff errors in gradient descent with low-precision computation
A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics
FedHD: communication-efficient federated learning from hybrid data
Growing axons: greedy learning of neural networks with application to function approximation
Normalization effects on deep neural networks
Greedy training algorithms for neural networks and applications to PDEs
Optimization for deep learning: an overview
Plateau Phenomenon in Gradient Descent Training of RELU Networks: Explanation, Quantification, and Avoidance
Unnamed Item
Non-convergence of stochastic gradient descent in the training of deep neural networks
Linearized two-layers neural networks in high dimension
Gradient convergence of deep learning-based numerical methods for BSDEs
Every Local Minimum Value Is the Global Minimum Value of Induced Model in Nonconvex Machine Learning
On the Effect of the Activation Function on the Distribution of Hidden Nodes in a Deep Network
Normalization effects on shallow neural networks and related asymptotic expansions
Wide neural networks of any depth evolve as linear models under gradient descent*
Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup*
Unnamed Item
Stabilize deep ResNet with a sharp scaling factor \(\tau\)
The interpolation phase transition in neural networks: memorization and generalization under lazy training
Provably training overparameterized neural network classifiers with non-convex constraints
Suboptimal Local Minima Exist for Wide Neural Networks with Smooth Activations