A Diffusion Approximation Theory of Momentum Stochastic Gradient Descent in Nonconvex Optimization
Publication: 5084492
DOI: 10.1287/STSY.2021.0083
zbMATH Open: 1489.90097
arXiv: 1802.05155
OpenAlex: W3210625965
MaRDI QID: Q5084492
FDO: Q5084492
Authors: Tianyi Liu, Zhehui Chen, Enlu Zhou, Tuo Zhao
Publication date: 24 June 2022
Published in: Stochastic Systems
Abstract: The Momentum Stochastic Gradient Descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning, e.g., training deep neural networks and variational Bayesian inference. Despite its empirical success, there is still a lack of theoretical understanding of the convergence properties of MSGD. To fill this gap, we propose to analyze the algorithmic behavior of MSGD by diffusion approximations for nonconvex optimization problems with strict saddle points and isolated local optima. Our study shows that momentum helps escape from saddle points but hurts convergence within the neighborhood of optima (unless the step size or the momentum parameter is annealed). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks.
Full work available at URL: https://arxiv.org/abs/1802.05155
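For orientation only, a minimal sketch of the heavy-ball MSGD update discussed in the abstract, run on a toy strict-saddle objective. This is not taken from the paper: the function name msgd, the step size eta, the momentum parameter beta, and the Gaussian gradient-noise model are illustrative assumptions.

```python
# Illustrative sketch (assumptions noted above): heavy-ball momentum SGD
#   v_{k+1} = beta * v_k - eta * (grad f(x_k) + noise),  x_{k+1} = x_k + v_{k+1}
# applied to the toy objective f(x) = x1^2 - x2^2, whose origin is a strict saddle.
import numpy as np

def msgd(grad, x0, eta=0.01, beta=0.9, noise_std=0.1, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x) + noise_std * rng.standard_normal(x.shape)  # stochastic gradient
        v = beta * v - eta * g                                   # momentum (velocity) update
        x = x + v                                                # parameter update
    return x

# Toy strict-saddle example: f(x) = x1^2 - x2^2 with gradient (2*x1, -2*x2).
x_final = msgd(lambda x: np.array([2 * x[0], -2 * x[1]]), x0=[1.0, 1e-3])
print(x_final)  # the x2 coordinate grows in magnitude, i.e., the iterate leaves the saddle
```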
Recommendations
- On the diffusion approximation of nonconvex stochastic gradient descent
- Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods
- Nonasymptotic estimates for stochastic gradient Langevin dynamics under local conditions in nonconvex optimization
- Convergence rates for the stochastic gradient descent method for non-convex objective functions
- Stochastic momentum methods for non-convex learning without bounded assumptions
- Convergence and dynamical behavior of the ADAM algorithm for nonconvex stochastic optimization
- Stochastic generalized gradient method for nonconvex nonsmooth stochastic optimization
- Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation
- On the Convergence of Stochastic Gradient Descent for Nonlinear Ill-Posed Problems
- On the discrepancy principle for stochastic gradient descent
MSC classifications
- Quadratic programming (90C20)
- Stochastic programming (90C15)
- Methods of reduced gradient type (90C52)
Cites Work
- Title not available
- A Stochastic Approximation Method
- Stochastic differential equations. An introduction with applications.
- Accelerated gradient methods for nonconvex nonlinear and stochastic programming
- Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming
- Stochastic approximation with two time scales
- The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning
- Some methods of speeding up the convergence of iteration methods
- Handbook of simulation optimization
- A mean field view of the landscape of two-layer neural networks
Cited In (15)
- On large batch training and sharp minima: a Fokker-Planck perspective
- Global convergence of stochastic gradient Hamiltonian Monte Carlo for nonconvex stochastic optimization: nonasymptotic performance bounds and momentum-based acceleration
- Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation
- Stochastic momentum methods for non-convex learning without bounded assumptions
- Stochastic optimization with momentum: convergence, fluctuations, and traps avoidance
- Stochastic gradient Hamiltonian Monte Carlo for non-convex learning
- Classical algebraic geometry. Abstracts from the workshop held June 20--26, 2021 (hybrid meeting)
- Convergence of the RMSProp deep learning method with penalty for nonconvex optimization
- Convergence and dynamical behavior of the ADAM algorithm for nonconvex stochastic optimization
- On Nonconvex Optimization for Machine Learning
- Convergence of the Momentum Method for Semialgebraic Functions with Locally Lipschitz Gradients
- On the diffusion approximation of nonconvex stochastic gradient descent
- Switched diffusion processes for non-convex optimization and saddle points search
- Conjugate-gradient-based Adam for nonconvex stochastic optimization and its application to deep learning
- On the influence of momentum acceleration on online learning