On large batch training and sharp minima: a Fokker-Planck perspective
DOI: 10.1007/s42519-020-00120-9
zbMATH Open: 1451.90104
arXiv: 2112.00987
OpenAlex: W3043999652
MaRDI QID: Q828491
FDO: Q828491
Authors: Xiaowu Dai, Yuhua Zhu
Publication date: 8 January 2021
Published in: Journal of Statistical Theory and Practice
Full work available at URL: https://arxiv.org/abs/2112.00987
Recommendations
- On the diffusion approximation of nonconvex stochastic gradient descent
- Why does large batch training result in poor generalization? A comprehensive explanation and a better strategy from the viewpoint of stochastic optimization
- The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima
- A diffusion approximation theory of momentum stochastic gradient descent in nonconvex optimization
- Stochastic gradient descent with noise of machine learning type. I: Discrete time analysis
Keywords: Fokker-Planck equation; stochastic gradient algorithm; sharp minima; deep neural network; large batch training
MSC: Numerical mathematical programming methods (65K05); Stochastic programming (90C15); PDEs in connection with statistics (35Q62)
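For orientation (standard background, not drawn from this record): the "Fokker-Planck perspective" of the title refers to the usual diffusion approximation of stochastic gradient descent. A minimal sketch, assuming a learning rate $\eta$, batch size $B$, loss $f$, and gradient-noise covariance $\Sigma$ (all symbols are illustrative assumptions, not notation taken from the paper):

\[
d\theta_t = -\nabla f(\theta_t)\,dt + \sqrt{\tfrac{\eta}{B}}\,\Sigma(\theta_t)^{1/2}\,dW_t,
\qquad
\partial_t \rho = \nabla\cdot(\rho\,\nabla f) + \tfrac{\eta}{2B}\,\nabla\cdot\nabla\cdot(\Sigma\,\rho).
\]

Under this sketch, increasing the batch size $B$ shrinks the diffusion term of the Fokker-Planck equation governing the parameter density $\rho$, which is the standard mechanism by which large-batch training concentrates near, and escapes less readily from, sharp minima.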
Cites Work
- Title not available
- Metastability in reversible diffusion processes. I: Sharp asymptotics for capacities and exit times
- Metastability in reversible diffusion processes. II: Precise asymptotics for small eigenvalues
- Kramers' law: validity, derivations and generalisations
- Hypocoercivity
- Stochastic processes and applications. Diffusion processes, the Fokker-Planck and Langevin equations
- Deep relaxation: partial differential equations for optimizing deep neural networks
- Optimization methods for large-scale machine learning
- Title not available
- Title not available
- Stochastic modified equations for the asynchronous stochastic gradient descent
Cited In (4)
- The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima
- On the diffusion approximation of nonconvex stochastic gradient descent
- Why does large batch training result in poor generalization? A comprehensive explanation and a better strategy from the viewpoint of stochastic optimization
- An empirical study into finding optima in stochastic optimization of neural networks