On large batch training and sharp minima: a Fokker-Planck perspective
Abstract: We study the statistical properties of the dynamic trajectory of stochastic gradient descent (SGD). We approximate mini-batch SGD and momentum SGD by stochastic differential equations (SDEs), and we exploit this continuous formulation, together with the theory of Fokker-Planck equations, to develop new results on the escaping phenomenon and on the relationship between large batch sizes and sharp minima. In particular, we find that in the asymptotic regime the stochastic process solution tends to converge to flatter minima regardless of the batch size, but we rigorously prove that the convergence rate depends on the batch size. These results are validated empirically on various datasets and models.
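For context, the following is a minimal sketch of the SDE approximation and Fokker-Planck equation that are standard in this line of work; the notation (loss f, learning rate \eta, batch size B, gradient-noise covariance \Sigma, Brownian motion W, density \rho) is assumed here and may differ from the paper's. Mini-batch SGD with step size \eta is approximated by the SDE
\[
d\theta_t = -\nabla f(\theta_t)\,dt + \sqrt{\tfrac{\eta}{B}}\;\Sigma(\theta_t)^{1/2}\,dW_t ,
\]
and the density \rho(\theta, t) of \theta_t evolves according to the Fokker-Planck equation
\[
\partial_t \rho = \nabla\cdot\bigl(\rho\,\nabla f\bigr) + \frac{\eta}{2B}\sum_{i,j}\partial_{\theta_i}\partial_{\theta_j}\bigl(\Sigma_{ij}\,\rho\bigr).
\]
With constant isotropic noise \Sigma = \sigma^2 I, the stationary solution is the Gibbs density \rho_\infty(\theta) \propto \exp\bigl(-2Bf(\theta)/(\eta\sigma^2)\bigr), so the batch size enters as an inverse temperature; this is the standard setting in which escape times from sharp minima are analysed via Kramers-type estimates.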
Recommendations
- On the diffusion approximation of nonconvex stochastic gradient descent
- Why does large batch training result in poor generalization? A comprehensive explanation and a better strategy from the viewpoint of stochastic optimization
- The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima
- A Diffusion Approximation Theory of Momentum Stochastic Gradient Descent in Nonconvex Optimization
- Stochastic gradient descent with noise of machine learning type. I: Discrete time analysis
Cites work
- Scientific article, zbMATH DE number 503287 (no title available)
- Scientific article, zbMATH DE number 6860839 (no title available)
- Scientific article, zbMATH DE number 5681750 (no title available)
- Deep relaxation: partial differential equations for optimizing deep neural networks
- Hypocoercivity
- Kramers' law: validity, derivations and generalisations
- Metastability in reversible diffusion processes. I: Sharp asymptotics for capacities and exit times
- Metastability in reversible diffusion processes. II: Precise asymptotics for small eigenvalues
- Optimization methods for large-scale machine learning
- Stochastic modified equations for the asynchronous stochastic gradient descent
- Stochastic processes and applications. Diffusion processes, the Fokker-Planck and Langevin equations
Cited in (4)
- The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima
- On the diffusion approximation of nonconvex stochastic gradient descent
- An empirical study into finding optima in stochastic optimization of neural networks
- Why does large batch training result in poor generalization? A comprehensive explanation and a better strategy from the viewpoint of stochastic optimization