On the diffusion approximation of nonconvex stochastic gradient descent
Abstract: We study the Stochastic Gradient Descent (SGD) method for nonconvex optimization problems from the point of view of approximating diffusion processes. Using the weak form of the master equation for the probability evolution, we prove rigorously that a diffusion process approximates the SGD algorithm weakly. In the small-step-size regime and in the presence of omnidirectional noise, this weak approximating diffusion suggests the following dynamics for an SGD iteration started from a local minimizer (resp.~saddle point): it escapes in a number of iterations that depends exponentially (resp.~almost linearly) on the inverse step size. The results are obtained using the theory of random perturbations of dynamical systems (large-deviations theory for local minimizers and exit theory for unstable stationary points). In addition, we discuss the effect of batch size in training deep neural networks, and we find that a small batch size helps SGD escape unstable stationary points and sharp minimizers. Our theory indicates that one should increase the batch size at a later stage so that SGD becomes trapped in flat minimizers, which generalize better.
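To make the weak approximation concrete, here is a minimal numerical sketch, not the paper's construction: it assumes a one-dimensional quadratic toy loss f(x) = x^2/2 with Gaussian gradient noise, and the parameters eta, sigma, and the horizon K are illustrative choices. It compares the marginal statistics of the SGD iterates x_{k+1} = x_k - eta * g(x_k) with those of the approximating SDE dX_t = -f'(X_t) dt + sqrt(eta) * sigma dW_t, simulated by Euler--Maruyama over the same effective time T = K * eta.

```python
import numpy as np

# Minimal illustrative sketch (assumptions: quadratic loss f(x) = x^2/2,
# stochastic gradient g(x) = x + sigma * xi with xi ~ N(0, 1), and
# arbitrarily chosen eta, sigma, K -- none of these come from the paper).
rng = np.random.default_rng(0)
eta, sigma, K, n_paths = 0.01, 1.0, 1000, 100_000

# SGD paths: x_{k+1} = x_k - eta * (f'(x_k) + sigma * xi_k), started at x = 2.
x = np.full(n_paths, 2.0)
for _ in range(K):
    x -= eta * (x + sigma * rng.standard_normal(n_paths))

# Approximating diffusion dX_t = -X_t dt + sqrt(eta) * sigma dW_t,
# simulated by Euler--Maruyama with a finer step dt = eta / 10
# over the same effective horizon T = K * eta.
dt, n_steps = eta / 10.0, 10 * K
y = np.full(n_paths, 2.0)
for _ in range(n_steps):
    y += -y * dt + np.sqrt(eta) * sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

# Weak approximation: the marginal moments should nearly agree; both are
# close to the stationary variance eta * sigma^2 / 2 = 0.005 for this toy.
print(f"SGD       mean {x.mean():+.4f}  var {x.var():.4f}")
print(f"diffusion mean {y.mean():+.4f}  var {y.var():.4f}")
```

On this convex toy the two marginal distributions nearly coincide; the paper's escape-time asymptotics concern nonconvex landscapes, where the same diffusion picture yields exit times from local minimizers and saddle points.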
Recommendations
- A Diffusion Approximation Theory of Momentum Stochastic Gradient Descent in Nonconvex Optimization
- Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation
- On large batch training and sharp minima: a Fokker-Planck perspective
- Convergence of constant step stochastic gradient descent for non-smooth non-convex functions
- Analysis of stochastic gradient descent in continuous time
Cited in (22)
- Numerical analysis for inchworm Monte Carlo method: Sign problem and error growth
- Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation
- A convergence analysis of the perturbed compositional gradient flow: averaging principle and normal deviations
- Momentum-based accelerated mirror descent stochastic approximation for robust topology optimization under stochastic loads
- On large batch training and sharp minima: a Fokker-Planck perspective
- Stochastic modified flows for Riemannian stochastic gradient descent
- A Continuous-Time Analysis of Distributed Stochastic Gradient
- Random batch methods (RBM) for interacting particle systems
- Stochastic differential equation approximations of generative adversarial network training and its long-run behavior
- A Diffusion Approximation Theory of Momentum Stochastic Gradient Descent in Nonconvex Optimization
- On the fast convergence of random perturbations of the gradient flow
- Stochastic modified equations and dynamics of stochastic gradient algorithms. I: Mathematical foundations
- A probability approximation framework: Markov process approach
- The effective noise of stochastic gradient descent
- Applications of Fokker-Planck equations in machine learning algorithms
- A random batch method for efficient ensemble forecasts of multiscale turbulent systems
- Second-Order Guarantees of Stochastic Gradient Descent in Nonconvex Optimization
- On uniform-in-time diffusion approximation for stochastic gradient descent
- Approximation to stochastic variance reduced gradient Langevin dynamics by stochastic delay differential equations
- Analysis of stochastic gradient descent in continuous time
- A Random-Batch Monte Carlo Method for Many-Body Systems with Singular Kernels
- One-dimensional system arising in stochastic gradient descent