On large batch training and sharp minima: a Fokker-Planck perspective
Abstract: We study the statistical properties of the dynamic trajectory of stochastic gradient descent (SGD). We approximate mini-batch SGD and momentum SGD by stochastic differential equations (SDEs), and we exploit this continuous formulation, together with the theory of Fokker-Planck equations, to develop new results on the escaping phenomenon and on the relationship between large batch sizes and sharp minima. In particular, we find that in the asymptotic regime the stochastic process solution tends to converge to flatter minima regardless of the batch size, but we rigorously prove that the convergence rate depends on the batch size. These results are validated empirically on various datasets and models.
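For context, the following is a minimal sketch of the SDE approximation and Fokker-Planck equation that are standard in this line of work; the notation (loss f, learning rate \eta, batch size B, gradient-noise covariance \Sigma, Brownian motion W, density \rho) is assumed here and may differ from the paper's. Mini-batch SGD with step size \eta is approximated by the SDE
\[
d\theta_t = -\nabla f(\theta_t)\,dt + \sqrt{\tfrac{\eta}{B}}\;\Sigma(\theta_t)^{1/2}\,dW_t ,
\]
and the density \rho(\theta, t) of \theta_t evolves according to the Fokker-Planck equation
\[
\partial_t \rho = \nabla\cdot\bigl(\rho\,\nabla f\bigr) + \frac{\eta}{2B}\sum_{i,j}\partial_{\theta_i}\partial_{\theta_j}\bigl(\Sigma_{ij}\,\rho\bigr).
\]
With constant isotropic noise \Sigma = \sigma^2 I, the stationary solution is the Gibbs density \rho_\infty(\theta) \propto \exp\bigl(-2Bf(\theta)/(\eta\sigma^2)\bigr), so the batch size enters as an inverse temperature; this is the standard setting in which escape times from sharp minima are analysed via Kramers-type estimates.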
Recommendations
- On the diffusion approximation of nonconvex stochastic gradient descent
- Why does large batch training result in poor generalization? A comprehensive explanation and a better strategy from the viewpoint of stochastic optimization
- The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima
- A Diffusion Approximation Theory of Momentum Stochastic Gradient Descent in Nonconvex Optimization
- Stochastic gradient descent with noise of machine learning type. I: Discrete time analysis
Cites work
- Scientific article, zbMATH DE number 503287 (no title available)
- Scientific article, zbMATH DE number 6860839 (no title available)
- Scientific article, zbMATH DE number 5681750 (no title available)
- Deep relaxation: partial differential equations for optimizing deep neural networks
- Hypocoercivity
- Kramers' law: validity, derivations and generalisations
- Metastability in reversible diffusion processes. I: Sharp asymptotics for capacities and exit times
- Metastability in reversible diffusion processes. II: Precise asymptotics for small eigenvalues
- Optimization methods for large-scale machine learning
- Stochastic modified equations for the asynchronous stochastic gradient descent
- Stochastic processes and applications. Diffusion processes, the Fokker-Planck and Langevin equations
Cited in (4)
- The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima
- On the diffusion approximation of nonconvex stochastic gradient descent
- An empirical study into finding optima in stochastic optimization of neural networks
- Why does large batch training result in poor generalization? A comprehensive explanation and a better strategy from the viewpoint of stochastic optimization