On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks

arXiv: 1912.00018 | MaRDI QID: Q6330127 | FDO: Q6330127

Authors: Umut Şimşekli, Mert Gürbüzbalaban, Thanh Huy Nguyen, Gaël L. Richard, Levent Sagun

Publication date: 29 November 2019

Abstract: The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT, which suggests that the GN converges to a heavy-tailed $\alpha$-stable random vector, where the tail-index $\alpha$ determines the heavy-tailedness of the distribution. Accordingly, we propose to analyze SGD as a discretization of an SDE driven by a Lévy motion. Such SDEs can incur 'jumps', which force the SDE and its discretization to transition from narrow minima to wider minima, as proven by existing metastability theory and the extensions that we proved recently. In this study, under the $\alpha$-stable GN assumption, we further establish an explicit connection between the convergence rate of SGD to a local minimum and the tail-index $\alpha$. To validate the $\alpha$-stable assumption, we conduct experiments on common deep learning scenarios and show that in all settings, the GN is highly non-Gaussian and admits heavy tails. We investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.
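
As a rough illustration of the abstract's central idea, the following Python sketch draws symmetric $\alpha$-stable noise and uses it in a gradient-descent-style recursion. It is not taken from the paper. The Chambers-Mallows-Stuck sampler, the quadratic toy objective, the noise scaling `eta ** (1 / alpha) * sigma`, and the names `sample_sas` and `levy_sgd_path` are all assumptions made for this example.

```python
import numpy as np


def sample_sas(alpha, size, rng):
    """Draw standard symmetric alpha-stable samples (0 < alpha <= 2).

    Uses the Chambers-Mallows-Stuck construction. With unit scale,
    alpha = 2 gives a Gaussian with variance 2 and alpha = 1 gives
    a standard Cauchy distribution.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                # unit-rate exponential
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))


def levy_sgd_path(grad, w0, eta, alpha, sigma, steps, rng):
    """Iterate w <- w - eta * grad(w) + eta**(1/alpha) * sigma * S_k.

    S_k is standard symmetric alpha-stable noise. The scaling is an
    assumed Euler-type discretization of a Levy-driven SDE, chosen
    for illustration only.
    """
    noise = sample_sas(alpha, steps, rng)
    path = np.empty(steps + 1)
    path[0] = w0
    for k in range(steps):
        path[k + 1] = (path[k] - eta * grad(path[k])
                       + eta ** (1.0 / alpha) * sigma * noise[k])
    return path


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy objective f(w) = w**2 / 2, so grad(w) = w (an assumption).
    for alpha in (2.0, 1.5):
        path = levy_sgd_path(lambda w: w, 1.0, 0.01, alpha, 0.1, 5000, rng)
        jumps = np.abs(np.diff(path))
        print(f"alpha={alpha}: largest single-step move = {jumps.max():.4f}")
```

With `alpha = 2.0` the noise is Gaussian and the iterates move in small increments. Lowering `alpha` makes occasional large jumps far more likely, which is the mechanism the abstract associates with transitions out of narrow minima.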

