Deep relaxation: partial differential equations for optimizing deep neural networks (Q2319762)

The output of a deep neural network is defined as \(y(x;\xi)=\sigma(x^p\sigma(x^{p-1}\dots\sigma(x^1\xi)\dots))\), a nested composition of linear maps and nonlinear activation functions \(\sigma\), depending on inputs \(\xi\in \mathbb{R}^d\) and weights \(x\in \mathbb{R}^n\), which are the parameters of the network. In supervised training, the goal is to minimize a certain loss function \(f(x)\). The background section contains a short discussion of stochastic gradient descent (SGD) and continuous-time stochastic gradient descent methods, together with a presentation of some references. The third section concentrates on a few results on the PDE interpretation of local entropy, such as the derivation of the viscous Hamilton-Jacobi PDE and the Hopf-Lax formula for the Hamilton-Jacobi equation. The fourth section deals with the derivation of local entropy via homogenization of SDEs. Results on stochastic control for a variant of local entropy are then presented. Since it has been proved in a previous section that the regularized loss function is the solution of a viscous Hamilton-Jacobi equation, one can apply semi-concavity estimates from PDE theory and quantify the amount of smoothing. Some examples illustrating the widening of local minima are presented. The last two sections are devoted to a comparison of the various algorithms presented in the paper. The aim is to show that the considered collection of PDE methods improves results on modern datasets such as MNIST and CIFAR. Throughout the article, the authors make intensive use of their previously published results, comparing them and investigating and improving some of the details.
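To make the smoothing effect mentioned above concrete, here is a minimal sketch of the Hopf-Lax formula \(u(x,t)=\min_y\{f(y)+|x-y|^2/(2t)\}\), which solves the non-viscous Hamilton-Jacobi equation \(u_t=-\tfrac12|\nabla u|^2\) with \(u(x,0)=f(x)\). It is only a one-dimensional grid-based illustration, not the algorithm of the paper; the function name `hopf_lax`, the toy loss and the value of `t` are assumptions chosen for the example.

```python
import numpy as np

def hopf_lax(f_vals, grid, t):
    """Discrete Hopf-Lax formula on a 1D grid (hypothetical helper).

    Returns u(x, t) = min_y [ f(y) + (x - y)^2 / (2 t) ] for every x in grid,
    where the minimum is taken over the grid points y.
    """
    # diff[i, j] = grid[i] - grid[j], i.e. x - y for all pairs of grid points
    diff = grid[:, None] - grid[None, :]
    return np.min(f_vals[None, :] + diff**2 / (2.0 * t), axis=1)

# Toy non-convex "loss" (an assumption for illustration, not from the paper)
grid = np.linspace(-3.0, 3.0, 601)
f_vals = 0.5 * grid**2 + np.sin(5.0 * grid)

# Smoothed loss at time t = 0.5
u_vals = hopf_lax(f_vals, grid, t=0.5)
```

In this sketch the smoothed function lies below the original loss and has the same minimum value, while the valleys around the minimizers become wider. The viscous equation discussed in the paper adds a Laplacian term and is not covered by this formula.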

reviewed by

Claudia Simionescu-Badea

zbMATH Keywords

deep learning

partial differential equations

stochastic gradient descent

neural networks