An inertial Newton algorithm for deep learning
From MaRDI portal
Abstract: We introduce a new second-order inertial optimization method for machine learning called INNA. It exploits the geometry of the loss function while only requiring stochastic approximations of the function values and the generalized gradients. This makes INNA fully implementable and adapted to large-scale optimization problems such as the training of deep neural networks. The algorithm combines both gradient-descent and Newton-like behaviors as well as inertia. We prove the convergence of INNA for most deep learning problems. To do so, we provide a well-suited framework to analyze deep learning loss functions involving tame optimization in which we study a continuous dynamical system together with its discrete stochastic approximations. We prove sublinear convergence for the continuous-time differential inclusion which underlies our algorithm. Additionally, we also show how standard optimization mini-batch methods applied to non-smooth non-convex problems can yield a certain type of spurious stationary points never discussed before. We address this issue by providing a theoretical framework around the new idea of -criticality; we then give a simple asymptotic analysis of INNA. Our algorithm allows for using an aggressive learning rate of . From an empirical viewpoint, we show that INNA returns competitive results with respect to state of the art (stochastic gradient descent, ADAGRAD, ADAM) on popular deep learning benchmark problems.
Recommendations
- An adaptive gradient method with energy and momentum
- Taming Neural Networks with TUSLA: Nonconvex Learning via Adaptive Stochastic Gradient Langevin Algorithms
- Inertial stochastic PALM and applications in machine learning
- Distributed Newton Methods for Deep Neural Networks
- A general system of differential equations to model first-order adaptive algorithms
Cites work
- scientific article; zbMATH DE number 3855514 (Why is no real title available?)
- scientific article; zbMATH DE number 5348356 (Why is no real title available?)
- scientific article; zbMATH DE number 46303 (Why is no real title available?)
- scientific article; zbMATH DE number 1160037 (Why is no real title available?)
- scientific article; zbMATH DE number 1972910 (Why is no real title available?)
- scientific article; zbMATH DE number 786608 (Why is no real title available?)
- scientific article; zbMATH DE number 1405930 (Why is no real title available?)
- A Stochastic Approximation Method
- A dynamical system associated with Newton's method for parametric approximations of convex minimization problems
- A second-order gradient-like dissipative dynamical system with Hessian-driven damping. Application to optimization and mechanics.
- A stochastic quasi-Newton method for large-scale optimization
- Adaptive subgradient methods for online learning and stochastic optimization
- An investigation of Newton-sketch and subsampled Newton methods
- Analysis of recursive stochastic algorithms
- Characterizations of Łojasiewicz inequalities: Subgradient flows, talweg, convexity
- Clarke Subgradients of Stratifiable Functions
- Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning
- Convergence and dynamical behavior of the ADAM algorithm for nonconvex stochastic optimization
- Convergence of constant step stochastic gradient descent for non-smooth non-convex functions
- Convergence rate of inertial proximal algorithms with general extrapolation and proximal coefficients
- Learning representations by back-propagating errors
- Newton-type methods for non-convex optimization under inexact Hessian information
- On gradients of functions definable in o-minimal structures
- On the use of stochastic Hessian information in optimization methods for machine learning
- Optimization methods for large-scale machine learning
- Proximal Alternating Minimization and Projection Methods for Nonconvex Problems: An Approach Based on the Kurdyka-Łojasiewicz Inequality
- Proximal alternating linearized minimization for nonconvex and nonsmooth problems
- Some methods of speeding up the convergence of iteration methods
- Stochastic Approximations and Differential Inclusions
- Stochastic Methods for Composite and Weakly Convex Optimization Problems
- Stochastic subgradient method converges on tame functions
- The Łojasiewicz Inequality for Nonsmooth Subanalytic Functions with Applications to Subgradient Dynamical Systems
Cited in
(29)- Stochastic Bregman subgradient methods for nonsmooth nonconvex optimization problems
- A Riemannian dimension-reduced second-order method with application in sensor network localization
- The backtrack Hölder gradient method with application to min-max and min-min problems
- Inertial Newton algorithms avoiding strict saddle points
- First order inertial optimization algorithms with threshold effects associated with dry friction
- Fast optimization via inertial dynamics with closed-loop damping
- Extrapolated plug-and-play three-operator splitting methods for nonconvex optimization with applications to image restoration
- A constraint dissolving approach for nonsmooth optimization over the Stiefel manifold
- Convergence of iterates for first-order optimization algorithms with inertia and Hessian driven damping
- Asymptotic behavior of Newton-like inertial dynamics involving the sum of potential and nonpotential terms
- AEGD: adaptive gradient descent with energy
- Subgradient Sampling for Nonsmooth Nonconvex Minimization
- Learning-rate-free momentum SGD with reshuffling converges in nonsmooth nonconvex optimization
- Heavy ball and Nesterov accelerations with Hessian-driven damping for nonconvex optimization
- The rate of convergence of optimization algorithms obtained via discretizations of heavy ball dynamical systems for convex optimization problems
- Continuous Newton-like Methods Featuring Inertia and Variable Mass
- On the effect of perturbations in first-order optimization methods with inertia and Hessian driven damping
- On a class of second-order dynamical systems involving viscous and Hessian-driven damping
- Conservative parametric optimality and the ridge method for tame min-max problems
- A fast and simple modification of Newton's method avoiding saddle points
- Newton-type inertial algorithms for solving monotone equations Governed by sums of potential and nonpotential operators
- An Improved Unconstrained Approach for Bilevel Optimization
- Fast convex optimization \textit{via} closed-loop time scaling of gradient dynamics
- Deep block proximal linearized minimization algorithm for nonconvex inverse problems
- Convergence of inertial dynamics driven by sums of potential and nonpotential operators with implicit Newton-like damping
- Long term dynamics of the subgradient method for Lipschitz path differentiable functions
- Nonsmooth nonconvex stochastic heavy ball
- Convergence properties of stochastic proximal subgradient method in solving a class of composite optimization problems with cardinality regularizer
- scientific article; zbMATH DE number 7523120 (Why is no real title available?)
This page was built for publication: An inertial Newton algorithm for deep learning
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q5159400)