If there exists $\bar k \in \mathbb{N}$ such that $H(z_{\bar k}) = H(\bar z)$, then $H(z_{\bar k + 1}) = H(\bar z)$ and, by the first point of Lemma B.1, $x_{\bar k + 1} = x_{\bar k}$. Hence $(x_k)_{k \in \mathbb{N}}$ is stationary and, for all $k \geq \bar k$, $H(z_k) = H(\bar z)$, so the results of the theorem hold in this case (note that $\bar z \in \operatorname{crit} H$ by Lemma 5.1). Therefore, we can now assume that $H(\bar z) < H(z_k)$ for all $k > 0$, since $(H(z_k))_{k \in \mathbb{N}}$ is nonincreasing and Equation (20) holds.
From Lemma 5.1, we get $d(z_k, \omega(z_0)) \to 0$ as $k \to +\infty$. Hence, for all $\varepsilon > 0$, there exists $k_1 \in \mathbb{N}$ such that $d(z_k, \omega(z_0)) < \varepsilon$ for all $k > k_1$. Moreover, $\omega(z_0)$ is a nonempty compact set and $H$ is finite and constant on it. Therefore, we can apply the uniformization Lemma 5.2 with $\Omega = \omega(z_0)$. Since $H(z_k) < H(\bar z) + \eta$ for all $k > k_0$, both conditions of that lemma hold for any $k > l := \max(k_0, k_1)$.
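For context, the uniformization result invoked above is usually stated in the following form. The exact constants and numbering of the paper's Lemma 5.2 may differ; this is the standard Bolte–Sabach–Teboulle version, given here only as a plausible reading of the fragment, and it matches the roles of $\varepsilon$ (the distance condition secured via $k_1$) and $\eta$ (the function-value condition secured via $k_0$) in the argument:

```latex
% Uniformized Kurdyka-Lojasiewicz property (standard form).
% Here H is proper and lower semicontinuous, and Omega is compact.
\begin{lemma}[Uniformized KL property]
Let $\Omega$ be a compact set and let $H$ be proper, lower
semicontinuous, constant on $\Omega$, and satisfying the KL property
at every point of $\Omega$. Then there exist $\varepsilon > 0$,
$\eta > 0$, and a continuous concave function
$\varphi \colon [0, \eta) \to \mathbb{R}_{+}$ with $\varphi(0) = 0$,
$\varphi \in C^{1}\bigl((0, \eta)\bigr)$, and $\varphi' > 0$ on
$(0, \eta)$, such that for every $\bar z \in \Omega$ and every $z$
satisfying $d(z, \Omega) < \varepsilon$ and
$H(\bar z) < H(z) < H(\bar z) + \eta$, one has
\[
  \varphi'\bigl(H(z) - H(\bar z)\bigr)\,
  \operatorname{dist}\bigl(0, \partial H(z)\bigr) \geq 1 .
\]
\end{lemma}
```

With $\Omega = \omega(z_0)$, the two hypotheses of the lemma are exactly the conditions established for $k > k_1$ and $k > k_0$ above.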