Łojasiewicz inequality

In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let $f : U \to \mathbb{R}$ be a real analytic function on an open set $U$ in $\mathbb{R}^n$, and let $Z$ be the zero locus of $f$. Assume that $Z$ is not empty. Then for any compact set $K$ in $U$, there exist positive constants $\alpha$ and $C$ such that, for all $x$ in $K$,

$$\operatorname{dist}(x, Z)^{\alpha} \le C\,|f(x)|.$$
Here α can be large.
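As a simple illustration (not taken from the cited references), consider $f(x) = x^k$ on $\mathbb{R}$ with $k \ge 1$, whose zero locus is $Z = \{0\}$:

```latex
% Illustrative example (not from the cited references): f(x) = x^k on R, k >= 1,
% so Z = {0} and dist(x, Z) = |x|.  The inequality holds with alpha = k and C = 1:
\[
  \operatorname{dist}(x, Z)^{\alpha} = |x|^{k} = |f(x)| \qquad (\alpha = k,\ C = 1),
\]
% while no exponent alpha < k can work near x = 0, which is one way alpha is
% forced to be large.
```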

The following form of this inequality is often seen in more analytic contexts: with the same assumptions on $f$, for every $p \in U$ there is a possibly smaller open neighborhood $W$ of $p$ and constants $\theta \in (0,1)$ and $c > 0$ such that

$$|f(x) - f(p)|^{\theta} \le c\,\|\nabla f(x)\| \quad \text{for all } x \in W.$$
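Continuing the illustrative example above (again, not from the cited references), the gradient form can be checked directly for $f(x) = x^k$ with $k \ge 2$ at $p = 0$:

```latex
% Continuing the example f(x) = x^k (k >= 2) at p = 0: f(0) = 0 and f'(x) = k x^{k-1},
% so the gradient form holds with theta = (k-1)/k in (0,1) and c = 1/k:
\[
  |f(x) - f(0)|^{\theta} = |x|^{k-1} \le \tfrac{1}{k}\,\bigl|k\,x^{k-1}\bigr| = c\,|f'(x)|.
\]
```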

Polyak inequality


A special case of the Łojasiewicz inequality, due to Boris Polyak, is commonly used to prove linear convergence of gradient descent algorithms. This section is based on Karimi, Nutini & Schmidt (2016) and Liu, Zhu & Belkin (2022).

Definitions


$f$ is a function of type $\mathbb{R}^d \to \mathbb{R}$, and has a continuous derivative $\nabla f$.

$X^*$ is the subset of $\mathbb{R}^d$ on which $f$ achieves its global minimum (if one exists). Throughout this section we assume such a global minimum value $f^*$ exists, unless otherwise stated. The optimization objective is to find some point $x^*$ in $X^*$.

$L$ and $\mu$ are positive constants.

$\nabla f$ is $L$-Lipschitz continuous iff

$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d.$$

$f$ is $\mu$-strongly convex iff $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2$ for all $x, y \in \mathbb{R}^d$.

$f$ is $\mu$-PL (where "PL" means "Polyak–Łojasiewicz") iff $\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,(f(x) - f^*)$ for all $x \in \mathbb{R}^d$.
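As a concrete sketch of these definitions (an illustration with assumed synthetic data, not drawn from the cited papers), the following Python snippet checks the $\mu$-PL inequality numerically for a rank-deficient least-squares objective, which is PL but not strongly convex:

```python
import numpy as np

rng = np.random.default_rng(0)
# A rank-deficient A makes f PL but not strongly convex (the minimizer is non-unique).
A = rng.standard_normal((10, 5)) @ np.diag([3.0, 2.0, 1.0, 0.0, 0.0])
b = rng.standard_normal(10)

def f(x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

def grad_f(x):
    return A.T @ (A @ x - b)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # one point of the optimum set X*
f_star = f(x_star)                              # global minimum value f*

# A valid PL constant here is the smallest nonzero eigenvalue of A^T A
# (this follows from basic property 4 below, with g(y) = ||y - b||^2 / 2).
eigs = np.linalg.eigvalsh(A.T @ A)
mu = eigs[eigs > 1e-9].min()

for _ in range(1000):
    x = 10 * rng.standard_normal(5)             # the PL inequality should hold everywhere
    assert 0.5 * np.linalg.norm(grad_f(x)) ** 2 >= mu * (f(x) - f_star) - 1e-6
```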

Basic properties


Theorem — 1. If $f$ is $\mu$-PL, then it is invex.

2. If $\nabla f$ is $L$-Lipschitz continuous, then $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$ for all $x, y \in \mathbb{R}^d$.

3. If $f$ is $\mu$-strongly convex, then it is $\mu$-PL.

4. If $g$ is $\mu$-strongly convex and $A$ is linear, then $f := g \circ A$ is $\mu\,\sigma_{\min}(A)^2$-PL, where $\sigma_{\min}(A)$ is the smallest nonzero singular value of $A$.

5. (quadratic growth) If $f$ is $\mu$-PL, $x$ is a point, and $x_p$ is the point on the optimum set $X^*$ that is closest to $x$ in L2-norm, then $f(x) - f^* \ge \frac{\mu}{2}\|x - x_p\|^2$.

Proof

1. If $\nabla f(x) = 0$, the $\mu$-PL inequality gives $f(x) = f^*$, so every stationary point is a global minimum; by definition, a function with this property is invex.

2. Integrate $\frac{d}{dt} f(x + t(y - x))$ for $t \in [0, 1]$ and use the $L$-Lipschitz continuity of $\nabla f$.

3. By definition, $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2$. Minimizing the left side over $y$ gives $f^*$, while minimizing the right side over $y$ gives $f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2$. Since the left side dominates the right side pointwise, combining the two minima gives

$$f^* \ge f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2,$$

which is the $\mu$-PL inequality.

4. Write $f(x) = g(Ax)$. By $\mu$-strong convexity of $g$,

$$f(y) = g(Ay) \ge g(Ax) + \langle \nabla g(Ax), A(y - x) \rangle + \frac{\mu}{2}\|A(y - x)\|^2.$$

Now, since $\nabla f(x) = A^{\mathsf T} \nabla g(Ax)$, we have

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|A(y - x)\|^2.$$

Set $x_p$ to be the projection of $x$ to the optimum subspace. Then $x_p - x$ is orthogonal to the kernel of $A$, so $\|A(x_p - x)\| \ge \sigma_{\min}(A)\,\|x_p - x\|$, and taking $y = x_p$ gives

$$f^* \ge f(x) + \langle \nabla f(x), x_p - x \rangle + \frac{\mu\,\sigma_{\min}(A)^2}{2}\|x_p - x\|^2.$$

Varying $x_p - x$ on the right side to minimize the right side yields $f^* \ge f(x) - \frac{1}{2\mu\,\sigma_{\min}(A)^2}\|\nabla f(x)\|^2$, which is the desired result.

5. Let $g := \sqrt{f - f^*}$. For any $x$ with $f(x) > f^*$, we have $\nabla g(x) = \frac{\nabla f(x)}{2\sqrt{f(x) - f^*}}$, so by $\mu$-PL,

$$\|\nabla g(x)\| = \frac{\|\nabla f(x)\|}{2\sqrt{f(x) - f^*}} \ge \sqrt{\mu/2}.$$

In particular, we see that $\nabla g$ is a vector field on $\{f > f^*\}$ with size at least $\sqrt{\mu/2}$. Define a gradient flow along $-\nabla g / \|\nabla g\|$ with constant unit velocity, starting at $x$:

$$\frac{d}{dt}\,x(t) = -\frac{\nabla g(x(t))}{\|\nabla g(x(t))\|}, \qquad x(0) = x.$$

Because $g$ is bounded below by $0$, and $\frac{d}{dt}\,g(x(t)) = -\|\nabla g(x(t))\| \le -\sqrt{\mu/2}$, the gradient flow terminates on the zero set $\{g = 0\} = X^*$ at a finite time

$$T \le \frac{g(x)}{\sqrt{\mu/2}} = \sqrt{\frac{2\,(f(x) - f^*)}{\mu}}.$$

The path length is $T$, since the velocity is constantly 1.

Since $x(T)$ is on the zero set, and $x_p$ is the point of $X^*$ closest to $x$, we have

$$\|x - x_p\| \le \|x - x(T)\| \le T \le \sqrt{\frac{2\,(f(x) - f^*)}{\mu}},$$

which is the desired result.
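A worked example (illustrative, not from the cited sources): for a quadratic $f(x) = \frac{1}{2}x^{\mathsf T}Hx$ with $H$ symmetric positive semidefinite and possibly singular, properties 3–5 can be verified directly in the eigenbasis of $H$:

```latex
% Worked example: f(x) = x^T H x / 2 with H symmetric PSD (possibly singular),
% written in the eigenbasis of H with eigenvalues lambda_i >= 0.
% Here f^* = 0, X^* = ker H, and lambda_min^+ denotes the smallest nonzero eigenvalue.
\[
  \tfrac{1}{2}\|\nabla f(x)\|^2
    = \tfrac{1}{2}\sum_i \lambda_i^2 x_i^2
    \;\ge\; \lambda_{\min}^{+} \cdot \tfrac{1}{2}\sum_i \lambda_i x_i^2
    = \lambda_{\min}^{+}\bigl(f(x) - f^*\bigr),
\]
% so f is PL with mu = lambda_min^+ even when H is singular (so f is not strongly
% convex).  Quadratic growth (property 5) holds with the same constant, since x_p
% is the projection of x onto ker H:
\[
  f(x) - f^*
    = \tfrac{1}{2}\sum_{\lambda_i > 0} \lambda_i x_i^2
    \;\ge\; \tfrac{\lambda_{\min}^{+}}{2} \sum_{\lambda_i > 0} x_i^2
    = \tfrac{\lambda_{\min}^{+}}{2}\,\|x - x_p\|^2 .
\]
```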

Gradient descent


Theorem (linear convergence of gradient descent) — If $f$ is $\mu$-PL and $\nabla f$ is $L$-Lipschitz, then gradient descent with constant step size $\gamma \in (0, 2/L)$ converges linearly as

$$f(x_{t+1}) - f^* \le \left(1 - 2\mu\gamma\Bigl(1 - \frac{L\gamma}{2}\Bigr)\right)(f(x_t) - f^*).$$

The convergence is the fastest when $\gamma = 1/L$, at which point

$$f(x_{t+1}) - f^* \le \left(1 - \frac{\mu}{L}\right)(f(x_t) - f^*).$$

Proof

Since $\nabla f$ is $L$-Lipschitz, we have the parabolic upper bound

$$f(x_{t+1}) \le f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2}\|x_{t+1} - x_t\|^2.$$

Plugging in the gradient descent step $x_{t+1} = x_t - \gamma \nabla f(x_t)$,

$$f(x_{t+1}) \le f(x_t) - \gamma\Bigl(1 - \frac{L\gamma}{2}\Bigr)\|\nabla f(x_t)\|^2.$$

Thus, by the $\mu$-PL inequality,

$$f(x_{t+1}) - f^* \le \left(1 - 2\mu\gamma\Bigl(1 - \frac{L\gamma}{2}\Bigr)\right)(f(x_t) - f^*).$$

Corollary — 1. $x_t$ converges to the optimum set $X^*$ at a rate of $O\bigl((1 - \mu/L)^{t/2}\bigr)$ (by the quadratic growth property).

2. If $f$ is $\mu$-PL, not constant, and $\nabla f$ is $L$-Lipschitz, then $\mu \le L$.

3. Under the same conditions, gradient descent with optimal step size (which might be found by line-searching) satisfies

$$f(x_{t+1}) - f^* \le \left(1 - \frac{\mu}{L}\right)(f(x_t) - f^*).$$
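A minimal numerical sketch of the theorem (an illustration under assumed data, not from the cited sources), using a singular quadratic that is $\mu$-PL but not strongly convex:

```python
import numpy as np

# f(x) = x^T H x / 2 with a singular PSD H: mu-PL with mu = smallest nonzero
# eigenvalue of H, and grad f is L-Lipschitz with L = largest eigenvalue of H.
H = np.diag([3.0, 1.0, 0.0])
mu, L, f_star = 1.0, 3.0, 0.0

def f(x):
    return 0.5 * x @ H @ x

def grad_f(x):
    return H @ x

rng = np.random.default_rng(1)
x = rng.standard_normal(3)
gap = f(x) - f_star
for t in range(100):
    x = x - (1.0 / L) * grad_f(x)                   # constant step size 1/L
    new_gap = f(x) - f_star
    assert new_gap <= (1 - mu / L) * gap + 1e-12    # predicted per-step contraction
    gap = new_gap
print(f"optimality gap after 100 steps: {gap:.3e}")
```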

Coordinate descent


The coordinate descent algorithm first samples a random coordinate $i_t$ uniformly from $\{1, \dots, d\}$, then performs a gradient descent step in that coordinate:

$$x_{t+1} = x_t - \frac{1}{L}\,\nabla_{i_t} f(x_t)\, e_{i_t},$$

where $e_{i_t}$ is the $i_t$-th standard basis vector and $L$ is the coordinate-wise Lipschitz constant defined below.

Theorem — Assume that $f$ is $\mu$-PL, and that $\nabla f$ is $L$-Lipschitz at each coordinate, meaning that

$$|\nabla_i f(x + h e_i) - \nabla_i f(x)| \le L\,|h| \quad \text{for all } x \in \mathbb{R}^d,\ h \in \mathbb{R},\ i \in \{1, \dots, d\}.$$

Then $\mathbb{E}[f(x_t)]$ converges linearly for all $t$, with

$$\mathbb{E}[f(x_t)] - f^* \le \left(1 - \frac{\mu}{L d}\right)^{t} (f(x_0) - f^*).$$

Proof

By the same argument as before, the coordinate-wise parabolic upper bound gives

$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\,|\nabla_{i_t} f(x_t)|^2.$$

Taking the expectation with respect to $i_t$, we have

$$\mathbb{E}_{i_t}[f(x_{t+1})] \le f(x_t) - \frac{1}{2Ld}\,\|\nabla f(x_t)\|^2.$$

Plugging in the $\mu$-PL inequality, we have

$$\mathbb{E}_{i_t}[f(x_{t+1})] - f^* \le \left(1 - \frac{\mu}{Ld}\right)(f(x_t) - f^*).$$

Iterating the process, we have the desired result.
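A numerical sketch of the coordinate descent rate (illustrative, under assumed data; not from the cited sources):

```python
import numpy as np

# Randomized coordinate descent on f(x) = x^T H x / 2 with H = diag(3, 1, 0),
# so mu = 1, coordinate-wise Lipschitz constant L = 3, dimension d = 3, f* = 0.
# The theorem predicts E[f(x_t)] - f* <= (1 - mu/(L d))^t (f(x_0) - f*).
H = np.diag([3.0, 1.0, 0.0])
mu, L, d, steps = 1.0, 3.0, 3, 50

def f(x):
    return 0.5 * x @ H @ x

rng = np.random.default_rng(2)
x0 = rng.standard_normal(d)
final_gaps = []
for run in range(2000):                      # average many runs to approximate the expectation
    x = x0.copy()
    for t in range(steps):
        i = rng.integers(d)                  # sample a coordinate uniformly
        x[i] -= (1.0 / L) * (H @ x)[i]       # coordinate gradient step with step size 1/L
    final_gaps.append(f(x))
print("empirical E[f(x_t)] - f* :", np.mean(final_gaps))
print("predicted upper bound    :", (1 - mu / (L * d)) ** steps * f(x0))
```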

Stochastic gradient descent


In stochastic gradient descent, we have a function to minimize $f(x)$, but we cannot sample its gradient directly. Instead, we sample a random gradient $\nabla f_i(x)$, where the $f_i$ are such that

$$f(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x).$$

For example, in typical machine learning, $x$ are the parameters of the neural network, $f_i(x)$ is the loss incurred on the $i$-th training data point, and $f(x)$ is the average loss over all $N$ training data points.

The gradient update step is

$$x_{t+1} = x_t - \gamma_t\,\nabla f_{i_t}(x_t),$$

where $i_t$ is sampled uniformly at random and $(\gamma_t)_t$ is a sequence of learning rates (the learning rate schedule).
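As a minimal sketch of this setup in code (illustrative, with assumed synthetic data; not from the cited sources):

```python
import numpy as np

# Finite-sum objective f(x) = (1/N) sum_i f_i(x) with per-example least-squares
# losses f_i(x) = (a_i . x - b_i)^2 / 2, and a single SGD update step.
rng = np.random.default_rng(3)
N, d = 100, 5
A = rng.standard_normal((N, d))              # rows a_i
b = rng.standard_normal(N)

def grad_f_i(x, i):
    """Gradient of the i-th per-example loss f_i."""
    return (A[i] @ x - b[i]) * A[i]

def sgd_step(x, t, schedule):
    """One update  x_{t+1} = x_t - gamma_t * grad f_{i_t}(x_t)  with i_t uniform."""
    i_t = rng.integers(N)
    return x - schedule(t) * grad_f_i(x, i_t)

x = np.zeros(d)
for t in range(1000):
    x = sgd_step(x, t, schedule=lambda t: 0.01)   # constant learning rate schedule
```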

Theorem — If each $\nabla f_i$ is $L$-Lipschitz, $f$ is $\mu$-PL, and $f$ has global minimum $f^*$, then

$$\mathbb{E}[f(x_{t+1})] - f^* \le (1 - 2\gamma_t \mu)\,\bigl(\mathbb{E}[f(x_t)] - f^*\bigr) + \frac{L\gamma_t^2}{2}\,\mathbb{E}\!\left[\|\nabla f_{i_t}(x_t)\|^2\right].$$

We can also write it using the variance of the gradient in L2 norm: for $\gamma_t \le 2/L$,

$$\mathbb{E}[f(x_{t+1})] - f^* \le \left(1 - 2\gamma_t\mu\Bigl(1 - \frac{L\gamma_t}{2}\Bigr)\right)\bigl(\mathbb{E}[f(x_t)] - f^*\bigr) + \frac{L\gamma_t^2}{2}\,\mathbb{E}\!\left[\|\nabla f_{i_t}(x_t) - \nabla f(x_t)\|^2\right].$$

Proof

Because all $\nabla f_i$ are $L$-Lipschitz, so is $\nabla f = \frac{1}{N}\sum_i \nabla f_i$. We thus have the parabolic upper bound

$$f(x_{t+1}) \le f(x_t) - \gamma_t\,\langle \nabla f(x_t), \nabla f_{i_t}(x_t) \rangle + \frac{L\gamma_t^2}{2}\,\|\nabla f_{i_t}(x_t)\|^2.$$

Now, take the expectation over $i_t$ (which gives $\mathbb{E}[\langle \nabla f(x_t), \nabla f_{i_t}(x_t) \rangle] = \mathbb{E}[\|\nabla f(x_t)\|^2]$), and use the fact that $f$ is $\mu$-PL. This gives the first equation.

The second equation is shown similarly by noting that

$$\mathbb{E}\!\left[\|\nabla f_{i_t}(x_t)\|^2\right] = \mathbb{E}\!\left[\|\nabla f(x_t)\|^2\right] + \mathbb{E}\!\left[\|\nabla f_{i_t}(x_t) - \nabla f(x_t)\|^2\right].$$

As it stands, the proposition is difficult to apply directly. It can be made easier to use with some further assumptions.

The second moment on the right can be removed by assuming a uniform upper bound. That is, if there exists some $C > 0$ such that during the SGD process we have $\mathbb{E}\!\left[\|\nabla f_{i_t}(x_t)\|^2\right] \le C^2$ for all $t$, then

$$\mathbb{E}[f(x_{t+1})] - f^* \le (1 - 2\gamma_t \mu)\,\bigl(\mathbb{E}[f(x_t)] - f^*\bigr) + \frac{L C^2 \gamma_t^2}{2}.$$

Similarly, if $\mathbb{E}\!\left[\|\nabla f_{i_t}(x_t) - \nabla f(x_t)\|^2\right] \le \sigma^2$ for all $t$, then

$$\mathbb{E}[f(x_{t+1})] - f^* \le \left(1 - 2\gamma_t\mu\Bigl(1 - \frac{L\gamma_t}{2}\Bigr)\right)\bigl(\mathbb{E}[f(x_t)] - f^*\bigr) + \frac{L \sigma^2 \gamma_t^2}{2}.$$

Learning rate schedules


For a constant learning rate schedule, with $\gamma_t = \gamma \in (0, \tfrac{1}{2\mu})$, we have

$$\mathbb{E}[f(x_{t+1})] - f^* \le (1 - 2\gamma\mu)\,\bigl(\mathbb{E}[f(x_t)] - f^*\bigr) + \frac{L C^2 \gamma^2}{2}.$$

By induction, we have

$$\mathbb{E}[f(x_t)] - f^* \le (1 - 2\gamma\mu)^t\,(f(x_0) - f^*) + \frac{L C^2 \gamma}{4\mu}.$$

We see that the loss decreases in expectation first exponentially, but then stops decreasing, which is caused by the $\frac{L C^2 \gamma}{4\mu}$ term. In short, because the gradient descent steps are too large, the variance in the stochastic gradient starts to dominate, and $x_t$ starts doing a random walk in the vicinity of $X^*$.

For a decreasing learning rate schedule with $\gamma_t = \frac{2t + 1}{2\mu (t + 1)^2}$, we have

$$\mathbb{E}[f(x_t)] - f^* \le \frac{L C^2}{2\mu^2 t} = O(1/t).$$
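A numerical sketch contrasting the two schedules (illustrative, with assumed synthetic data; not from the cited sources): with a constant rate the expected loss plateaus at a noise floor, while the decreasing schedule keeps improving at roughly $O(1/t)$.

```python
import numpy as np

# SGD on a small least-squares finite sum, comparing a constant learning rate
# with the decreasing schedule gamma_t = (2t + 1) / (2 mu (t + 1)^2).
rng = np.random.default_rng(4)
N, d = 200, 5
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])
mu = np.linalg.eigvalsh(A.T @ A / N).min()     # PL constant (A has full column rank here)

def run_sgd(schedule, steps=20000):
    x = np.zeros(d)
    for t in range(steps):
        i = rng.integers(N)
        x = x - schedule(t) * (A[i] @ x - b[i]) * A[i]
    return f(x) - f_star

print("constant   schedule:", run_sgd(lambda t: 0.01))
print("decreasing schedule:", run_sgd(lambda t: (2 * t + 1) / (2 * mu * (t + 1) ** 2)))
```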

References

  • Bierstone, Edward; Milman, Pierre D. (1988), "Semianalytic and subanalytic sets", Publications Mathématiques de l'IHÉS, 67 (67): 5–42, doi:10.1007/BF02699126, ISSN 1618-1913, MR 0972342, S2CID 56006439
  • Ji, Shanyu; Kollár, János; Shiffman, Bernard (1992), "A global Łojasiewicz inequality for algebraic varieties", Transactions of the American Mathematical Society, 329 (2): 813–818, doi:10.2307/2153965, ISSN 0002-9947, JSTOR 2153965, MR 1046016
  • Karimi, Hamed; Nutini, Julie; Schmidt, Mark (2016). "Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak–Łojasiewicz Condition". arXiv:1608.04636 [cs.LG].
  • Liu, Chaoyue; Zhu, Libin; Belkin, Mikhail (2022-07-01). "Loss landscapes and optimization in over-parameterized non-linear systems and neural networks". Applied and Computational Harmonic Analysis. Special Issue on Harmonic Analysis and Machine Learning. 59: 85–116. arXiv:2003.00307. doi:10.1016/j.acha.2021.12.009. ISSN 1063-5203.