Stochastic Gradient Descent (SGD) problems

Below we describe the most important problems of the SGD method for optimization.

High condition number of the loss function

If the ratio of the largest to the smallest eigenvalue of the Hessian matrix (the condition number) is high, the learning rate has to be made much smaller, because the updates oscillate along the directions with high curvature while making very little progress along the directions with low curvature.

https://miro.medium.com/max/4308/1*ImvekfhM6sXo2IyAdslKLg.png
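As a minimal sketch of this effect (the quadratic below and its parameters are illustrative assumptions, not from the original text), plain gradient descent on an ill-conditioned quadratic oscillates along the steep axis while barely moving along the shallow one:

```python
import numpy as np

# Hypothetical ill-conditioned quadratic: f(x) = 0.5 * (x[0]**2 + 100 * x[1]**2).
# The Hessian eigenvalues are 1 and 100, so the condition number is 100.
def grad(x):
    return np.array([x[0], 100.0 * x[1]])

x = np.array([1.0, 1.0])
lr = 0.019  # must stay below 2/100 for the steep axis, so it is tiny for the shallow one

for step in range(20):
    x = x - lr * grad(x)
    print(step, x)  # x[1] flips sign every step (oscillation), x[0] shrinks very slowly
```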

Local minima or saddle points

The gradient at a local minimum or a saddle point is zero, so SGD gets stuck there and cannot reach the optimal point.

https://www.mathsisfun.com/calculus/images/function-min-max.svg
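As an illustration (the saddle function below is an assumed example), gradient descent started on the symmetry axis of $f(x, y) = x^2 - y^2$ converges straight to the saddle point at the origin and stalls there, because the gradient vanishes:

```python
import numpy as np

# Hypothetical saddle: f(x, y) = x**2 - y**2 has a saddle point at (0, 0).
def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

p = np.array([1.0, 0.0])  # start exactly on the y = 0 axis
lr = 0.1

for _ in range(100):
    p = p - lr * grad(p)

# p is now essentially (0, 0): the gradient there is zero, so the update stops,
# even though moving along y would keep decreasing f.
print(p)
```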

Noisy gradient

In SGD, we compute the gradient on mini-batches, which only approximates the gradient over the whole training data rather than giving the exact one. This noise can increase the number of iterations needed for convergence.

https://pythonmachinelearning.pro/wp-content/uploads/2017/09/GD-v-SGD-825x321.png.webp
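A small sketch of this noise, assuming a made-up least-squares problem: mini-batch gradient estimates computed at the same point differ from the full-data gradient, and this estimation error is what slows convergence down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical least-squares problem: loss(w) = mean((X @ w - y) ** 2).
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)

def gradient(w, X_batch, y_batch):
    return 2.0 * X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)

w = np.zeros(5)
full_grad = gradient(w, X, y)

# Each mini-batch gives a different, noisy estimate of the full gradient.
for _ in range(3):
    idx = rng.choice(len(y), size=32, replace=False)
    mini_batch_grad = gradient(w, X[idx], y[idx])
    print(np.linalg.norm(mini_batch_grad - full_grad))  # non-zero estimation error
```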

SGD + Momentum

In SGD + Momentum, we treat the gradient as an acceleration rather than a velocity: the gradient updates a running velocity vector, and it is this velocity, not the raw gradient, that moves the parameters.

$$ v_0 = 0 \\ v_{t+1} = \rho v_t + \nabla f(x_t) \\ x_{t+1} = x_t - \alpha v_{t+1} $$

This new approach helps mitigate all of the problems described above: the accumulated velocity damps oscillations along high-curvature directions, carries the iterate past local minima and saddle points where the instantaneous gradient is zero, and averages out the noise in mini-batch gradients.
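A minimal sketch of the update rule above (the test quadratic and the hyperparameter values are illustrative assumptions):

```python
import numpy as np

def sgd_momentum(grad_fn, x0, lr=0.01, rho=0.9, steps=100):
    """SGD + Momentum as written above: v accumulates gradients, x moves against v."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)  # v_0 = 0
    for _ in range(steps):
        v = rho * v + grad_fn(x)  # v_{t+1} = rho * v_t + grad f(x_t)
        x = x - lr * v            # x_{t+1} = x_t - alpha * v_{t+1}
    return x

# On the ill-conditioned quadratic from before, momentum converges with a small learning rate.
grad = lambda x: np.array([x[0], 100.0 * x[1]])
print(sgd_momentum(grad, [1.0, 1.0], lr=0.005, rho=0.9, steps=500))  # close to (0, 0)
```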

Nesterov Momentum

Nesterov momentum is a variant that evaluates the gradient at the look-ahead point $x_t + \rho v_t$ instead of at the current point, which often gives faster and more stable convergence.

$$ v_0 = 0 \\ v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t) \\ x_{t+1} = x_t + v_{t+1} $$
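A minimal sketch of the Nesterov update above, reusing the same hypothetical quadratic:

```python
import numpy as np

def nesterov_momentum(grad_fn, x0, lr=0.01, rho=0.9, steps=100):
    """Nesterov momentum: the gradient is taken at the look-ahead point x + rho * v."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)  # v_0 = 0
    for _ in range(steps):
        v = rho * v - lr * grad_fn(x + rho * v)  # v_{t+1} = rho*v_t - alpha*grad f(x_t + rho*v_t)
        x = x + v                                # x_{t+1} = x_t + v_{t+1}
    return x

grad = lambda x: np.array([x[0], 100.0 * x[1]])
print(nesterov_momentum(grad, [1.0, 1.0], lr=0.005, rho=0.9, steps=500))  # close to (0, 0)
```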

RMSProp