Machine Learning Notes - Gradient Descent

Stochastic Gradient Descent (SGD)

$$w := w - \text{learning_rate} \times \frac{\partial{L}}{\partial{w}}$$
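
A minimal NumPy sketch of this update, assuming the weights and gradient are plain arrays; the function name, the default learning rate, and the toy quadratic loss used in the example are illustrative.

```python
import numpy as np

def sgd_update(w, dw, learning_rate=1e-3):
    """One vanilla SGD step: w := w - learning_rate * dL/dw."""
    return w - learning_rate * dw

# Toy example: loss L = 0.5 * ||w||^2, so the gradient is dL/dw = w.
w = np.array([1.0, -2.0])
dw = w
w = sgd_update(w, dw, learning_rate=0.1)
print(w)  # -> [ 0.9 -1.8]
```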

SGD + Momentum

$$ v = \mu \times v - \text{learning_rate} \times \frac{\partial{L}}{\partial{w}} \\ w := w + v $$
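
A sketch of the momentum update under the same assumptions; `mu` is the momentum coefficient, with 0.9 used here only as a typical value.

```python
import numpy as np

def momentum_update(w, dw, v, learning_rate=1e-3, mu=0.9):
    """v = mu * v - learning_rate * dL/dw; w := w + v."""
    v = mu * v - learning_rate * dw
    w = w + v
    return w, v

w = np.array([1.0, -2.0])
v = np.zeros_like(w)       # velocity starts at zero
for _ in range(3):
    dw = w                 # gradient of the toy loss L = 0.5 * ||w||^2
    w, v = momentum_update(w, dw, v, learning_rate=0.1)
print(w)                   # weights after three momentum steps
```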

RMSProp

RMSProp is an update rule that sets per-parameter learning rates by using a running average of the second moments of the gradients.

$$ \text{cache} = \text{decay_rate} \times \text{cache} + (1 - \text{decay_rate}) \times \frac{\partial{L}}{\partial{w}} .* \frac{\partial{L}}{\partial{w}}\\ w := w - \frac{\text{learning_rate} \times \frac{\partial{L}}{\partial{w}}}{(\sqrt{\text{cache}} + \epsilon)} $$
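
A sketch of the RMSProp update; `decay_rate=0.99` and `eps=1e-8` are common defaults chosen for illustration.

```python
import numpy as np

def rmsprop_update(w, dw, cache, learning_rate=1e-3, decay_rate=0.99, eps=1e-8):
    """Keep a running average of squared gradients and scale the step by it."""
    cache = decay_rate * cache + (1 - decay_rate) * dw * dw
    w = w - learning_rate * dw / (np.sqrt(cache) + eps)
    return w, cache

w = np.array([1.0, -2.0])
cache = np.zeros_like(w)   # second-moment running average starts at zero
for _ in range(3):
    dw = w                 # gradient of the toy loss L = 0.5 * ||w||^2
    w, cache = rmsprop_update(w, dw, cache, learning_rate=0.1)
print(w)                   # weights after three RMSProp steps
```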

Adam

Adam extends RMSProp with a first-order gradient cache similar to momentum, and a bias correction mechanism to prevent large steps at the start of optimization. The simplified update (without the bias correction) looks as follows: $$ m = \beta_1 \times m + (1-\beta_1) \times \frac{\partial{L}}{\partial{w}} \\ v = \beta_2 \times v + (1-\beta_2)\times(\frac{\partial{L}}{\partial{w}} .* \frac{\partial{L}}{\partial{w}}) \\ w := w - \frac{\text{learning_rate} \times m}{\sqrt{v} + \epsilon} \\ $$

The full Adam update also includes a bias correction mechanism, which compensates for the fact that in the first few time steps the vectors $m$ and $v$ are initialized at zero and therefore biased toward zero, before they fully “warm up”. With the bias correction mechanism, the update looks as follows: $$ m = \beta_1 \times m + (1-\beta_1)\times \frac{\partial{L}}{\partial{w}} \\ m_t = \frac{m}{1-\beta_1^t} \\ v = \beta_2 \times v + (1-\beta_2) \times (\frac{\partial{L}}{\partial{w}} .* \frac{\partial{L}}{\partial{w}}) \\ v_t = \frac{v}{1-\beta_2^t} \\ w := w - \frac{\text{learning_rate} \times m_t}{\sqrt{v_t} + \epsilon} \\ $$ where $t$ is the iteration counter, starting at $1$ and incremented on every update.
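
A sketch of the full Adam update with bias correction; `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are the commonly used defaults, chosen here for illustration.

```python
import numpy as np

def adam_update(w, dw, m, v, t, learning_rate=1e-3,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; t is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * dw            # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * dw * dw       # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 4):      # t starts at 1 so the bias correction is well defined
    dw = w                 # gradient of the toy loss L = 0.5 * ||w||^2
    w, m, v = adam_update(w, dw, m, v, t, learning_rate=0.1)
print(w)                   # weights after three Adam steps
```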

This blog is converted from machine-learning-gradient-descent.ipynb
Written on May 13, 2021