Machine Learning Notes - Gradient Descent
Introduction ¶
These notes collect the standard first-order update rules used for training neural networks: vanilla SGD, SGD with momentum, RMSProp, and Adam. Each rule is written as a function that takes a parameter tensor w, its gradient dw, and a config dict of hyperparameters and cached state, and returns the updated parameter together with the config.
Stochastic Gradient Descent (SGD) ¶
def sgd(w, dw, config=None):
    '''
    Performs vanilla stochastic gradient descent.
    config format:
    - learning_rate: Scalar learning rate.
    '''
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    # Step in the direction of the negative gradient.
    w -= config['learning_rate'] * dw
    return w, config
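As a quick usage sketch (the 1-D quadratic loss below is illustrative, not part of the notes), the update can be driven in a plain loop; on $L(w) = \frac{1}{2}w^2$ the gradient is simply $w$:
# Minimal usage sketch (assumed toy problem): minimize L(w) = 0.5 * w**2, so dL/dw = w.
w = 3.0
config = None
for step in range(5):
    dw = w                          # gradient of the toy 1-D quadratic loss
    w, config = sgd(w, dw, config)  # config keeps the default learning rate of 1e-2
    print(step, w)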
SGD + Momentum ¶
SGD with momentum maintains a velocity $v$ that accumulates an exponentially decaying sum of past gradient steps (with momentum coefficient $\mu$), and steps along the velocity rather than the raw gradient:
$$ v = \mu \times v - \text{learning_rate} \times \frac{\partial{L}}{\partial{w}} \\ w := w + v $$
import torch

def sgd_momentum(w, dw, config=None):
    '''
    Performs stochastic gradient descent with momentum.
    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.
      Setting momentum = 0 reduces to sgd.
    - velocity: A tensor of the same shape as w and dw used to store a
      moving average of the gradients.
    '''
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    # Read the current velocity (zero on the first step).
    v = config.get('velocity', torch.zeros_like(w))
    # Update the velocity, then step along it.
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    config['velocity'] = v
    return next_w, config
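The velocity lives in the returned config, so the same dict has to be passed back in on the next call; a small sketch of that hand-off (the constant toy gradient is assumed for illustration):
# Sketch: the velocity is carried between steps through the config dict.
w = torch.zeros(3)
config = None
for step in range(3):
    dw = torch.ones(3)                       # constant toy gradient
    w, config = sgd_momentum(w, dw, config)  # reusing config lets 'velocity' accumulate
    print(step, config['velocity'])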
RMSProp ¶
RMSProp is an update rule that sets per-parameter learning rates by using a running average of the second moments of the gradients.
$$ \text{cache} = \text{decay_rate} \times \text{cache} + (1 - \text{decay_rate}) \times \frac{\partial{L}}{\partial{w}} .* \frac{\partial{L}}{\partial{w}} \\ w := w - \frac{\text{learning_rate} \times \frac{\partial{L}}{\partial{w}}}{\sqrt{\text{cache}} + \epsilon} $$
def rmsprop(w, dw, config=None):
    '''
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.
    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    '''
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', torch.zeros_like(w))
    # Exponentially decaying average of squared gradients.
    config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw * dw
    # Scale each parameter's step by the root of its cache.
    next_w = w - config['learning_rate'] * dw / (torch.sqrt(config['cache']) + config['epsilon'])
    return next_w, config
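One way to see the per-parameter scaling is to feed in a gradient whose components differ wildly in magnitude; in the sketch below (toy values assumed), both coordinates end up taking a step of roughly the same size after a single update:
# Sketch: per-parameter normalization after one RMSProp update (toy gradient).
w = torch.zeros(2)
dw = torch.tensor([10.0, 0.1])  # components differing by a factor of 100
next_w, config = rmsprop(w, dw)
print(w - next_w)               # both steps come out near 0.1 despite the 100x gradient gap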
Adam ¶
Adam extends RMSProp with a moving average of the gradient itself, similar to momentum, and a bias correction mechanism to prevent large steps at the start of optimization. The simplified update, without bias correction, is: $$ m = \beta_1 \times m + (1-\beta_1) \times \frac{\partial{L}}{\partial{w}} \\ v = \beta_2 \times v + (1-\beta_2)\times(\frac{\partial{L}}{\partial{w}} .* \frac{\partial{L}}{\partial{w}}) \\ w := w - \frac{\text{learning_rate} \times m}{\sqrt{v} + \epsilon} $$
The full Adam update also includes a bias correction mechanism, which compensates for the fact that in the first few time steps the vectors $m$ and $v$ are initialized at zero and therefore biased toward zero, before they fully “warm up”. With the bias correction, the update looks as follows: $$ m = \beta_1 \times m + (1-\beta_1)\times \frac{\partial{L}}{\partial{w}} \\ m_t = \frac{m}{1-\beta_1^t} \\ v = \beta_2 \times v + (1-\beta_2) \times (\frac{\partial{L}}{\partial{w}} .* \frac{\partial{L}}{\partial{w}}) \\ v_t = \frac{v}{1-\beta_2^t} \\ w := w - \frac{\text{learning_rate} \times m_t}{\sqrt{v_t} + \epsilon} $$ where $t$ is the iteration counter, starting at $1$.
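As a quick check of what the correction does on the very first step, write $g$ for $\frac{\partial{L}}{\partial{w}}$ (a shorthand used only here). With $m$ and $v$ starting at zero and $t = 1$: $$ m = (1-\beta_1)\times g, \quad m_t = \frac{m}{1-\beta_1} = g, \qquad v = (1-\beta_2)\times(g .* g), \quad v_t = \frac{v}{1-\beta_2} = g .* g $$ so the corrected first step has magnitude roughly $\text{learning_rate}$ in every coordinate. Without the correction, the first step would carry an extra factor of $\frac{1-\beta_1}{\sqrt{1-\beta_2}} \approx 3.2$ with the defaults $\beta_1 = 0.9$ and $\beta_2 = 0.999$, which is exactly the kind of oversized early step the correction prevents.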
def adam(w, dw, config=None):
    '''
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.
    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    '''
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', torch.zeros_like(w))
    config.setdefault('v', torch.zeros_like(w))
    config.setdefault('t', 0)
    # Increment the step counter first so bias correction uses t >= 1.
    config['t'] = config['t'] + 1
    b1, b2, t = config['beta1'], config['beta2'], config['t']
    # First moment: moving average of the gradient, with bias correction.
    m = b1 * config['m'] + (1 - b1) * dw
    mt = m / (1 - b1 ** t)
    # Second moment: moving average of the squared gradient, with bias correction.
    v = b2 * config['v'] + (1 - b2) * dw * dw
    vt = v / (1 - b2 ** t)
    # Parameter step uses the bias-corrected moments.
    next_w = w - config['learning_rate'] * mt / (torch.sqrt(vt) + config['epsilon'])
    config['m'] = m
    config['v'] = v
    return next_w, config
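Since all four update rules share the same (w, dw, config) interface, they can be swapped freely; the loop below is an illustrative sketch on the same toy quadratic loss as above (nothing here is prescribed by the notes):
# Sketch: drive any of the four update rules on L(w) = 0.5 * ||w||^2, where dL/dw = w.
def run(update_rule, steps=100):
    w = torch.tensor([5.0, -3.0])
    config = None
    for _ in range(steps):
        dw = w.clone()                          # gradient of the toy quadratic loss
        w, config = update_rule(w, dw, config)  # same calling convention for every rule
    return w

for rule in (sgd, sgd_momentum, rmsprop, adam):
    print(rule.__name__, run(rule))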