Machine Learning Notes - Gradient Descent
Introduction ¶
These notes collect the standard first-order update rules used for training neural networks: vanilla SGD, SGD with momentum, RMSProp, and Adam. Each rule is written as a function that takes a parameter tensor w, its gradient dw, and a config dict of hyperparameters and cached state, and returns the updated parameter together with the config.
Stochastic Gradient Descent (SGD) ¶
def sgd(w, dw, config=None):
    '''
    Performs vanilla stochastic gradient descent.
    config format:
    - learning_rate: Scalar learning rate.
    '''
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    # Step in the direction of the negative gradient.
    w -= config['learning_rate'] * dw
    return w, config
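As a quick usage sketch (the 1-D quadratic loss below is illustrative, not part of the notes), the update can be driven in a plain loop; on $L(w) = \frac{1}{2}w^2$ the gradient is simply $w$:
# Minimal usage sketch (assumed toy problem): minimize L(w) = 0.5 * w**2, so dL/dw = w.
w = 3.0
config = None
for step in range(5):
    dw = w                          # gradient of the toy 1-D quadratic loss
    w, config = sgd(w, dw, config)  # config keeps the default learning rate of 1e-2
    print(step, w)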
SGD + Momentum ¶
SGD with momentum maintains a velocity $v$ that accumulates an exponentially decaying sum of past gradient steps (with momentum coefficient $\mu$), and steps along the velocity rather than the raw gradient:
$$ v = \mu \times v - \text{learning_rate} \times \frac{\partial{L}}{\partial{w}} \\ w := w + v $$
import torch

def sgd_momentum(w, dw, config=None):
    '''
    Performs stochastic gradient descent with momentum.
    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.
      Setting momentum = 0 reduces to sgd.
    - velocity: A tensor of the same shape as w and dw used to store a
      moving average of the gradients.
    '''
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    # Read the current velocity (zero on the first step).
    v = config.get('velocity', torch.zeros_like(w))
    # Update the velocity, then step along it.
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    config['velocity'] = v
    return next_w, config
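The velocity lives in the returned config, so the same dict has to be passed back in on the next call; a small sketch of that hand-off (the constant toy gradient is assumed for illustration):
# Sketch: the velocity is carried between steps through the config dict.
w = torch.zeros(3)
config = None
for step in range(3):
    dw = torch.ones(3)                       # constant toy gradient
    w, config = sgd_momentum(w, dw, config)  # reusing config lets 'velocity' accumulate
    print(step, config['velocity'])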
RMSProp ¶
RMSProp is an update rule that sets per-parameter learning rates by using a running average of the second moments of the gradients.
$$ \text{cache} = \text{decay_rate} \times \text{cache} + (1 - \text{decay_rate}) \times \frac{\partial{L}}{\partial{w}} .* \frac{\partial{L}}{\partial{w}} \\ w := w - \frac{\text{learning_rate} \times \frac{\partial{L}}{\partial{w}}}{\sqrt{\text{cache}} + \epsilon} $$
def rmsprop(w, dw, config=None):
    '''
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.
    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    '''
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', torch.zeros_like(w))
    # Exponentially decaying average of squared gradients.
    config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw * dw
    # Scale each parameter's step by the root of its cache.
    next_w = w - config['learning_rate'] * dw / (torch.sqrt(config['cache']) + config['epsilon'])
    return next_w, config
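One way to see the per-parameter scaling is to feed in a gradient whose components differ wildly in magnitude; in the sketch below (toy values assumed), both coordinates end up taking a step of roughly the same size after a single update:
# Sketch: per-parameter normalization after one RMSProp update (toy gradient).
w = torch.zeros(2)
dw = torch.tensor([10.0, 0.1])  # components differing by a factor of 100
next_w, config = rmsprop(w, dw)
print(w - next_w)               # both steps come out near 0.1 despite the 100x gradient gap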
Adam ¶
Adam extends RMSProp with a moving average of the gradient itself, similar to momentum, and a bias correction mechanism to prevent large steps at the start of optimization. The simplified update, without bias correction, is: $$ m = \beta_1 \times m + (1-\beta_1) \times \frac{\partial{L}}{\partial{w}} \\ v = \beta_2 \times v + (1-\beta_2)\times(\frac{\partial{L}}{\partial{w}} .* \frac{\partial{L}}{\partial{w}}) \\ w := w - \frac{\text{learning_rate} \times m}{\sqrt{v} + \epsilon} $$
The full Adam update also includes a bias correction mechanism, which compensates for the fact that in the first few time steps the vectors $m$ and $v$ are initialized at zero and therefore biased toward zero, before they fully “warm up”. With the bias correction, the update looks as follows: $$ m = \beta_1 \times m + (1-\beta_1)\times \frac{\partial{L}}{\partial{w}} \\ m_t = \frac{m}{1-\beta_1^t} \\ v = \beta_2 \times v + (1-\beta_2) \times (\frac{\partial{L}}{\partial{w}} .* \frac{\partial{L}}{\partial{w}}) \\ v_t = \frac{v}{1-\beta_2^t} \\ w := w - \frac{\text{learning_rate} \times m_t}{\sqrt{v_t} + \epsilon} $$ where $t$ is the iteration counter, starting at $1$.
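As a quick check of what the correction does on the very first step, write $g$ for $\frac{\partial{L}}{\partial{w}}$ (a shorthand used only here). With $m$ and $v$ starting at zero and $t = 1$: $$ m = (1-\beta_1)\times g, \quad m_t = \frac{m}{1-\beta_1} = g, \qquad v = (1-\beta_2)\times(g .* g), \quad v_t = \frac{v}{1-\beta_2} = g .* g $$ so the corrected first step has magnitude roughly $\text{learning_rate}$ in every coordinate. Without the correction, the first step would carry an extra factor of $\frac{1-\beta_1}{\sqrt{1-\beta_2}} \approx 3.2$ with the defaults $\beta_1 = 0.9$ and $\beta_2 = 0.999$, which is exactly the kind of oversized early step the correction prevents.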
def adam(w, dw, config=None):
    '''
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.
    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    '''
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', torch.zeros_like(w))
    config.setdefault('v', torch.zeros_like(w))
    config.setdefault('t', 0)
    # Increment the step counter first so bias correction uses t >= 1.
    config['t'] = config['t'] + 1
    b1, b2, t = config['beta1'], config['beta2'], config['t']
    # First moment: moving average of the gradient, with bias correction.
    m = b1 * config['m'] + (1 - b1) * dw
    mt = m / (1 - b1 ** t)
    # Second moment: moving average of the squared gradient, with bias correction.
    v = b2 * config['v'] + (1 - b2) * dw * dw
    vt = v / (1 - b2 ** t)
    # Parameter step uses the bias-corrected moments.
    next_w = w - config['learning_rate'] * mt / (torch.sqrt(vt) + config['epsilon'])
    config['m'] = m
    config['v'] = v
    return next_w, config
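Since all four update rules share the same (w, dw, config) interface, they can be swapped freely; the loop below is an illustrative sketch on the same toy quadratic loss as above (nothing here is prescribed by the notes):
# Sketch: drive any of the four update rules on L(w) = 0.5 * ||w||^2, where dL/dw = w.
def run(update_rule, steps=100):
    w = torch.tensor([5.0, -3.0])
    config = None
    for _ in range(steps):
        dw = w.clone()                          # gradient of the toy quadratic loss
        w, config = update_rule(w, dw, config)  # same calling convention for every rule
    return w

for rule in (sgd, sgd_momentum, rmsprop, adam):
    print(rule.__name__, run(rule))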