Learning Rate Scheduling

LambdaLR

Sets the learning rate of each parameter group to the initial $lr$ times a given function $f_{\lambda}$. When last_epoch=$-1$, sets initial $lr$ as $lr$.

$$ lr_{\text {epoch}} = lr_{\text {initial}} * f_{\lambda}(epoch) $$
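Below is a minimal sketch of how this scheduler might be used to trace the schedule (the dummy model, the initial lr of 0.1, and the 0.95 ** epoch lambda are illustrative assumptions, not values from the original notebook):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(10, 1)                        # dummy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# lr_epoch = 0.1 * 0.95 ** epoch
scheduler = LambdaLR(optimizer, lr_lambda=lambda epoch: 0.95 ** epoch)

lrs = []
for epoch in range(100):
    optimizer.step()                                  # forward/backward omitted
    lrs.append(optimizer.param_groups[0]["lr"])       # record lr for plotting
    scheduler.step()
```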

MultiplicativeLR

Multiply the learning rate of each parameter group by the factor given by the specified function $f_{\lambda}$. When last_epoch=-1, sets initial lr as lr.

$$ lr_{\text {epoch}} = lr_{\text {epoch - 1}} * f_{\lambda}(epoch) $$
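A comparable sketch, assuming a constant factor of 0.95 per epoch (illustrative, not the notebook's value):

```python
import torch
from torch.optim.lr_scheduler import MultiplicativeLR

model = torch.nn.Linear(10, 1)                        # dummy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# each epoch: lr_epoch = 0.95 * lr_(epoch - 1)
scheduler = MultiplicativeLR(optimizer, lr_lambda=lambda epoch: 0.95)

lrs = []
for epoch in range(100):
    optimizer.step()                                  # forward/backward omitted
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
```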

StepLR

Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.

$$ lr_{\text {epoch}}=\left\{\begin{array}{ll} \gamma * lr_{\text {epoch - 1}}, & \text { if } {\text {epoch % step_size}}=0 \\ lr_{\text {epoch - 1}}, & \text { otherwise } \end{array}\right. $$
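A sketch with assumed hyperparameters (step_size=30, gamma=0.5, chosen only for illustration):

```python
import torch
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(10, 1)                        # dummy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# halve the lr every 30 epochs
scheduler = StepLR(optimizer, step_size=30, gamma=0.5)

lrs = []
for epoch in range(100):
    optimizer.step()                                  # forward/backward omitted
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
```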

MultiStepLR

Decays the learning rate of each parameter group by gamma once the number of epochs reaches one of the milestones. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.

$$ lr_{\text {epoch}}=\left\{\begin{array}{ll} \gamma * lr_{\text {epoch - 1}}, & \text { if } {\text{ epoch in [milestones]}} \\ lr_{\text {epoch - 1}}, & \text { otherwise } \end{array}\right. $$
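A sketch with assumed milestones at epochs 30 and 80 and gamma=0.1 (illustrative values):

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(10, 1)                        # dummy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# multiply the lr by 0.1 at epochs 30 and 80
scheduler = MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)

lrs = []
for epoch in range(100):
    optimizer.step()                                  # forward/backward omitted
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
```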

ExponentialLR

Decays the learning rate of each parameter group by gamma every epoch. When last_epoch=-1, sets initial lr as lr.

$$ lr_{\text {epoch}}= \gamma * lr_{\text {epoch - 1}} $$
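A sketch assuming gamma=0.95 (illustrative):

```python
import torch
from torch.optim.lr_scheduler import ExponentialLR

model = torch.nn.Linear(10, 1)                        # dummy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# multiply the lr by 0.95 every epoch
scheduler = ExponentialLR(optimizer, gamma=0.95)

lrs = []
for epoch in range(100):
    optimizer.step()                                  # forward/backward omitted
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
```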

ReduceLROnPlateau

Reduce the learning rate when a metric has stopped improving. Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This scheduler reads a metric quantity, and if no improvement is seen for a ‘patience’ number of epochs, the learning rate is reduced.
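A sketch assuming the monitored metric is a validation loss and that the lr is halved after 5 epochs without improvement; the placeholder loss exists only to make the snippet self-contained:

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = torch.nn.Linear(10, 1)                        # dummy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# halve the lr when the monitored loss stops improving for 5 epochs
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=5)

for epoch in range(100):
    optimizer.step()                                  # training loop omitted
    val_loss = 1.0                                    # placeholder metric (a flat loss triggers reductions)
    scheduler.step(val_loss)                          # pass the monitored metric to step()
```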


CosineAnnealingLR

Set the learning rate of each parameter group using a cosine annealing schedule. When last_epoch=-1, sets initial lr as lr. Notice that because the schedule is defined recursively, the learning rate can be simultaneously modified outside this scheduler by other operators. If the learning rate is set solely by this scheduler, the learning rate at each step becomes:

$$ \eta_{t}=\eta_{\min }+\frac{1}{2}\left(\eta_{\max }-\eta_{\min }\right)\left(1+\cos \left(\frac{T_{cur}}{T_{\max }} \pi\right)\right) $$

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts (https://arxiv.org/abs/1608.03983). Note that this only implements the cosine annealing part of SGDR, not the restarts.
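A sketch assuming T_max=50 and eta_min=0.001 (illustrative values):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 1)                        # dummy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# anneal from 0.1 down to 0.001 over 50 epochs, then back up (the cosine is periodic)
scheduler = CosineAnnealingLR(optimizer, T_max=50, eta_min=0.001)

lrs = []
for epoch in range(100):
    optimizer.step()                                  # forward/backward omitted
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
```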


CosineAnnealingWarmRestarts

Set the learning rate of each parameter group using a cosine annealing schedule, and restart after $T_i$ epochs.

$$ \eta_{t}=\eta_{\min }+\frac{1}{2}\left(\eta_{\max }-\eta_{\min }\right)\left(1+\cos \left(\frac{T_{\operatorname{cur}}}{T_{i}} \pi\right)\right) $$
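A sketch assuming a first cycle of T_0=20 epochs; with T_mult=2 each subsequent cycle is twice as long (both values are illustrative):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(10, 1)                        # dummy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# cosine-anneal toward eta_min, restart at 0.1 each cycle; cycle length doubles every restart
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=20, T_mult=2, eta_min=0.001)

lrs = []
for epoch in range(100):
    optimizer.step()                                  # forward/backward omitted
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
```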

OneCycleLR

Sets the learning rate of each parameter group according to the 1 cycle learning rate policy. The 1cycle policy anneals the learning rate from an initial learning rate to some maximum learning rate and then from that maximum learning rate to some minimum learning rate much lower than the initial learning rate. This policy was initially described in the paper Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates.

The 1 cycle learning rate policy changes the learning rate after every batch; step() should be called after a batch has been used for training.

This scheduler is not chainable.
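A sketch assuming 10 epochs of 100 batches each and max_lr=0.1 (illustrative values); note that step() is called once per batch rather than once per epoch:

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(10, 1)                        # dummy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
epochs, steps_per_epoch = 10, 100
scheduler = OneCycleLR(optimizer, max_lr=0.1,
                       epochs=epochs, steps_per_epoch=steps_per_epoch)

lrs = []
for epoch in range(epochs):
    for batch in range(steps_per_epoch):
        optimizer.step()                              # forward/backward omitted
        lrs.append(optimizer.param_groups[0]["lr"])
        scheduler.step()                              # once per batch for OneCycleLR
```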

This blog is converted from machine-learning-learning-rate-scheduling.ipynb
Written on June 12, 2021