Learning Rate Scheduling
- LambdaLR
- MultiplicativeLR
- StepLR
- MultiStepLR
- ExponentialLR
- ReduceLROnPlateau
- CosineAnnealingLR
- CosineAnnealingWarmRestarts
- CyclicLR
- OneCycleLR
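Each section below isolates one scheduler: a throwaway linear model and an exaggerated lr=100 make the schedule easy to read on a plot, and optimizer.step() is called without any data so that only the learning-rate curve matters. For orientation, here is a minimal runnable sketch of where a scheduler sits in an ordinary epoch-level training loop; the random data, MSE loss, and StepLR settings are illustrative assumptions, not part of the examples that follow.
import torch

# Sketch: the scheduler is stepped once per epoch, after the optimizer.
model = torch.nn.Linear(2, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

x, y = torch.randn(64, 2), torch.randn(64, 1)

for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()                                # update the parameters first...
    scheduler.step()                                # ...then advance the schedule
    print(epoch, optimizer.param_groups[0]["lr"])   # the current lr lives in the param groups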
LambdaLR
Sets the learning rate of each parameter group to the initial lr times the factor returned by a given function. When last_epoch=-1, sets initial lr as lr.
$$ lr_{\text{epoch}} = lr_{\text{initial}} * f_{\lambda}(\text{epoch}) $$
# torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1, verbose=False)
%matplotlib inline
import torch
import matplotlib.pyplot as plt
model_a = torch.nn.Linear(2, 1)
model_b = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model_a.parameters(), lr=100)
optimizer.add_param_group({'params': model_b.parameters(), 'momentum': 0.8})
lambda1 = lambda epoch: epoch // 5      # factor for param group 0 (model_a)
lambda2 = lambda epoch: 0.5 ** epoch    # factor for param group 1 (model_b)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=[lambda1, lambda2])
lrs_0 = []
lrs_1 = []
epochs = 10
for i in range(epochs):
    optimizer.step()
    lrs_0.append(optimizer.param_groups[0]["lr"])  # lr of model_a's param group
    lrs_1.append(optimizer.param_groups[1]["lr"])  # lr of model_b's param group
    scheduler.step()
plt.plot(range(epochs), lrs_0, '-*', label='model_a lr')
plt.plot(range(epochs), lrs_1, '-o', label='model_b lr')
plt.legend()
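LambdaLR always multiplies the initial lr by whatever factor the function returns for the current epoch, so the two groups evolve independently of each other. With the initial lr of 100, the values recorded over the 10 epochs for model_a's group and model_b's group work out to:
$$ lr^{(a)}_{\text{epoch}} = 100 \cdot \lfloor \text{epoch}/5 \rfloor = 0, 0, 0, 0, 0, 100, 100, \ldots \qquad lr^{(b)}_{\text{epoch}} = 100 \cdot 0.5^{\text{epoch}} = 100, 50, 25, 12.5, \ldots $$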
MultiplicativeLR
Multiplies the learning rate of each parameter group by the factor returned by the specified function. When last_epoch=-1, sets initial lr as lr.
$$ lr_{\text{epoch}} = lr_{\text{epoch}-1} * f_{\lambda}(\text{epoch}) $$
# torch.optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda, last_epoch=-1, verbose=False)
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
lmbda = lambda epoch: 0.5 ** epoch
scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda=lmbda)
lrs = []
for i in range(10):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(range(10), lrs, '-*')
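Because the MultiplicativeLR update is recursive, the returned factors compound rather than replace each other. With the initial lr of 100 and $f_{\lambda}(\text{epoch}) = 0.5^{\text{epoch}}$, the first few values work out to:
$$ lr_{1} = 100 \cdot 0.5^{1} = 50, \qquad lr_{2} = 50 \cdot 0.5^{2} = 12.5, \qquad lr_{3} = 12.5 \cdot 0.5^{3} = 1.5625 $$
so the decay accelerates every epoch, unlike LambdaLR above, which always multiplies the initial lr.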
StepLR
Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.
$$ lr_{\text{epoch}} = \begin{cases} \gamma * lr_{\text{epoch}-1}, & \text{if } \text{epoch} \bmod \text{step\_size} = 0 \\ lr_{\text{epoch}-1}, & \text{otherwise} \end{cases} $$
# torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1, verbose=False)
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
lrs = []
for i in range(10):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(range(10), lrs, '-*')
MultiStepLR
Decays the learning rate of each parameter group by gamma once the number of epochs reaches one of the milestones. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.
$$ lr_{\text{epoch}} = \begin{cases} \gamma * lr_{\text{epoch}-1}, & \text{if epoch} \in \text{milestones} \\ lr_{\text{epoch}-1}, & \text{otherwise} \end{cases} $$
# torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1, verbose=False)
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[4, 8], gamma=0.5)
lrs = []
for i in range(10):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(range(10), lrs, '-*')
ExponentialLR
Decays the learning rate of each parameter group by gamma every epoch. When last_epoch=-1, sets initial lr as lr.
$$ lr_{\text{epoch}} = \gamma * lr_{\text{epoch}-1} $$
# torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1, verbose=False)
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)
lrs = []
for i in range(10):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(range(10), lrs, '-*')
ReduceLROnPlateau
Reduces the learning rate when a metric has stopped improving. Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This scheduler reads a metric and, if no improvement is seen for a ‘patience’ number of epochs, multiplies the learning rate by ‘factor’.
# torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10,
# threshold=0.0001, threshold_mode='rel', cooldown=0,
# min_lr=0, eps=1e-08, verbose=False)
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=1, factor=0.5)
lrs = []
loss = [1.0] * 6 + [0.1] * 5   # synthetic metric: flat at 1.0, improves once, then flat again
for i in range(10):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step(loss[i])    # the monitored metric is passed to step()
plt.plot(range(10), lrs, '-o')
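In real training, the value passed to step() is usually a validation metric computed at the end of each epoch. A minimal runnable sketch of that pattern; the random data, the epoch count, and the choice to monitor a loss with mode='min' are assumptions for illustration.
import torch

model = torch.nn.Linear(2, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=2, factor=0.5)

x_train, y_train = torch.randn(64, 2), torch.randn(64, 1)
x_val, y_val = torch.randn(16, 2), torch.randn(16, 1)

for epoch in range(20):
    optimizer.zero_grad()
    loss = criterion(model(x_train), y_train)
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val)
    scheduler.step(val_loss)   # lr is multiplied by factor once val_loss stops improving for patience epochs
    print(epoch, optimizer.param_groups[0]["lr"])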
CosineAnnealingLR
Sets the learning rate of each parameter group using a cosine annealing schedule. When last_epoch=-1, sets initial lr as lr. Notice that because the schedule is defined recursively, the learning rate can be simultaneously modified outside this scheduler by other operators. If the learning rate is set solely by this scheduler, the learning rate at each step becomes:
$$ \eta_{t} = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_{\max}} \pi\right)\right) $$
It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts (https://arxiv.org/abs/1608.03983). Note that this only implements the cosine annealing part of SGDR, and not the restarts.
# torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1, verbose=False)
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
T = 10
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=T, eta_min=10)
lrs = []
for i in range(4*T + 1):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(range(4*T+1), lrs, '-o')
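Running the loop for 4*T+1 steps shows that, without restarts, the schedule is simply periodic: the cosine carries the learning rate from $\eta_{\max}$ down to $\eta_{\min}$ over $T_{\max}$ steps and back up over the next $T_{\max}$, so the plot traces two full down-and-up periods. Evaluating the formula above at the turning points with $\eta_{\max}=100$, $\eta_{\min}=10$, $T_{\max}=10$:
$$ \eta_{0} = 10 + \tfrac{1}{2}(100-10)(1+\cos 0) = 100, \qquad \eta_{10} = 10 + \tfrac{1}{2}(100-10)(1+\cos \pi) = 10 $$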
CosineAnnealingWarmRestarts
Sets the learning rate of each parameter group using a cosine annealing schedule and restarts the schedule every T_i epochs, where T_i is the number of epochs between two restarts (T_0 for the first cycle, multiplied by T_mult after each restart).
$$ \eta_{t} = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_{i}} \pi\right)\right) $$
# torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0,
#                                                      T_mult=1, eta_min=0, last_epoch=-1, verbose=False)
import torch
import matplotlib.pyplot as plt
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=10, last_epoch=-1)
lrs = []
for i in range(31):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(range(31), lrs, '-*')
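CosineAnnealingWarmRestarts can also be advanced more finely than once per epoch: its step() accepts a fractional epoch, so the cosine moves forward inside an inner batch loop. A minimal runnable sketch of that pattern; the epoch count and the 20 batches per epoch are assumptions for illustration.
import torch

model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=10)

iters = 20                               # assumed number of batches per epoch
for epoch in range(3):
    for i in range(iters):
        optimizer.step()
        scheduler.step(epoch + i / iters)   # fractional epoch advances the cosine within the current cycle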
CyclicLR
Sets the learning rate of each parameter group according to the cyclical learning rate policy (CLR). The policy cycles the learning rate between two boundaries, base_lr and max_lr, at a constant frequency, as proposed in Cyclical Learning Rates for Training Neural Networks. As with OneCycleLR below, step should be called after every batch rather than every epoch.
# torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr, max_lr, step_size_up=2000,
#                                   step_size_down=None, mode='triangular', gamma=1.0, scale_fn=None,
#                                   scale_mode='cycle', cycle_momentum=True,
#                                   base_momentum=0.8, max_momentum=0.9, last_epoch=-1, verbose=False)
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=0.001, max_lr=100,
step_size_up=5, mode="triangular")
lrs = []
for i in range(100):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(lrs)
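Besides 'triangular', CyclicLR also offers 'triangular2', which halves the cycle amplitude after every cycle, and 'exp_range', which scales the amplitude by gamma raised to the iteration count. A quick, hedged way to compare the three under the same toy setup; the sweep helper below is just for this illustration.
import torch
import matplotlib.pyplot as plt

def sweep(mode, **kwargs):
    """Collect 100 learning-rate values for the given CyclicLR mode (illustrative helper)."""
    model = torch.nn.Linear(2, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=100)
    scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=0.001, max_lr=100,
                                                  step_size_up=5, mode=mode, **kwargs)
    lrs = []
    for _ in range(100):
        optimizer.step()
        lrs.append(optimizer.param_groups[0]["lr"])
        scheduler.step()
    return lrs

plt.plot(sweep("triangular"), label="triangular")
plt.plot(sweep("triangular2"), label="triangular2")
plt.plot(sweep("exp_range", gamma=0.98), label="exp_range")
plt.legend()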
OneCycleLR
Sets the learning rate of each parameter group according to the 1 cycle learning rate policy. The 1cycle policy anneals the learning rate from an initial learning rate to some maximum learning rate and then from that maximum learning rate to some minimum learning rate much lower than the initial learning rate. This policy was initially described in the paper Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates.
The 1 cycle learning rate policy changes the learning rate after every batch. step should be called after a batch has been used for training.
This scheduler is not chainable.
# torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr, total_steps=None, epochs=None, steps_per_epoch=None,
#                                     pct_start=0.3, anneal_strategy='cos', cycle_momentum=True,
#                                     base_momentum=0.85, max_momentum=0.95, div_factor=25.0,
#                                     final_div_factor=10000.0, three_phase=False, last_epoch=-1, verbose=False)
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=100, steps_per_epoch=10, epochs=10)
lrs = []
for i in range(100):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(lrs, '-*')
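Since OneCycleLR is meant to be stepped once per batch, in real training the scheduler.step() call moves inside the inner loop and the total number of steps must match epochs * steps_per_epoch. A minimal runnable sketch of that pattern; the random data, batch size, max_lr, and epoch count are assumptions for illustration.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Assumed toy data: 80 samples in batches of 8 -> 10 steps per epoch.
dataset = TensorDataset(torch.randn(80, 2), torch.randn(80, 1))
loader = DataLoader(dataset, batch_size=8)

model = torch.nn.Linear(2, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1.0,
                                                steps_per_epoch=len(loader), epochs=10)

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()   # OneCycleLR advances once per batch, not once per epoch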