Machine Learning Notes - Backpropagation
- Backpropagation Introduction ¶
- Chain Rule ¶
- Local Gradient ¶
- Common Computation Blocks - Forward and Backward Pass ¶
- Computation Graph ¶
Backpropagation Introduction ¶
Chain Rule ¶
$$ y = f_1(x) \\ L = f_2(y) $$Chain rule:
$$ \frac{\partial{L}}{\partial{y}} = \frac{\partial{f_2(y)}}{\partial{y}} \\ \frac{\partial{L}}{\partial{x}} = \frac{\partial{L}}{\partial{y}}\frac{\partial{y}}{\partial{x}} = \frac{\partial{f_2(y)}}{\partial{y}} \frac{\partial{f_1(x)}}{\partial{x}} $$# Chain Rule
import torch
x = torch.tensor([5], requires_grad=True, dtype=torch.float64)
y = 2 * x
L = 3 * y
dL = 1
dy = 3 * dL
dx = 2 * dy
L.backward()
print(f'dL/dx: {dx}')
print(f'Pytorch autograd {x.grad}')
#
Chain rule:
$$ \frac{\partial{L}}{\partial{y}} = \begin{bmatrix} \frac{\partial{g(y)}}{\partial{y_1}} \\ \frac{\partial{g(y)}}{\partial{y_2}} \\ \end{bmatrix}\\ \frac{\partial{L}}{\partial{x}} = J_x^y \frac{\partial{L}}{\partial{y}} = \begin{bmatrix} \frac{\partial{f_1(x)}}{\partial{x_1}}, \frac{\partial{f_2(x)}}{\partial{x_1}}\\ \frac{\partial{f_1(x)}}{\partial{x_2}}, \frac{\partial{f_2(x)}}{\partial{x_2}}\\ \frac{\partial{f_1(x)}}{\partial{x_3}}, \frac{\partial{f_2(x)}}{\partial{x_3}} \end{bmatrix} \frac{\partial{L}}{\partial{y}} $$# Gradients with respect to a vector
# Chain Rule
import torch
x = torch.tensor([1, 2, 3], requires_grad=True, dtype=torch.float64)
y = torch.zeros(2)
y[0] = x[0] + 2*x[1]
y[1] = x[1] + 2*x[2]
L = y.sum()
dL = 1
dy = torch.tensor([1, 1])
dx = torch.tensor([[1, 0],[2, 1], [0, 2]]).matmul(dy)
L.backward()
print(f'dL/dx: {dx}')
print(f'Pytorch autograd {x.grad}')
#
Local Gradient ¶
... ---> $\textbf{x}$ ---> $f(\textbf{x})$ ---> $\textbf{y}$ ---> ... ---> L
$$ \frac{\partial{L}}{\partial{\textbf{x}}} = J^{\textbf{y}}_\textbf{x}\frac{\partial{L}}{\partial{\textbf{y}}} $$$J^{\textbf{y}}_\textbf{x}$ is a Jacobian matrix, and it has the same number of rows as $\textbf{x}$
Common Computation Blocks - Forward and Backward Pass ¶
Vector dot product ¶
Forward pass: $$ z = x^Ty $$
Backward pass: $$ \frac{\partial{L}}{\partial{x}} = \frac{\partial{L}}{\partial{z}}y \\ \frac{\partial{L}}{\partial{y}} = \frac{\partial{L}}{\partial{z}}x $$ Where $[x]_{nx1}, [y]_{nx1}$
# Vector dot product
import torch
x = torch.tensor([1, 2, 3], requires_grad=True, dtype=torch.float64)
y = torch.tensor([3, 2, 1], requires_grad=True, dtype=torch.float64)
z = x.T.matmul(y)
z.retain_grad()
L = z.sum()
L.backward()
print(f'Pytorch autograd dz: {z.grad}')
print(f'Pytorch autograd dx: {x.grad}')
print(f'Pytorch autograd dy: {y.grad}')
dx = z.grad * y
assert(torch.allclose(dx, x.grad))
dy = z.grad * x
assert(torch.all(torch.isclose(dy, y.grad)))
#
Vector product ¶
Forward pass: $$ Z = xy $$
Backward pass: $$ \frac{\partial{L}}{\partial{x}} = \frac{\partial{L}}{\partial{Z}}y^T \\ \frac{\partial{L}}{\partial{y}} = x^T\frac{\partial{L}}{\partial{Z}} $$ Where $[x]_{mx1}, [y]_{1xn}, [Z]_{mxn}$
# Vector product
import torch
x = torch.tensor([1, 2, 3], requires_grad=True, dtype=torch.float64).reshape(3,1)
y = torch.tensor([3, 2], requires_grad=True, dtype=torch.float64).reshape(1,2)
z = x.matmul(y)
x.retain_grad()
y.retain_grad()
z.retain_grad()
L = z.sum()
L.backward()
print(f'Pytorch autograd dz:\n {z.grad}')
print(f'Pytorch autograd dx:\n {x.grad}')
print(f'Pytorch autograd dy:\n {y.grad}')
dx = z.grad.matmul(y.T)
assert(torch.allclose(dx, x.grad))
dy = x.T.matmul(z.grad)
assert(torch.all(torch.isclose(dy, y.grad)))
#
Vector & Matrix product ¶
Forward pass: $$ Z = Wx $$
Backward pass: $$ \frac{\partial{L}}{\partial{x}} = W^T\frac{\partial{L}}{\partial{Z}} \\ \frac{\partial{L}}{\partial{W}} = \frac{\partial{L}}{\partial{Z}}x^T $$ Where $[x]_{Dx1}, [W]_{CxD}, [z]_{Cx1}$
# Vector product
import torch
x = torch.tensor([1, 2], requires_grad=True, dtype=torch.float64).reshape(2,1)
w = torch.tensor([[1, 2],[2, 1]], requires_grad=True, dtype=torch.float64).reshape(2,2)
z = w.matmul(x)
x.retain_grad()
w.retain_grad()
z.retain_grad()
L = z.sum()
L.backward()
print(f'Pytorch autograd dz:\n {z.grad}')
print(f'Pytorch autograd dx:\n {x.grad}')
print(f'Pytorch autograd dy:\n {w.grad}')
dx = w.T.matmul(z.grad)
assert(torch.all(torch.isclose(dx, x.grad)))
dw = z.grad.matmul(x.T)
assert(torch.allclose(dw, w.grad))
#
Matrix & Matrix product ¶
Forward pass: $$ Z = XW $$
Backward pass: $$ \frac{\partial{L}}{\partial{X}} = \frac{\partial{L}}{\partial{Z}}W^T \\ \frac{\partial{L}}{\partial{W}} = X^T\frac{\partial{L}}{\partial{Z}} $$ Where $[x]_{NxD}, [W]_{DxC}, [z]_{NxC}$
# Vector product
import torch
x = torch.tensor([[1, 2, 3], [3, 2, 1]], requires_grad=True, dtype=torch.float64).reshape(2,3)
w = torch.tensor([[1, 2, 3],[2, 1, 3], [3, 1, 2]], requires_grad=True, dtype=torch.float64).reshape(3,3)
z = x.matmul(w)
x.retain_grad()
w.retain_grad()
z.retain_grad()
L = z.sum()
L.backward()
print(f'Pytorch autograd dz:\n {z.grad}')
print(f'Pytorch autograd dx:\n {x.grad}')
print(f'Pytorch autograd dy:\n {w.grad}')
dw = x.T.matmul(z.grad)
assert(torch.all(torch.isclose(dw, w.grad)))
dx = z.grad.matmul(w.T)
assert(torch.allclose(dx, x.grad))
#
Softmax ¶
Forward pass: $$ p_i = \frac{e^{s_i}}{\sum_i{e^{s_i}}} \\ t = \sum_i{e^{s_i}} $$
Backward pass: $$ \frac{\partial{L}}{\partial{s}} = J^p_s\frac{\partial{L}}{\partial{p}} \\ \begin{cases} -p_ip_j &\text{$i \neq j$}\\ p_i(1-p_i) &\text{otherwise} \end{cases} $$ Where $[s]_{Cx1}, [p]_{Cx1}$
# Vector product
import torch
s = torch.tensor([[1, 2]], requires_grad=True, dtype=torch.float64).reshape(2,1)
s_exp = torch.exp(s)
t = torch.sum(s_exp)
p = s_exp / t
s.retain_grad()
p.retain_grad()
# log likelihood
L = -torch.sum(torch.log(p))
L.backward()
print(f'Pytorch autograd dp:\n {p.grad}')
print(f'Pytorch autograd ds:\n {s.grad}')
J = torch.zeros(2, 2, dtype=p.dtype)
J[0, 0] = p[0]*(1-p[0]) # i = 0, i = 0
J[1, 0] = -p[0]*p[1] # i = 0, j = 1
J[0, 1] = -p[1]*p[0] # i = 1, j = 0
J[1, 1] = p[1]*(1-p[1]) # i = 0, i = 1
ds = J.matmul(p.grad)
assert(torch.allclose(ds, s.grad))
#
Computation Graph ¶
Introduction ¶
The idea of computation graph is to divide the whole computation into small blocks. During the forward pass, we calculate the block outputs and record the outputs for calculating gradients in backward pass. In the backward pass, we calculate the gradients. This is exactly the same idea of the chain rule, but make it more clear when apply the chain rule.
For example, if we have a computation chain is like $ L = (2x + 3y)z$, where $x=1$, $y =2$, $z=3$, we can break it down as: $$ a = 2x \\ b = 3y \\ c = a+b \\ L = cz $$
In the forward pass, we get: $$ a = 2x = 2 \\ b = 3y = 6\\ c = a+b = 8\\ L = cz = 24 $$
In the backward pass, we get: $$ \frac{\partial{L}}{\partial{L}} = 1 \\ \frac{\partial{L}}{\partial{z}} = \frac{\partial{L}}{\partial{L}}\frac{\partial{L}}{\partial{z}} = 1\times8=8\\ \frac{\partial{L}}{\partial{c}} = \frac{\partial{L}}{\partial{L}}\frac{\partial{L}}{\partial{c}} = 1\times3=3\\ \frac{\partial{L}}{\partial{a}} = \frac{\partial{c}}{\partial{a}}\frac{\partial{L}}{\partial{c}} = 1\times3=3\\ \frac{\partial{L}}{\partial{b}} = \frac{\partial{c}}{\partial{b}}\frac{\partial{L}}{\partial{c}} = 1\times3=3\\ \frac{\partial{L}}{\partial{x}} = \frac{\partial{a}}{\partial{x}}\frac{\partial{L}}{\partial{a}} = 2\times3=6\\ \frac{\partial{L}}{\partial{y}} = \frac{\partial{b}}{\partial{y}}\frac{\partial{L}}{\partial{b}} = 3\times3=9\\ $$
Cross Entropy Loss ¶
The mathematical expression of Cross Entropy Loss is described here .
A computation graph is shown as below that $[]$ are the outputs of block operation $()$.
$[W]_{C \times D}$ -\
\
(Wx) -> [s]_{Cx1} -> (softmax(s)) -> [p]_{Cx1} -> (cross_entropy_loss(p, label=c)) -> [loss]_{1x1}
/
$[x]_{D \times 1}$ -/
Consider a single case that $[W]_{3x2}$, $[x]_{2x1}$ and correct label index is $0$.
For the forward pass, we can get: $$ [s]_{3x1} = Wx \\ [p]_{3x1} = softmax(s) \\ p_s = p[0] = \begin{bmatrix}1, 0, 0\end{bmatrix}p\\ L = -log(p_s) $$
For the backward pass, we can derive the intermediate gradients one by one. $$ \frac{\partial{L}}{\partial{L}} = 1 \\ \frac{\partial{L}}{\partial{p_s}} = -\frac{1}{p_s} \\ \frac{\partial{L}}{\partial{p}} = -\frac{1}{p_s} \begin{bmatrix}1 \\ 0\\ 0\end{bmatrix} = \begin{bmatrix}-\frac{1}{p_s} \\ 0\\ 0\end{bmatrix} $$ Based on Softmax :
$$ \frac{\partial{L}}{\partial{s}} = J^p_s\frac{\partial{L}}{\partial{p}} = \begin{bmatrix} &p_0(1-p_0), &-p_0p_1, &-p_0p_2 \\ &-p_1p_0, &p_1(1-p_1), &-p_1p_2\\ &-p_2p_0, &-p_2p_1, &p_2(1-p_2) \end{bmatrix} \begin{bmatrix}-\frac{1}{p_s} \\ 0\\ 0\end{bmatrix} = \begin{bmatrix}p_0-1 \\ p_1\\ p_2\end{bmatrix} $$Based on Vector & Matrix product : $$ \frac{\partial{L}}{\partial{W}} = \frac{\partial{L}}{\partial{s}}x^T \\ \frac{\partial{L}}{\partial{x}} = W^T\frac{\partial{L}}{\partial{s}} \\ $$
# Cross Entropy Loss
import torch
torch.set_printoptions(precision=4)
W = torch.randn(3, 2, requires_grad = True)
x = torch.randn(2, requires_grad = True)
c = torch.tensor(0)
# Forward
s = torch.matmul(W, x)
s.retain_grad()
# Numeric trick to prevent overflow of using exp().
exp_s = torch.exp(s - torch.max(s))
p = exp_s / torch.sum(exp_s)
loss = -torch.log(p[c])
print(f'\nProbabily distribution\n {p}')
print(f'\nCross entropy loss is {loss:.4f}')
# Backward
loss.backward()
ds = torch.tensor([p[0]-1, p[1], p[2]])
dW = ds.view(-1,1).matmul(x.view(1,-1))
dx = W.T.matmul(ds)
print(f'\nGradients of s: \n {ds}')
assert(torch.allclose(ds, s.grad))
print(f'\nGradients of W: \n {dW}')
assert(torch.allclose(dW, W.grad))
print(f'\nGradients of x: \n {dx}')
#
Hinge Loss ¶
The mathematical expression of Cross Entropy Loss is described here .
A computation graph is shown as below that $[]$ are the outputs of block operation $()$.
$[W]_{C \times D}$ -\
\
(Wx)->[s]_{Cx1}->(s_i-s_c+1, 0)->[t]->(max(t, 0)) ->[z] ->(sum(z)) ->[loss]_{1x1}
/
$[x]_{D \times 1}$ -/
Consider a single case that $[W]_{3x2}$, $[x]_{2x1}$ and correct label index is $0$.
For the forward pass, we can get: $$ [s]_{3x1} = Wx \\ [t]_{3x1} = \begin{bmatrix}0 \\ s_1-s_0+1\\ s_2-s_0+1\end{bmatrix} \\ [z]_{3x1} = \begin{bmatrix}0 \\ m(s_1-s_0+1) \\ n(s_2-s_0+1) \end{bmatrix} \\ L = z_1 + z_2 $$ Where $m=0$ if $s_1-s_0+1<=0$, otherwise $m=1$, $n=0$ if $s_2-s_0+1<=0$, otherwise $n=1$.
For the backward pass, we can derive the intermediate gradients one by one. $$ \frac{\partial{L}}{\partial{L}} = 1 \\ \frac{\partial{L}}{\partial{z}} = \begin{bmatrix}0 \\ 1\\ 1\end{bmatrix} \\ \frac{\partial{L}}{\partial{t}} = J^z_t\frac{\partial{L}}{\partial{z}}= \begin{bmatrix} &0, &0, &0 \\ &0, &m, &0 \\ &0, &0, &n \end{bmatrix} \begin{bmatrix}0 \\ 1\\ 1\end{bmatrix} = \begin{bmatrix}0 \\ m \\ n\end{bmatrix} \\ \frac{\partial{L}}{\partial{s}} = J^t_s\frac{\partial{L}}{\partial{t}} = \begin{bmatrix} &0, &-1, &-1 \\ &0, &1, &0 \\ &0, &0, &1 \end{bmatrix} \begin{bmatrix}0 \\ m \\ n\end{bmatrix} = \begin{bmatrix}-(m+n) \\ m \\ n\end{bmatrix} $$ Based on Vector & Matrix product : $$ \frac{\partial{L}}{\partial{W}} = \frac{\partial{L}}{\partial{s}}x^T \\ \frac{\partial{L}}{\partial{x}} = W^T\frac{\partial{L}}{\partial{s}} \\ $$
# Cross Entropy Loss
import torch
torch.set_printoptions(precision=4)
W = torch.tensor([[1., 2.], [3., 4.], [1., 3.]], requires_grad = True)
x = torch.tensor([2., 1.], requires_grad = True)
c = torch.tensor(0)
# Forward
s = torch.matmul(W, x)
s.retain_grad()
t = s-s[c]+1
t[c] = 0
z = torch.clamp(t, 0)
m = float(t[1] > 0)
n = float(t[2] > 0)
loss = z[1] + z[2]
print(f'\Hinge loss is {loss:.4f}')
# Backward
loss.backward()
ds = torch.tensor([-m-n, m, n])
dW = ds.view(-1,1).matmul(x.view(1,-1))
dx = W.T.matmul(ds)
print(f'\nGradients of s: \n {ds}')
assert(torch.allclose(ds, s.grad))
print(f'\nGradients of W: \n {dW}')
assert(torch.allclose(dW, W.grad))
print(f'\nGradients of x: \n {dx}')
#