Backpropagation Introduction ¶

Chain Rule ¶

$$ y = f_1(x) \\ L = f_2(y) $$

Chain rule:

$$ \frac{\partial{L}}{\partial{y}} = \frac{\partial{f_2(y)}}{\partial{y}} \\ \frac{\partial{L}}{\partial{x}} = \frac{\partial{L}}{\partial{y}}\frac{\partial{y}}{\partial{x}} = \frac{\partial{f_2(y)}}{\partial{y}} \frac{\partial{f_1(x)}}{\partial{x}} $$

In [56]:

# Chain Rule

import torch
x = torch.tensor([5], requires_grad=True, dtype=torch.float64)
y = 2 * x
L = 3 * y
dL = 1
dy = 3 * dL
dx = 2 * dy
L.backward()
print(f'dL/dx: {dx}')
print(f'Pytorch autograd {x.grad}')

#

dL/dx: 6
Pytorch autograd tensor([6.], dtype=torch.float64)

$$ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = f(\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}) = \begin{bmatrix} f_{1}(x_1, x_2, x_3) \\ f_{2}(x_1, x_2, x_3) \end{bmatrix}\\ L = g(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}) \\ $$

Chain rule:

$$ \frac{\partial{L}}{\partial{y}} = \begin{bmatrix} \frac{\partial{g(y)}}{\partial{y_1}} \\ \frac{\partial{g(y)}}{\partial{y_2}} \\ \end{bmatrix}\\ \frac{\partial{L}}{\partial{x}} = J_x^y \frac{\partial{L}}{\partial{y}} = \begin{bmatrix} \frac{\partial{f_1(x)}}{\partial{x_1}}, \frac{\partial{f_2(x)}}{\partial{x_1}}\\ \frac{\partial{f_1(x)}}{\partial{x_2}}, \frac{\partial{f_2(x)}}{\partial{x_2}}\\ \frac{\partial{f_1(x)}}{\partial{x_3}}, \frac{\partial{f_2(x)}}{\partial{x_3}} \end{bmatrix} \frac{\partial{L}}{\partial{y}} $$

In [57]:

# Gradients with respect to a vector

# Chain Rule

import torch
x = torch.tensor([1, 2, 3], requires_grad=True, dtype=torch.float64)
y = torch.zeros(2)
y[0] = x[0] + 2*x[1]
y[1] = x[1] + 2*x[2]
L = y.sum()
dL = 1
dy = torch.tensor([1, 1])
dx = torch.tensor([[1, 0],[2, 1], [0, 2]]).matmul(dy)
L.backward()
print(f'dL/dx: {dx}')
print(f'Pytorch autograd {x.grad}')

#

dL/dx: tensor([1, 3, 2])
Pytorch autograd tensor([1., 3., 2.], dtype=torch.float64)

Local Gradient ¶

... ---> $\textbf{x}$ ---> $f(\textbf{x})$ ---> $\textbf{y}$ ---> ... ---> L

$$ \frac{\partial{L}}{\partial{\textbf{x}}} = J^{\textbf{y}}_\textbf{x}\frac{\partial{L}}{\partial{\textbf{y}}} $$

$J^{\textbf{y}}_\textbf{x}$ is a Jacobian matrix, and it has the same number of rows as $\textbf{x}$

Common Computation Blocks - Forward and Backward Pass ¶

Vector dot product ¶

Forward pass: $$ z = x^Ty $$

Backward pass: $$ \frac{\partial{L}}{\partial{x}} = \frac{\partial{L}}{\partial{z}}y \\ \frac{\partial{L}}{\partial{y}} = \frac{\partial{L}}{\partial{z}}x $$ Where $[x]_{nx1}, [y]_{nx1}$

In [58]:

# Vector dot product
import torch
x = torch.tensor([1, 2, 3], requires_grad=True, dtype=torch.float64)
y = torch.tensor([3, 2, 1], requires_grad=True, dtype=torch.float64)
z = x.T.matmul(y)
z.retain_grad()
L = z.sum()
L.backward()
print(f'Pytorch autograd dz: {z.grad}')
print(f'Pytorch autograd dx: {x.grad}')
print(f'Pytorch autograd dy: {y.grad}')

dx = z.grad * y
assert(torch.allclose(dx, x.grad))
dy = z.grad * x
assert(torch.all(torch.isclose(dy, y.grad)))
#

Pytorch autograd dz: 1.0
Pytorch autograd dx: tensor([3., 2., 1.], dtype=torch.float64)
Pytorch autograd dy: tensor([1., 2., 3.], dtype=torch.float64)

Vector product ¶

Forward pass: $$ Z = xy $$

Backward pass: $$ \frac{\partial{L}}{\partial{x}} = \frac{\partial{L}}{\partial{Z}}y^T \\ \frac{\partial{L}}{\partial{y}} = x^T\frac{\partial{L}}{\partial{Z}} $$ Where $[x]_{mx1}, [y]_{1xn}, [Z]_{mxn}$

In [59]:

# Vector product

import torch
x = torch.tensor([1, 2, 3], requires_grad=True, dtype=torch.float64).reshape(3,1)
y = torch.tensor([3, 2], requires_grad=True, dtype=torch.float64).reshape(1,2)
z = x.matmul(y)
x.retain_grad()
y.retain_grad()
z.retain_grad()
L = z.sum()
L.backward()
print(f'Pytorch autograd dz:\n {z.grad}')
print(f'Pytorch autograd dx:\n {x.grad}')
print(f'Pytorch autograd dy:\n {y.grad}')

dx = z.grad.matmul(y.T)
assert(torch.allclose(dx, x.grad))
dy = x.T.matmul(z.grad)
assert(torch.all(torch.isclose(dy, y.grad)))
#

Pytorch autograd dz:
 tensor([[1., 1.],
        [1., 1.],
        [1., 1.]], dtype=torch.float64)
Pytorch autograd dx:
 tensor([[5.],
        [5.],
        [5.]], dtype=torch.float64)
Pytorch autograd dy:
 tensor([[6., 6.]], dtype=torch.float64)

Vector & Matrix product ¶

Forward pass: $$ Z = Wx $$

Backward pass: $$ \frac{\partial{L}}{\partial{x}} = W^T\frac{\partial{L}}{\partial{Z}} \\ \frac{\partial{L}}{\partial{W}} = \frac{\partial{L}}{\partial{Z}}x^T $$ Where $[x]_{Dx1}, [W]_{CxD}, [z]_{Cx1}$

In [60]:

# Vector product

import torch
x = torch.tensor([1, 2], requires_grad=True, dtype=torch.float64).reshape(2,1)
w = torch.tensor([[1, 2],[2, 1]], requires_grad=True, dtype=torch.float64).reshape(2,2)
z = w.matmul(x)
x.retain_grad()
w.retain_grad()
z.retain_grad()
L = z.sum()
L.backward()
print(f'Pytorch autograd dz:\n {z.grad}')
print(f'Pytorch autograd dx:\n {x.grad}')
print(f'Pytorch autograd dy:\n {w.grad}')

dx = w.T.matmul(z.grad)
assert(torch.all(torch.isclose(dx, x.grad)))

dw = z.grad.matmul(x.T)
assert(torch.allclose(dw, w.grad))
#

Pytorch autograd dz:
 tensor([[1.],
        [1.]], dtype=torch.float64)
Pytorch autograd dx:
 tensor([[3.],
        [3.]], dtype=torch.float64)
Pytorch autograd dy:
 tensor([[1., 2.],
        [1., 2.]], dtype=torch.float64)

Matrix & Matrix product ¶

Forward pass: $$ Z = XW $$

Backward pass: $$ \frac{\partial{L}}{\partial{X}} = \frac{\partial{L}}{\partial{Z}}W^T \\ \frac{\partial{L}}{\partial{W}} = X^T\frac{\partial{L}}{\partial{Z}} $$ Where $[x]_{NxD}, [W]_{DxC}, [z]_{NxC}$

In [61]:

# Vector product

import torch
x = torch.tensor([[1, 2, 3], [3, 2, 1]], requires_grad=True, dtype=torch.float64).reshape(2,3)
w = torch.tensor([[1, 2, 3],[2, 1, 3], [3, 1, 2]], requires_grad=True, dtype=torch.float64).reshape(3,3)
z = x.matmul(w)
x.retain_grad()
w.retain_grad()
z.retain_grad()
L = z.sum()
L.backward()
print(f'Pytorch autograd dz:\n {z.grad}')
print(f'Pytorch autograd dx:\n {x.grad}')
print(f'Pytorch autograd dy:\n {w.grad}')

dw = x.T.matmul(z.grad)
assert(torch.all(torch.isclose(dw, w.grad)))

dx = z.grad.matmul(w.T)
assert(torch.allclose(dx, x.grad))
#

Pytorch autograd dz:
 tensor([[1., 1., 1.],
        [1., 1., 1.]], dtype=torch.float64)
Pytorch autograd dx:
 tensor([[6., 6., 6.],
        [6., 6., 6.]], dtype=torch.float64)
Pytorch autograd dy:
 tensor([[4., 4., 4.],
        [4., 4., 4.],
        [4., 4., 4.]], dtype=torch.float64)

Softmax ¶

Forward pass: $$ p_i = \frac{e^{s_i}}{\sum_i{e^{s_i}}} \\ t = \sum_i{e^{s_i}} $$

Backward pass: $$ \frac{\partial{L}}{\partial{s}} = J^p_s\frac{\partial{L}}{\partial{p}} \\ \begin{cases} -p_ip_j &\text{$i \neq j$}\\ p_i(1-p_i) &\text{otherwise} \end{cases} $$ Where $[s]_{Cx1}, [p]_{Cx1}$

In [62]:

# Vector product

import torch
s = torch.tensor([[1, 2]], requires_grad=True, dtype=torch.float64).reshape(2,1)
s_exp = torch.exp(s)
t = torch.sum(s_exp)
p = s_exp / t
s.retain_grad()
p.retain_grad()
# log likelihood
L = -torch.sum(torch.log(p))
L.backward()
print(f'Pytorch autograd dp:\n {p.grad}')
print(f'Pytorch autograd ds:\n {s.grad}')

J = torch.zeros(2, 2, dtype=p.dtype)
J[0, 0] = p[0]*(1-p[0]) # i = 0, i = 0
J[1, 0] = -p[0]*p[1]    # i = 0, j = 1
J[0, 1] = -p[1]*p[0]     # i = 1, j = 0
J[1, 1] = p[1]*(1-p[1]) # i = 0, i = 1
ds = J.matmul(p.grad)

assert(torch.allclose(ds, s.grad))
#

Pytorch autograd dp:
 tensor([[-3.7183],
        [-1.3679]], dtype=torch.float64)
Pytorch autograd ds:
 tensor([[-0.4621],
        [ 0.4621]], dtype=torch.float64)

Computation Graph ¶

Introduction ¶

The idea of computation graph is to divide the whole computation into small blocks. During the forward pass, we calculate the block outputs and record the outputs for calculating gradients in backward pass. In the backward pass, we calculate the gradients. This is exactly the same idea of the chain rule, but make it more clear when apply the chain rule.

For example, if we have a computation chain is like $ L = (2x + 3y)z$, where $x=1$, $y =2$, $z=3$, we can break it down as: $$ a = 2x \\ b = 3y \\ c = a+b \\ L = cz $$

In the forward pass, we get: $$ a = 2x = 2 \\ b = 3y = 6\\ c = a+b = 8\\ L = cz = 24 $$

In the backward pass, we get: $$ \frac{\partial{L}}{\partial{L}} = 1 \\ \frac{\partial{L}}{\partial{z}} = \frac{\partial{L}}{\partial{L}}\frac{\partial{L}}{\partial{z}} = 1\times8=8\\ \frac{\partial{L}}{\partial{c}} = \frac{\partial{L}}{\partial{L}}\frac{\partial{L}}{\partial{c}} = 1\times3=3\\ \frac{\partial{L}}{\partial{a}} = \frac{\partial{c}}{\partial{a}}\frac{\partial{L}}{\partial{c}} = 1\times3=3\\ \frac{\partial{L}}{\partial{b}} = \frac{\partial{c}}{\partial{b}}\frac{\partial{L}}{\partial{c}} = 1\times3=3\\ \frac{\partial{L}}{\partial{x}} = \frac{\partial{a}}{\partial{x}}\frac{\partial{L}}{\partial{a}} = 2\times3=6\\ \frac{\partial{L}}{\partial{y}} = \frac{\partial{b}}{\partial{y}}\frac{\partial{L}}{\partial{b}} = 3\times3=9\\ $$

Cross Entropy Loss ¶

The mathematical expression of Cross Entropy Loss is described here .

A computation graph is shown as below that $[]$ are the outputs of block operation $()$.

$[W]_{C \times D}$ -\

     \

     (Wx) -> [s]_{Cx1} -> (softmax(s)) -> [p]_{Cx1} -> (cross_entropy_loss(p, label=c)) -> [loss]_{1x1}

     /

$[x]_{D \times 1}$ -/

Consider a single case that $[W]_{3x2}$, $[x]_{2x1}$ and correct label index is $0$.

For the forward pass, we can get: $$ [s]_{3x1} = Wx \\ [p]_{3x1} = softmax(s) \\ p_s = p[0] = \begin{bmatrix}1, 0, 0\end{bmatrix}p\\ L = -log(p_s) $$

For the backward pass, we can derive the intermediate gradients one by one. $$ \frac{\partial{L}}{\partial{L}} = 1 \\ \frac{\partial{L}}{\partial{p_s}} = -\frac{1}{p_s} \\ \frac{\partial{L}}{\partial{p}} = -\frac{1}{p_s} \begin{bmatrix}1 \\ 0\\ 0\end{bmatrix} = \begin{bmatrix}-\frac{1}{p_s} \\ 0\\ 0\end{bmatrix} $$ Based on Softmax :

$$ \frac{\partial{L}}{\partial{s}} = J^p_s\frac{\partial{L}}{\partial{p}} = \begin{bmatrix} &p_0(1-p_0), &-p_0p_1, &-p_0p_2 \\ &-p_1p_0, &p_1(1-p_1), &-p_1p_2\\ &-p_2p_0, &-p_2p_1, &p_2(1-p_2) \end{bmatrix} \begin{bmatrix}-\frac{1}{p_s} \\ 0\\ 0\end{bmatrix} = \begin{bmatrix}p_0-1 \\ p_1\\ p_2\end{bmatrix} $$

Based on Vector & Matrix product : $$ \frac{\partial{L}}{\partial{W}} = \frac{\partial{L}}{\partial{s}}x^T \\ \frac{\partial{L}}{\partial{x}} = W^T\frac{\partial{L}}{\partial{s}} \\ $$

In [63]:

# Cross Entropy Loss

import torch
torch.set_printoptions(precision=4)

W = torch.randn(3, 2, requires_grad = True)
x = torch.randn(2, requires_grad = True)
c = torch.tensor(0)

# Forward
s = torch.matmul(W, x)
s.retain_grad()
# Numeric trick to prevent overflow of using exp().
exp_s = torch.exp(s - torch.max(s)) 
p = exp_s / torch.sum(exp_s)
loss = -torch.log(p[c])
print(f'\nProbabily distribution\n {p}')
print(f'\nCross entropy loss is {loss:.4f}')

# Backward
loss.backward()
ds = torch.tensor([p[0]-1, p[1], p[2]])
dW = ds.view(-1,1).matmul(x.view(1,-1))
dx = W.T.matmul(ds)
print(f'\nGradients of s: \n {ds}')
assert(torch.allclose(ds, s.grad))
print(f'\nGradients of W: \n {dW}')
assert(torch.allclose(dW, W.grad))
print(f'\nGradients of x: \n {dx}')

#

Probabily distribution
 tensor([0.0391, 0.6168, 0.3442], grad_fn=<DivBackward0>)

Cross entropy loss is 3.2429

Gradients of s: 
 tensor([-0.9609,  0.6168,  0.3442])

Gradients of W: 
 tensor([[-0.9903,  0.2445],
        [ 0.6356, -0.1569],
        [ 0.3547, -0.0876]], grad_fn=<MmBackward>)

Gradients of x: 
 tensor([ 2.1871, -0.7751], grad_fn=<MvBackward>)

Hinge Loss ¶

The mathematical expression of Cross Entropy Loss is described here .

A computation graph is shown as below that $[]$ are the outputs of block operation $()$.

$[W]_{C \times D}$ -\

     \

     (Wx)->[s]_{Cx1}->(s_i-s_c+1, 0)->[t]->(max(t, 0)) ->[z] ->(sum(z)) ->[loss]_{1x1}

     /

$[x]_{D \times 1}$ -/

Consider a single case that $[W]_{3x2}$, $[x]_{2x1}$ and correct label index is $0$.

For the forward pass, we can get: $$ [s]_{3x1} = Wx \\ [t]_{3x1} = \begin{bmatrix}0 \\ s_1-s_0+1\\ s_2-s_0+1\end{bmatrix} \\ [z]_{3x1} = \begin{bmatrix}0 \\ m(s_1-s_0+1) \\ n(s_2-s_0+1) \end{bmatrix} \\ L = z_1 + z_2 $$ Where $m=0$ if $s_1-s_0+1<=0$, otherwise $m=1$, $n=0$ if $s_2-s_0+1<=0$, otherwise $n=1$.

For the backward pass, we can derive the intermediate gradients one by one. $$ \frac{\partial{L}}{\partial{L}} = 1 \\ \frac{\partial{L}}{\partial{z}} = \begin{bmatrix}0 \\ 1\\ 1\end{bmatrix} \\ \frac{\partial{L}}{\partial{t}} = J^z_t\frac{\partial{L}}{\partial{z}}= \begin{bmatrix} &0, &0, &0 \\ &0, &m, &0 \\ &0, &0, &n \end{bmatrix} \begin{bmatrix}0 \\ 1\\ 1\end{bmatrix} = \begin{bmatrix}0 \\ m \\ n\end{bmatrix} \\ \frac{\partial{L}}{\partial{s}} = J^t_s\frac{\partial{L}}{\partial{t}} = \begin{bmatrix} &0, &-1, &-1 \\ &0, &1, &0 \\ &0, &0, &1 \end{bmatrix} \begin{bmatrix}0 \\ m \\ n\end{bmatrix} = \begin{bmatrix}-(m+n) \\ m \\ n\end{bmatrix} $$ Based on Vector & Matrix product : $$ \frac{\partial{L}}{\partial{W}} = \frac{\partial{L}}{\partial{s}}x^T \\ \frac{\partial{L}}{\partial{x}} = W^T\frac{\partial{L}}{\partial{s}} \\ $$

In [64]:

# Cross Entropy Loss

import torch
torch.set_printoptions(precision=4)

W = torch.tensor([[1., 2.], [3., 4.], [1., 3.]], requires_grad = True)
x = torch.tensor([2., 1.],  requires_grad = True)
c = torch.tensor(0)

# Forward
s = torch.matmul(W, x)
s.retain_grad()
t = s-s[c]+1
t[c] = 0
z = torch.clamp(t, 0)
m = float(t[1] > 0)
n = float(t[2] > 0)
loss = z[1] + z[2]
print(f'\Hinge loss is {loss:.4f}')

# Backward
loss.backward()
ds = torch.tensor([-m-n, m, n])
dW = ds.view(-1,1).matmul(x.view(1,-1))
dx = W.T.matmul(ds)
print(f'\nGradients of s: \n {ds}')
assert(torch.allclose(ds, s.grad))

print(f'\nGradients of W: \n {dW}')
assert(torch.allclose(dW, W.grad))
print(f'\nGradients of x: \n {dx}')

#

\Hinge loss is 9.0000

Gradients of s: 
 tensor([-2.,  1.,  1.])

Gradients of W: 
 tensor([[-4., -2.],
        [ 2.,  1.],
        [ 2.,  1.]], grad_fn=<MmBackward>)

Gradients of x: 
 tensor([2., 3.], grad_fn=<MvBackward>)

Xipeng Wang

Machine Learning Notes - Backpropagation

Backpropagation Introduction ¶

Chain Rule ¶

Local Gradient ¶

Common Computation Blocks - Forward and Backward Pass ¶

Vector dot product ¶

Vector product ¶

Vector & Matrix product ¶

Matrix & Matrix product ¶

Softmax ¶

Computation Graph ¶

Introduction ¶

Cross Entropy Loss ¶

Hinge Loss ¶