Machine Learning Notes - Backpropagation

Backpropagation Introduction

Chain Rule

$$ y = f_1(x) \\ L = f_2(y) $$

Chain rule:

$$ \frac{\partial{L}}{\partial{y}} = \frac{\partial{f_2(y)}}{\partial{y}} \\ \frac{\partial{L}}{\partial{x}} = \frac{\partial{L}}{\partial{y}}\frac{\partial{y}}{\partial{x}} = \frac{\partial{f_2(y)}}{\partial{y}} \frac{\partial{f_1(x)}}{\partial{x}} $$
In [56]:
dL/dx: 6
Pytorch autograd tensor([6.], dtype=torch.float64)
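
A minimal sketch of this check, assuming for illustration $f_1(x) = x^2$ and $f_2(y) = 3y$ at $x = 1$ (one choice of functions for which $\frac{\partial L}{\partial x} = 6$):

```python
import torch

# Assumed functions for this sketch: f1(x) = x**2, f2(y) = 3*y, evaluated at x = 1.
x = torch.tensor([1.0], dtype=torch.float64, requires_grad=True)
y = x ** 2                     # y = f1(x)
L = 3 * y                      # L = f2(y)

# Manual chain rule: dL/dy = 3 and dy/dx = 2x, so dL/dx = 6x = 6.
dL_dx = 3 * 2 * x.detach()
print("dL/dx:", int(dL_dx.item()))

# PyTorch autograd for comparison.
L.backward()
print("Pytorch autograd", x.grad)
```
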
$$ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = f(\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}) = \begin{bmatrix} f_{1}(x_1, x_2, x_3) \\ f_{2}(x_1, x_2, x_3) \end{bmatrix}\\ L = g(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}) \\ $$

Chain rule:

$$ \frac{\partial{L}}{\partial{y}} = \begin{bmatrix} \frac{\partial{g(y)}}{\partial{y_1}} \\ \frac{\partial{g(y)}}{\partial{y_2}} \end{bmatrix}\\ \frac{\partial{L}}{\partial{x}} = J_x^y \frac{\partial{L}}{\partial{y}} = \begin{bmatrix} \frac{\partial{f_1(x)}}{\partial{x_1}} & \frac{\partial{f_2(x)}}{\partial{x_1}}\\ \frac{\partial{f_1(x)}}{\partial{x_2}} & \frac{\partial{f_2(x)}}{\partial{x_2}}\\ \frac{\partial{f_1(x)}}{\partial{x_3}} & \frac{\partial{f_2(x)}}{\partial{x_3}} \end{bmatrix} \frac{\partial{L}}{\partial{y}} $$
In [57]:
dL/dx: tensor([1, 3, 2])
Pytorch autograd tensor([1., 3., 2.], dtype=torch.float64)

Local Gradient

... ---> $\textbf{x}$ ---> $f(\textbf{x})$ ---> $\textbf{y}$ ---> ... ---> L

$$ \frac{\partial{L}}{\partial{\textbf{x}}} = J^{\textbf{y}}_\textbf{x}\frac{\partial{L}}{\partial{\textbf{y}}} $$

$J^{\textbf{y}}_\textbf{x}$ is a Jacobian matrix whose $(i, j)$ entry is $\frac{\partial{y_j}}{\partial{x_i}}$; it has the same number of rows as $\textbf{x}$ and the same number of columns as $\textbf{y}$.
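
A small sketch of this convention (the function below is chosen only for illustration and is not one of the examples above). `torch.autograd.functional.jacobian` returns the standard $\frac{\partial y_i}{\partial x_j}$ layout, so it is transposed here to match $J^{\textbf{y}}_\textbf{x}$ as defined above:

```python
import torch

# Illustrative f: R^3 -> R^2 and downstream L = g(y); chosen only for this sketch.
def f(x):
    return torch.stack([x[0] * x[1], x[1] * x[2]])

x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64, requires_grad=True)
y = f(x)
L = y[0] + 2 * y[1]                       # g(y) = y_1 + 2*y_2, so dL/dy = [1, 2]

# The standard Jacobian has shape (len(y), len(x)); the notes' J^y_x is its
# transpose, with as many rows as x.
J_std = torch.autograd.functional.jacobian(f, x.detach())
J = J_std.T
dL_dy = torch.tensor([1.0, 2.0], dtype=torch.float64)
print("dL/dx via J^y_x @ dL/dy:", J @ dL_dy)

# PyTorch autograd for comparison.
L.backward()
print("Pytorch autograd        ", x.grad)
```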

Common Computation Blocks - Forward and Backward Pass

Vector dot product

Forward pass: $$ z = x^Ty $$

Backward pass: $$ \frac{\partial{L}}{\partial{x}} = \frac{\partial{L}}{\partial{z}}y \\ \frac{\partial{L}}{\partial{y}} = \frac{\partial{L}}{\partial{z}}x $$ Where $[x]_{n \times 1}, [y]_{n \times 1}$

In [58]:
Pytorch autograd dz: 1.0
Pytorch autograd dx: tensor([3., 2., 1.], dtype=torch.float64)
Pytorch autograd dy: tensor([1., 2., 3.], dtype=torch.float64)
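
A minimal sketch of this block, assuming for illustration $x = [1, 2, 3]^T$, $y = [3, 2, 1]^T$, and $L = z$ (values consistent with the printed gradients):

```python
import torch

# Assumed values for this sketch: x = [1, 2, 3], y = [3, 2, 1], L = z.
x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64, requires_grad=True)
y = torch.tensor([3.0, 2.0, 1.0], dtype=torch.float64, requires_grad=True)
z = x @ y                                 # forward pass: z = x^T y
z.backward()                              # backward pass with dL/dz = 1, i.e. L = z

dL_dz = 1.0
print("Manual dx:", dL_dz * y.detach())   # dL/dx = (dL/dz) y
print("Manual dy:", dL_dz * x.detach())   # dL/dy = (dL/dz) x
print("Pytorch autograd dx:", x.grad)
print("Pytorch autograd dy:", y.grad)
```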

Vector outer product

Forward pass: $$ Z = xy $$

Backward pass: $$ \frac{\partial{L}}{\partial{x}} = \frac{\partial{L}}{\partial{Z}}y^T \\ \frac{\partial{L}}{\partial{y}} = x^T\frac{\partial{L}}{\partial{Z}} $$ Where $[x]_{m \times 1}, [y]_{1 \times n}, [Z]_{m \times n}$

In [59]:
Pytorch autograd dz:
 tensor([[1., 1.],
        [1., 1.],
        [1., 1.]], dtype=torch.float64)
Pytorch autograd dx:
 tensor([[5.],
        [5.],
        [5.]], dtype=torch.float64)
Pytorch autograd dy:
 tensor([[6., 6.]], dtype=torch.float64)
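
A minimal sketch of this block, assuming for illustration $x = [1, 2, 3]^T$, $y = [2, 3]$, and $L = \mathrm{sum}(Z)$ (values consistent with the printed gradients):

```python
import torch

# Assumed values for this sketch:
# x = [1, 2, 3]^T, y = [2, 3], and L = sum(Z), so dL/dZ is a matrix of ones.
x = torch.tensor([[1.0], [2.0], [3.0]], dtype=torch.float64, requires_grad=True)
y = torch.tensor([[2.0, 3.0]], dtype=torch.float64, requires_grad=True)
Z = x @ y                                 # forward pass: Z = x y, shape (3, 2)
Z.sum().backward()

dL_dZ = torch.ones(3, 2, dtype=torch.float64)
print("Manual dx:\n", dL_dZ @ y.detach().T)   # dL/dx = (dL/dZ) y^T
print("Manual dy:\n", x.detach().T @ dL_dZ)   # dL/dy = x^T (dL/dZ)
print("Pytorch autograd dx:\n", x.grad)
print("Pytorch autograd dy:\n", y.grad)
```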

Vector & Matrix product

Forward pass: $$ z = Wx $$

Backward pass: $$ \frac{\partial{L}}{\partial{x}} = W^T\frac{\partial{L}}{\partial{z}} \\ \frac{\partial{L}}{\partial{W}} = \frac{\partial{L}}{\partial{z}}x^T $$ Where $[x]_{D \times 1}, [W]_{C \times D}, [z]_{C \times 1}$

In [60]:
Pytorch autograd dz:
 tensor([[1.],
        [1.]], dtype=torch.float64)
Pytorch autograd dx:
 tensor([[3.],
        [3.]], dtype=torch.float64)
Pytorch autograd dW:
 tensor([[1., 2.],
        [1., 2.]], dtype=torch.float64)

Matrix & Matrix product

Forward pass: $$ Z = XW $$

Backward pass: $$ \frac{\partial{L}}{\partial{X}} = \frac{\partial{L}}{\partial{Z}}W^T \\ \frac{\partial{L}}{\partial{W}} = X^T\frac{\partial{L}}{\partial{Z}} $$ Where $[X]_{N \times D}, [W]_{D \times C}, [Z]_{N \times C}$

In [61]:
Pytorch autograd dZ:
 tensor([[1., 1., 1.],
        [1., 1., 1.]], dtype=torch.float64)
Pytorch autograd dX:
 tensor([[6., 6., 6.],
        [6., 6., 6.]], dtype=torch.float64)
Pytorch autograd dW:
 tensor([[4., 4., 4.],
        [4., 4., 4.],
        [4., 4., 4.]], dtype=torch.float64)

Softmax

Forward pass: $$ t = \sum_j{e^{s_j}} \\ p_i = \frac{e^{s_i}}{t} $$

Backward pass: $$ \frac{\partial{L}}{\partial{s}} = J^p_s\frac{\partial{L}}{\partial{p}} \\ [J^p_s]_{ij} = \begin{cases} -p_ip_j &\text{$i \neq j$}\\ p_i(1-p_i) &\text{otherwise} \end{cases} $$ Where $[s]_{C \times 1}, [p]_{C \times 1}$

In [62]:
Pytorch autograd dp:
 tensor([[-3.7183],
        [-1.3679]], dtype=torch.float64)
Pytorch autograd ds:
 tensor([[-0.4621],
        [ 0.4621]], dtype=torch.float64)
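
A minimal sketch of this block, assuming for illustration $s = [0, 1]^T$ and $L = -\sum_i \log p_i$ (values consistent with the printed numbers):

```python
import torch

# Assumed values for this sketch: s = [0, 1]^T and L = -sum(log p).
s = torch.tensor([[0.0], [1.0]], dtype=torch.float64, requires_grad=True)
p = torch.softmax(s, dim=0)
p.retain_grad()                           # keep dL/dp for comparison
L = -torch.log(p).sum()
L.backward()

# Manual backward: dL/dp = -1/p, then dL/ds = J^p_s dL/dp, where
# J[i][j] = p_i(1 - p_i) if i == j else -p_i p_j.
p_ = p.detach()
dL_dp = -1.0 / p_
J = torch.diagflat(p_) - p_ @ p_.T        # C x C softmax Jacobian
print("Manual dp:\n", dL_dp)
print("Manual ds:\n", J @ dL_dp)
print("Pytorch autograd dp:\n", p.grad)
print("Pytorch autograd ds:\n", s.grad)
```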

Computation Graph

Introduction

The idea of a computation graph is to divide the whole computation into small blocks. During the forward pass, we compute each block's output and cache it, because it is needed to compute gradients in the backward pass. During the backward pass, we compute the gradients block by block. This is exactly the chain rule, but organizing the computation this way makes it clear where and how the chain rule is applied.

For example, if the computation is $ L = (2x + 3y)z$, where $x=1$, $y =2$, $z=3$, we can break it down as: $$ a = 2x \\ b = 3y \\ c = a+b \\ L = cz $$

In the forward pass, we get: $$ a = 2x = 2 \\ b = 3y = 6\\ c = a+b = 8\\ L = cz = 24 $$

In the backward pass, we get: $$ \frac{\partial{L}}{\partial{L}} = 1 \\ \frac{\partial{L}}{\partial{z}} = \frac{\partial{L}}{\partial{L}}\frac{\partial{(cz)}}{\partial{z}} = 1\times8=8\\ \frac{\partial{L}}{\partial{c}} = \frac{\partial{L}}{\partial{L}}\frac{\partial{(cz)}}{\partial{c}} = 1\times3=3\\ \frac{\partial{L}}{\partial{a}} = \frac{\partial{c}}{\partial{a}}\frac{\partial{L}}{\partial{c}} = 1\times3=3\\ \frac{\partial{L}}{\partial{b}} = \frac{\partial{c}}{\partial{b}}\frac{\partial{L}}{\partial{c}} = 1\times3=3\\ \frac{\partial{L}}{\partial{x}} = \frac{\partial{a}}{\partial{x}}\frac{\partial{L}}{\partial{a}} = 2\times3=6\\ \frac{\partial{L}}{\partial{y}} = \frac{\partial{b}}{\partial{y}}\frac{\partial{L}}{\partial{b}} = 3\times3=9\\ $$
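
The hand-computed gradients can be checked directly with autograd (a small sketch of the same graph):

```python
import torch

# Check the hand-computed gradients with autograd.
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(3.0, requires_grad=True)

a = 2 * x
b = 3 * y
c = a + b
L = c * z
L.backward()

print("L =", L.item())                        # 24.0
print("dL/dx =", x.grad.item())               # 6.0
print("dL/dy =", y.grad.item())               # 9.0
print("dL/dz =", z.grad.item())               # 8.0
```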

Cross Entropy Loss

The mathematical expression of Cross Entropy Loss is described here.

A computation graph is shown below, where $[\,]$ denotes the output of the block operation $(\,)$.

$$ [W]_{C \times D}, [x]_{D \times 1} \rightarrow (Wx) \rightarrow [s]_{C \times 1} \rightarrow (\mathrm{softmax}(s)) \rightarrow [p]_{C \times 1} \rightarrow (\text{cross\_entropy\_loss}(p, \text{label}=c)) \rightarrow [loss]_{1 \times 1} $$

Consider a single example where $[W]_{3 \times 2}$, $[x]_{2 \times 1}$, and the correct label index is $0$.

For the forward pass, we can get: $$ [s]_{3 \times 1} = Wx \\ [p]_{3 \times 1} = \mathrm{softmax}(s) \\ p_s = p[0] = \begin{bmatrix}1 & 0 & 0\end{bmatrix}p\\ L = -\log(p_s) $$

For the backward pass, we can derive the intermediate gradients one by one. $$ \frac{\partial{L}}{\partial{L}} = 1 \\ \frac{\partial{L}}{\partial{p_s}} = -\frac{1}{p_s} \\ \frac{\partial{L}}{\partial{p}} = -\frac{1}{p_s} \begin{bmatrix}1 \\ 0\\ 0\end{bmatrix} = \begin{bmatrix}-\frac{1}{p_s} \\ 0\\ 0\end{bmatrix} $$ Based on Softmax:

$$ \frac{\partial{L}}{\partial{s}} = J^p_s\frac{\partial{L}}{\partial{p}} = \begin{bmatrix} p_0(1-p_0) & -p_0p_1 & -p_0p_2 \\ -p_1p_0 & p_1(1-p_1) & -p_1p_2\\ -p_2p_0 & -p_2p_1 & p_2(1-p_2) \end{bmatrix} \begin{bmatrix}-\frac{1}{p_s} \\ 0\\ 0\end{bmatrix} = \begin{bmatrix}p_0-1 \\ p_1\\ p_2\end{bmatrix} $$

Based on Vector & Matrix product: $$ \frac{\partial{L}}{\partial{W}} = \frac{\partial{L}}{\partial{s}}x^T \\ \frac{\partial{L}}{\partial{x}} = W^T\frac{\partial{L}}{\partial{s}} \\ $$

In [63]:
Probability distribution
 tensor([0.0391, 0.6168, 0.3442], grad_fn=<DivBackward0>)

Cross entropy loss is 3.2429

Gradients of s: 
 tensor([-0.9609,  0.6168,  0.3442])

Gradients of W: 
 tensor([[-0.9903,  0.2445],
        [ 0.6356, -0.1569],
        [ 0.3547, -0.0876]], grad_fn=<MmBackward>)

Gradients of x: 
 tensor([ 2.1871, -0.7751], grad_fn=<MvBackward>)
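
A minimal sketch of this derivation. The exact $W$ and $x$ used above are assumptions here (random values), so what matters is the agreement between the manual formulas and autograd, not the particular numbers:

```python
import torch

torch.manual_seed(0)

# Random W and x (the originals are not reproduced here); correct label index is 0.
W = torch.randn(3, 2, dtype=torch.float64, requires_grad=True)
x = torch.randn(2, 1, dtype=torch.float64, requires_grad=True)
c = 0

s = W @ x                                 # scores, (3, 1)
s.retain_grad()
p = torch.softmax(s, dim=0)               # probabilities, (3, 1)
L = -torch.log(p[c, 0])                   # cross entropy loss for label c
L.backward()

# Manual backward: dL/ds = p with 1 subtracted at the correct class,
# then dL/dW = (dL/ds) x^T and dL/dx = W^T (dL/ds).
ds = p.detach().clone()
ds[c, 0] -= 1.0
print("Manual ds:\n", ds, "\nAutograd ds:\n", s.grad)
print("Manual dW:\n", ds @ x.detach().T, "\nAutograd dW:\n", W.grad)
print("Manual dx:\n", W.detach().T @ ds, "\nAutograd dx:\n", x.grad)
```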

Hinge Loss

The mathematical expression of Hinge Loss is described here.

A computation graph is shown below, where $[\,]$ denotes the output of the block operation $(\,)$.

$$ [W]_{C \times D}, [x]_{D \times 1} \rightarrow (Wx) \rightarrow [s]_{C \times 1} \rightarrow (s_i - s_c + 1) \rightarrow [t]_{C \times 1} \rightarrow (\max(t, 0)) \rightarrow [z]_{C \times 1} \rightarrow (\mathrm{sum}(z)) \rightarrow [loss]_{1 \times 1} $$

Consider a single example where $[W]_{3 \times 2}$, $[x]_{2 \times 1}$, and the correct label index is $0$.

For the forward pass, we can get: $$ [s]_{3 \times 1} = Wx \\ [t]_{3 \times 1} = \begin{bmatrix}0 \\ s_1-s_0+1\\ s_2-s_0+1\end{bmatrix} \\ [z]_{3 \times 1} = \begin{bmatrix}0 \\ m(s_1-s_0+1) \\ n(s_2-s_0+1) \end{bmatrix} \\ L = z_1 + z_2 $$ Where $m=0$ if $s_1-s_0+1 \leq 0$ and $m=1$ otherwise, and $n=0$ if $s_2-s_0+1 \leq 0$ and $n=1$ otherwise.

For the backward pass, we can derive the intermediate gradients one by one. $$ \frac{\partial{L}}{\partial{L}} = 1 \\ \frac{\partial{L}}{\partial{z}} = \begin{bmatrix}0 \\ 1\\ 1\end{bmatrix} \\ \frac{\partial{L}}{\partial{t}} = J^z_t\frac{\partial{L}}{\partial{z}}= \begin{bmatrix} 0 & 0 & 0 \\ 0 & m & 0 \\ 0 & 0 & n \end{bmatrix} \begin{bmatrix}0 \\ 1\\ 1\end{bmatrix} = \begin{bmatrix}0 \\ m \\ n\end{bmatrix} \\ \frac{\partial{L}}{\partial{s}} = J^t_s\frac{\partial{L}}{\partial{t}} = \begin{bmatrix} 0 & -1 & -1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}0 \\ m \\ n\end{bmatrix} = \begin{bmatrix}-(m+n) \\ m \\ n\end{bmatrix} $$ Based on Vector & Matrix product: $$ \frac{\partial{L}}{\partial{W}} = \frac{\partial{L}}{\partial{s}}x^T \\ \frac{\partial{L}}{\partial{x}} = W^T\frac{\partial{L}}{\partial{s}} \\ $$

In [64]:
Hinge loss is 9.0000

Gradients of s: 
 tensor([-2.,  1.,  1.])

Gradients of W: 
 tensor([[-4., -2.],
        [ 2.,  1.],
        [ 2.,  1.]], grad_fn=<MmBackward>)

Gradients of x: 
 tensor([2., 3.], grad_fn=<MvBackward>)
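
A minimal sketch of this derivation, assuming for illustration $W = \begin{bmatrix}0 & 0 \\ 1 & 1 \\ 1 & 2\end{bmatrix}$ and $x = [2, 1]^T$ (one choice consistent with the printed results):

```python
import torch

# Assumed values for this sketch; correct label index is 0.
W = torch.tensor([[0.0, 0.0], [1.0, 1.0], [1.0, 2.0]], requires_grad=True)
x = torch.tensor([2.0, 1.0], requires_grad=True)
c = 0

s = W @ x                                 # scores, (3,)
s.retain_grad()
t = s - s[c] + 1.0                        # margins s_i - s_c + 1
t = torch.where(torch.arange(3) == c, torch.zeros_like(t), t)   # correct-class margin is 0
z = torch.clamp(t, min=0.0)               # max(t, 0)
L = z.sum()
L.backward()
print("Hinge loss is %.4f" % L.item())

# Manual backward: m, n indicate which margins are active (positive).
active = (t.detach() > 0).float()         # [0, m, n]; the correct-class entry is already 0
ds = active.clone()
ds[c] = -active.sum()                     # -(m + n) at the correct class
print("Manual ds:", ds, "Autograd ds:", s.grad)
print("Manual dW:\n", torch.outer(ds, x.detach()), "\nAutograd dW:\n", W.grad)
print("Manual dx:", W.detach().T @ ds, "Autograd dx:", x.grad)
```
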
This blog is converted from machine-learning-backpropagation.ipynb
Written on May 7, 2021