Backpropagation#
Backpropagation is the algorithm used to train multi-layer artificial neural networks with gradient descent. It applies the chain rule to compute the gradient of the loss function with respect to the weights of each layer, and those gradients are then used to update the weights and minimize the loss.
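In symbols: for a weight $w$ feeding a node with net input $z$ and output $y = f(z)$, the chain rule factors the gradient as

$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial y}\cdot\frac{\partial y}{\partial z}\cdot\frac{\partial z}{\partial w},$$

and each factor is cheap to evaluate once a forward pass has been run.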
Simple Example Calculation#
Symbol | Meaning |
---|---|
$ w_{ij} $ | Weight from node $j$ to node $i$ (stored as `w[(j, i)]` in the code) |
$ z_i $ | Net input of node $i$ |
$ y_i $ | Output of node $i$ |
$ E = \frac{1}{2}(y_p-y_a)^2$ | Loss, where $y_p$ is the prediction and $y_a$ the target |
$ f(x) = \frac{1}{1+e^{-x}}$ | Activation function (sigmoid) |
In the example below, nodes 1 and 2 are inputs ($z_1 = 0.35$, $z_2 = 0.9$), nodes 3 and 4 form the hidden layer, node 5 is the output, and the target is $y_a = 0.5$. For example, to update $w_{53}$ (the weight from node 3 to node 5):
The main idea is that $E$ does not depend on $w_{53}$ directly, so we use the chain rule to compute the gradient indirectly:
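$$\frac{\partial E}{\partial w_{53}} = \frac{\partial E}{\partial y_5}\cdot\frac{\partial y_5}{\partial z_5}\cdot\frac{\partial z_5}{\partial w_{53}} = (y_p - y_a)\,f'(z_5)\,y_3,$$

since $y_p = y_5 = f(z_5)$ and $z_5 = w_{53}\,y_3 + w_{54}\,y_4$. For a hidden-layer weight such as $w_{31}$, the chain simply extends through node 5:

$$\frac{\partial E}{\partial w_{31}} = (y_p - y_a)\,f'(z_5)\,w_{53}\,f'(z_3)\,y_1.$$

These are exactly the products computed in `back()` below.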
Code Implementation#
from math import exp, sqrt


def f(x):
    """returns sigmoid(x)"""
    return 1 / (1 + exp(-x))


def d_f(x):
    """returns d(sigmoid(x))/dx"""
    return f(x) * (1 - f(x))


def E(yp, ya):
    """returns the squared error"""
    return 1/2*(yp-ya)**2


def d_E(yp, ya):
    """returns dE/d(yp)"""
    return yp - ya


ya = 0.5                                         # target output
z = {1: 0.35, 2: 0.9, 3: 0.0, 4: 0.0, 5: 0.0}    # net input of each node (nodes 1 and 2 hold the raw inputs)
w = {(1, 3): 0.1, (2, 3): 0.8, (1, 4): 0.4,      # weights, keyed by (source node, destination node)
     (2, 4): 0.6, (3, 5): 0.3, (4, 5): 0.9}
y = {i: 0.0 for i in range(1, 6)}                # output of each node
s = {(1, 3): 0.0, (2, 3): 0.0, (1, 4): 0.0,      # running average of squared gradients,
     (2, 4): 0.0, (3, 5): 0.0, (4, 5): 0.0}      # used by the RMSProp update further below
def forward():
    """forward calculation"""
    y[1], y[2] = z[1], z[2]
    z[3] = w[1, 3] * y[1] + w[2, 3] * y[2]
    z[4] = w[1, 4] * y[1] + w[2, 4] * y[2]
    y[3], y[4] = f(z[3]), f(z[4])
    z[5] = w[3, 5] * y[3] + w[4, 5] * y[4]
    y[5] = f(z[5])
    e = E(y[5], ya)
    print(f'yp = {y[5]}, Error = {e}')
    return e
def update(n, g):
    """update weights"""
    w[n] -= g
def back():
    """back propagation"""
    # output-layer weights: dE/dw = dE/dy5 * dy5/dz5 * dz5/dw
    g = d_E(y[5], ya) * d_f(z[5]) * y[3]
    update((3, 5), g)
    g = d_E(y[5], ya) * d_f(z[5]) * y[4]
    update((4, 5), g)
    # hidden-layer weights: the chain extends through node 5
    # (note: w[3, 5] and w[4, 5] were already updated above)
    g = d_E(y[5], ya) * d_f(z[5]) * w[3, 5] * d_f(z[3]) * y[1]
    update((1, 3), g)
    g = d_E(y[5], ya) * d_f(z[5]) * w[3, 5] * d_f(z[3]) * y[2]
    update((2, 3), g)
    g = d_E(y[5], ya) * d_f(z[5]) * w[4, 5] * d_f(z[4]) * y[1]
    update((1, 4), g)
    g = d_E(y[5], ya) * d_f(z[5]) * w[4, 5] * d_f(z[4]) * y[2]
    update((2, 4), g)
if __name__ == '__main__':
    step = 1
    while True:
        print(f'step = {step}:', end=' ')
        if forward() < 1e-7:
            break
        step += 1
        back()
Running the code, we find that it takes 111 iterations to bring the loss below the 1e-7 threshold.
➜ python3 main.py
step = 1: yp = 0.6902834929076443, Error = 0.01810390383656677
step = 2: yp = 0.6820312027460466, Error = 0.016567679386586154
step = 3: yp = 0.6739592936119999, Error = 0.015130917916992996
step = 4: yp = 0.6660860653430867, Error = 0.013792290550574028
...
...
...
step = 110: yp = 0.500455314229855, Error = 1.036555239542297e-07
step = 111: yp = 0.5004302844701667, Error = 9.2572362633313e-08
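Before changing the optimizer, it is worth sanity-checking the hand-derived gradients against a central finite difference. The sketch below is a minimal check, assuming the `forward()`, `w`, `y`, `z`, `ya`, `d_E` and `d_f` defined above; `numeric_grad` is a new helper introduced only for this check, and `forward()` prints a line on every call:

def numeric_grad(key, eps=1e-6):
    """central-difference estimate of dE/dw[key]"""
    old = w[key]
    w[key] = old + eps
    e_plus = forward()      # loss with the weight nudged up
    w[key] = old - eps
    e_minus = forward()     # loss with the weight nudged down
    w[key] = old            # restore the original weight
    return (e_plus - e_minus) / (2 * eps)

forward()                                       # refresh y and z for the current weights
analytic = d_E(y[5], ya) * d_f(z[5]) * y[3]     # same expression back() uses for w[(3, 5)]
print(abs(analytic - numeric_grad((3, 5))))     # should be close to zero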
Optimization with RMSProp#
def update(n, g):
    """update weights with an RMSProp-scaled step"""
    lr = 0.1
    beta = 0.999
    epsilon = 0.01
    s[n] = beta * s[n] + (1 - beta) * g**2       # running average of squared gradients
    w[n] -= lr / (sqrt(s[n]) + epsilon) * g      # scale the step by 1/sqrt(s[n])
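Written out, the update keeps a running average $s$ of squared gradients and divides each step by its square root:

$$s \leftarrow \beta\,s + (1-\beta)\,g^2, \qquad w \leftarrow w - \frac{\eta}{\sqrt{s}+\epsilon}\,g,$$

with $\eta = 0.1$, $\beta = 0.999$ and $\epsilon = 0.01$ as in the code.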
The `update` above replaces the plain gradient step with an RMSProp update. With this change, the loss drops below the 1e-7 threshold after only 11 iterations.
➜ python3 main.py
step = 1: yp = 0.6902834929076443, Error = 0.01810390383656677
step = 2: yp = 0.6118790538896405, Error = 0.006258461349620539
step = 3: yp = 0.5593494607066376, Error = 0.0017611792430843605
step = 4: yp = 0.5301730589054645, Error = 0.00045520674185631563
step = 5: yp = 0.5151484953395415, Error = 0.00011473845552605596
step = 6: yp = 0.5075810359788381, Error = 2.873605325621872e-05
step = 7: yp = 0.5037909668833773, Error = 7.185714955431785e-06
step = 8: yp = 0.5018953359787233, Error = 1.7961492361214424e-06
step = 9: yp = 0.5009475298860605, Error = 4.48906442488917e-07
step = 10: yp = 0.5004736747645289, Error = 1.1218389127573745e-07
step = 11: yp = 0.5002367820786155, Error = 2.803287637675012e-08
Implementation with PyTorch#
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self) -> None:
        super(Net, self).__init__()
        hidden = nn.Linear(2, 2, bias=False)
        output = nn.Linear(2, 1, bias=False)
        # same initial weights as the manual version
        hidden.weight.data = torch.tensor([[0.1, 0.8], [0.4, 0.6]])
        output.weight.data = torch.tensor([[0.3, 0.9]])
        self.net = nn.Sequential(
            hidden,
            nn.Sigmoid(),
            output,
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)


def MSE(pred, act):
    """same 1/2 * squared-error loss as before"""
    return 1/2*torch.mean((pred - act)**2)


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.tensor([0.35, 0.9]).to(device)
y = torch.tensor(0.5).to(device)
net = Net().to(device)
optimizer = torch.optim.RMSprop(
    net.parameters(), lr=0.1, alpha=0.999, eps=1e-2)


def train():
    optimizer.zero_grad()
    pred = net(x)
    loss = MSE(pred, y)
    loss.backward()       # autograd takes the place of the hand-written back()
    optimizer.step()      # RMSprop takes the place of the hand-written update()
    err = loss.detach().item()
    print(f'pred = {pred.item()}, error = {err}')
    return err


if __name__ == '__main__':
    step = 1
    while True:
        print(f'step = {step}:', end=' ')
        if train() < 1e-7:
            break
        step += 1
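In this version, `loss.backward()` computes the same chain-rule gradients as the hand-written `back()`, and `optimizer.step()` applies the RMSprop update that the manual `update()` implemented; the hyperparameters `lr=0.1`, `alpha=0.999` and `eps=1e-2` match the constants used earlier.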