Backpropagation

Backpropagation is the fundamental algorithm that enables neural networks to learn. It efficiently computes the gradient of the loss function with respect to each weight and bias parameter in a neural network. These gradients indicate how much each parameter contributes to the error, allowing us to update them using gradient descent to minimize the loss.

Simple Feedforward Network Setup

Architecture:

  • An input layer with $n$ features
  • One hidden layer with $h$ units
  • An output layer with $m$ units

For clarity, let's assume:

  • Both the hidden layer and the output layer use the sigmoid activation function
  • The loss is the squared-error (MSE) loss
  • We process a single training example at a time

Notation

Let's define our variables clearly:

  • $x \in \mathbb{R}^n$: the input vector
  • $y \in \mathbb{R}^m$: the target output vector
  • $W_1 \in \mathbb{R}^{h \times n}$, $b_1 \in \mathbb{R}^h$: the hidden layer's weights and biases
  • $z_1 = W_1 x + b_1$: the hidden layer's pre-activation
  • $a_1 = \sigma(z_1)$: the hidden layer's activation
  • $W_2 \in \mathbb{R}^{m \times h}$, $b_2 \in \mathbb{R}^m$: the output layer's weights and biases
  • $z_2 = W_2 a_1 + b_2$: the output layer's pre-activation
  • $a_2 = \sigma(z_2)$: the network's output (the prediction)
  • $\mathcal{L}$: the loss

Where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid activation function.
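
As a concrete reference, here is a minimal NumPy sketch of the sigmoid and its derivative (the names `sigmoid` and `sigmoid_prime` are our own; the later sketches reuse them):

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid: 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Sigmoid derivative: sigma'(z) = sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)
```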

Forward Pass (Step-by-Step)

  1. Input Layer:

    • Take the input vector $x \in \mathbb{R}^n$
  2. Hidden Layer Computation:

    • Compute the weighted sum plus bias: $z_1 = W_1 x + b_1$
      • $W_1 x$ is a matrix-vector multiplication
      • Each element: $(z_1)_i = \sum_{j=1}^{n} (W_1)_{ij} x_j + (b_1)_i$
    • Apply the activation function: $a_1 = \sigma(z_1)$
      • Applied element-wise: $(a_1)_i = \sigma((z_1)_i) = \frac{1}{1 + e^{-(z_1)_i}}$
  3. Output Layer Computation:

    • Compute the weighted sum plus bias: $z_2 = W_2 a_1 + b_2$
      • Each element: $(z_2)_i = \sum_{j=1}^{h} (W_2)_{ij} (a_1)_j + (b_2)_i$
    • Apply the activation function: $a_2 = \sigma(z_2)$
      • Applied element-wise: $(a_2)_i = \sigma((z_2)_i) = \frac{1}{1 + e^{-(z_2)_i}}$
  4. Loss Calculation:

    • Compute the MSE loss: $\mathcal{L} = \frac{1}{2} \| a_2 - y \|^2 = \frac{1}{2} \sum_{i=1}^{m} ((a_2)_i - y_i)^2$ (these four steps are sketched in code after this list)
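
A minimal NumPy sketch of this forward pass (assuming the `sigmoid` helper defined earlier; `forward` and `mse_loss` are our own names, and shapes follow the notation above, e.g. $W_1 \in \mathbb{R}^{h \times n}$):

```python
def forward(x, W1, b1, W2, b2):
    # Hidden layer: z1 = W1 x + b1, a1 = sigmoid(z1).
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    # Output layer: z2 = W2 a1 + b2, a2 = sigmoid(z2).
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    # Return the intermediates too: backpropagation reuses them.
    return z1, a1, z2, a2

def mse_loss(a2, y):
    # L = 1/2 * ||a2 - y||^2
    return 0.5 * np.sum((a2 - y) ** 2)
```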

Backpropagation Steps

Backpropagation uses the chain rule of calculus to compute gradients, working backward from the output to the input.

Step 1: Gradient w.r.t. Output Activation

We begin by computing how the loss changes with respect to the output activations:

$\frac{\partial \mathcal{L}}{\partial a_2} = a_2 - y$

Derivation: $\mathcal{L} = \frac{1}{2} \| a_2 - y \|^2 = \frac{1}{2} \sum_{i=1}^{m} ((a_2)_i - y_i)^2$

Taking the derivative with respect to $(a_2)_j$: $\frac{\partial \mathcal{L}}{\partial (a_2)_j} = \frac{\partial}{\partial (a_2)_j} \left( \frac{1}{2} \sum_{i=1}^{m} ((a_2)_i - y_i)^2 \right)$

Only the term where $i = j$ contributes to the derivative: $\frac{\partial \mathcal{L}}{\partial (a_2)_j} = \frac{1}{2} \cdot 2 \cdot ((a_2)_j - y_j) \cdot 1 = (a_2)_j - y_j$

Therefore, in vector form: $\frac{\partial \mathcal{L}}{\partial a_2} = a_2 - y$
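
As a quick sanity check, a finite-difference approximation should agree with $a_2 - y$. The values below are made up for illustration, and `mse_loss` is the helper from the forward-pass sketch:

```python
a2 = np.array([0.8, 0.2])
y  = np.array([1.0, 0.0])

analytic = a2 - y                      # dL/da2 from the formula above
eps = 1e-6
numeric = np.zeros_like(a2)
for j in range(a2.size):
    bumped = a2.copy()
    bumped[j] += eps
    numeric[j] = (mse_loss(bumped, y) - mse_loss(a2, y)) / eps
# analytic and numeric should agree to roughly 1e-6.
```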

Step 2: Gradient w.r.t. Output Pre-activation

Next, we compute how the loss changes with respect to the pre-activation values of the output layer:

$\frac{\partial \mathcal{L}}{\partial z_2} = \frac{\partial \mathcal{L}}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} = (a_2 - y) \odot \sigma'(z_2)$

Derivation: Using the chain rule, we have: $\frac{\partial \mathcal{L}}{\partial z_2} = \frac{\partial \mathcal{L}}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2}$

We already computed $\frac{\partial \mathcal{L}}{\partial a_2} = a_2 - y$.

For $\frac{\partial a_2}{\partial z_2}$, recall that $a_2 = \sigma(z_2)$. The derivative of the sigmoid function is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

Therefore: $\frac{\partial a_2}{\partial z_2} = \sigma'(z_2) = \sigma(z_2) \odot (1 - \sigma(z_2)) = a_2 \odot (1 - a_2)$

Combining these, we get: $\frac{\partial \mathcal{L}}{\partial z_2} = (a_2 - y) \odot \sigma'(z_2) = (a_2 - y) \odot a_2 \odot (1 - a_2)$

Where $\odot$ denotes element-wise multiplication.
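
Continuing the running NumPy sketch (with `a2` and `y` from the forward pass), this is a single element-wise line; `delta2` is our name for $\frac{\partial \mathcal{L}}{\partial z_2}$:

```python
# dL/dz2 = (a2 - y) * sigma'(z2), using sigma'(z2) = a2 * (1 - a2).
delta2 = (a2 - y) * a2 * (1.0 - a2)
```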

Step 3: Gradients for Output Weights and Bias

Now we compute the gradients with respect to the weights and biases of the output layer:

$\frac{\partial \mathcal{L}}{\partial W_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot \frac{\partial z_2}{\partial W_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot a_1^T$

$\frac{\partial \mathcal{L}}{\partial b_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot \frac{\partial z_2}{\partial b_2} = \frac{\partial \mathcal{L}}{\partial z_2}$

Derivation:

For weights: $\frac{\partial \mathcal{L}}{\partial W_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot \frac{\partial z_2}{\partial W_2}$

Recall that $z_2 = W_2 a_1 + b_2$. For a specific weight $(W_2)_{ij}$: $\frac{\partial (z_2)_i}{\partial (W_2)_{ij}} = (a_1)_j$

This can be written in matrix form as: $\frac{\partial z_2}{\partial W_2} = a_1^T$

Therefore: $\frac{\partial \mathcal{L}}{\partial W_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot a_1^T$

This means each element is $\frac{\partial \mathcal{L}}{\partial (W_2)_{ij}} = \frac{\partial \mathcal{L}}{\partial (z_2)_i} \cdot (a_1)_j$

For biases: $\frac{\partial \mathcal{L}}{\partial b_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot \frac{\partial z_2}{\partial b_2}$

Since $\frac{\partial z_2}{\partial b_2} = 1$ (element-wise), we have: $\frac{\partial \mathcal{L}}{\partial b_2} = \frac{\partial \mathcal{L}}{\partial z_2}$
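
In the running sketch this is an outer product of `delta2` with `a1` (names continue from the earlier snippets; a sketch, not a full implementation):

```python
# dL/dW2[i, j] = delta2[i] * a1[j]  ->  outer product of delta2 and a1.
dW2 = np.outer(delta2, a1)
# dL/db2 = delta2 (the bias gradient is just the error term).
db2 = delta2
```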

Step 4: Backprop to Hidden Layer

Next, we compute how the loss changes with respect to the hidden layer activations and pre-activations:

$\frac{\partial \mathcal{L}}{\partial a_1} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} = W_2^T \cdot \frac{\partial \mathcal{L}}{\partial z_2}$

$\frac{\partial \mathcal{L}}{\partial z_1} = \frac{\partial \mathcal{L}}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} = \left( \frac{\partial \mathcal{L}}{\partial a_1} \right) \odot \sigma'(z_1)$

Derivation:

For hidden activations: $\frac{\partial \mathcal{L}}{\partial a_1} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1}$

Recall that $z_2 = W_2 a_1 + b_2$. We have: $\frac{\partial (z_2)_i}{\partial (a_1)_j} = (W_2)_{ij}$

In matrix form: $\frac{\partial z_2}{\partial a_1} = W_2$

However, each component $(a_1)_j$ feeds into every output unit, so its gradient must sum the contributions from all of them: $\frac{\partial \mathcal{L}}{\partial (a_1)_j} = \sum_{i} (W_2)_{ij} \frac{\partial \mathcal{L}}{\partial (z_2)_i}$. In matrix form this summation is exactly multiplication by the transpose: $\frac{\partial \mathcal{L}}{\partial a_1} = W_2^T \cdot \frac{\partial \mathcal{L}}{\partial z_2}$

For hidden pre-activations: $\frac{\partial \mathcal{L}}{\partial z_1} = \frac{\partial \mathcal{L}}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1}$

Since $a_1 = \sigma(z_1)$, we have: $\frac{\partial a_1}{\partial z_1} = \sigma'(z_1) = \sigma(z_1) \odot (1 - \sigma(z_1)) = a_1 \odot (1 - a_1)$

Therefore: $\frac{\partial \mathcal{L}}{\partial z_1} = \left( \frac{\partial \mathcal{L}}{\partial a_1} \right) \odot \sigma'(z_1) = \left( W_2^T \cdot \frac{\partial \mathcal{L}}{\partial z_2} \right) \odot \sigma'(z_1)$
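
In the running sketch, the error is pushed back through $W_2^T$ and then gated element-wise by the sigmoid derivative, reusing the stored activation `a1` rather than recomputing $\sigma'(z_1)$:

```python
# dL/da1 = W2^T @ delta2; then dL/dz1 = dL/da1 * sigma'(z1) = dL/da1 * a1 * (1 - a1).
dA1    = W2.T @ delta2
delta1 = dA1 * a1 * (1.0 - a1)
```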

Step 5: Gradients for Hidden Weights and Bias

Finally, we compute the gradients for the weights and biases of the hidden layer:

$\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot x^T$

$\frac{\partial \mathcal{L}}{\partial b_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1} = \frac{\partial \mathcal{L}}{\partial z_1}$

Derivation:

For weights: $\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1}$

Recall that $z_1 = W_1 x + b_1$. For a specific weight $(W_1)_{ij}$: $\frac{\partial (z_1)_i}{\partial (W_1)_{ij}} = x_j$

In matrix form: $\frac{\partial z_1}{\partial W_1} = x^T$

Therefore: $\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot x^T$

This means each element is $\frac{\partial \mathcal{L}}{\partial (W_1)_{ij}} = \frac{\partial \mathcal{L}}{\partial (z_1)_i} \cdot x_j$

For biases: $\frac{\partial \mathcal{L}}{\partial b_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1}$

Since $\frac{\partial z_1}{\partial b_1} = 1$ (element-wise), we have: $\frac{\partial \mathcal{L}}{\partial b_1} = \frac{\partial \mathcal{L}}{\partial z_1}$
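
The final gradients mirror Step 3, with the input `x` taking the place of `a1` in the running sketch:

```python
# dL/dW1[i, j] = delta1[i] * x[j]  ->  outer product of delta1 and x.
dW1 = np.outer(delta1, x)
# dL/db1 = delta1.
db1 = delta1
```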

Gradient Descent Update Rule

Once we have computed all the gradients, we update the weights and biases using gradient descent:

$W \leftarrow W - \eta \cdot \frac{\partial \mathcal{L}}{\partial W}$

$b \leftarrow b - \eta \cdot \frac{\partial \mathcal{L}}{\partial b}$

Where $\eta$ is the learning rate, a hyperparameter that controls the step size.

Specifically:

  • $W_1 \leftarrow W_1 - \eta \cdot \frac{\partial \mathcal{L}}{\partial W_1}$
  • $b_1 \leftarrow b_1 - \eta \cdot \frac{\partial \mathcal{L}}{\partial b_1}$
  • $W_2 \leftarrow W_2 - \eta \cdot \frac{\partial \mathcal{L}}{\partial W_2}$
  • $b_2 \leftarrow b_2 - \eta \cdot \frac{\partial \mathcal{L}}{\partial b_2}$

The negative sign is there because we step opposite to the gradient, in the direction of decreasing loss.
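
A single gradient-descent step on the running sketch might look like this (the learning-rate value is arbitrary, chosen only for illustration):

```python
eta = 0.1  # learning rate (hypothetical value)

W1 -= eta * dW1
b1 -= eta * db1
W2 -= eta * dW2
b2 -= eta * db2
```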

Computational Efficiency

Backpropagation is computationally efficient because:

  1. It reuses intermediate results: each layer's error term $\frac{\partial \mathcal{L}}{\partial z}$ is computed once and then reused for that layer's weight, bias, and upstream gradients
  2. It computes all gradients in a single backward sweep, layer by layer, rather than differentiating the loss with respect to each parameter independently
  3. Most operations are vectorized matrix operations, which map efficiently onto optimized linear-algebra routines

Extension to Deep Networks

The same principles extend to networks with more layers:

  1. Perform a forward pass, storing all intermediate activations
  2. Compute the output error
  3. Propagate the error backward layer by layer
  4. Update all weights and biases using their respective gradients (see the sketch after this list)
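
The loop below is a minimal sketch of these four steps for an arbitrary number of fully connected sigmoid layers with the MSE loss used above. It assumes NumPy, the `sigmoid` helper from earlier, and lists `weights`/`biases` of per-layer parameters; it is illustrative, not an optimized implementation.

```python
def backprop(x, y, weights, biases):
    # Forward pass, storing every activation (activations[l] is the input to layer l).
    activations = [x]
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
        activations.append(a)

    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)

    # Output error: (a_L - y) * sigma'(z_L), using sigma'(z) = a * (1 - a).
    delta = (activations[-1] - y) * activations[-1] * (1.0 - activations[-1])
    grads_W[-1] = np.outer(delta, activations[-2])
    grads_b[-1] = delta

    # Propagate the error backward through the remaining layers.
    for l in range(len(weights) - 2, -1, -1):
        a_l = activations[l + 1]                     # activation of layer l
        delta = (weights[l + 1].T @ delta) * a_l * (1.0 - a_l)
        grads_W[l] = np.outer(delta, activations[l])
        grads_b[l] = delta

    return grads_W, grads_b
```

For the two-layer network used throughout this section, this loop reproduces exactly the quantities derived in Steps 1 through 5.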