Backpropagation

Backpropagation is the fundamental algorithm that enables neural networks to learn. It efficiently computes the gradient of the loss function with respect to each weight and bias parameter in a neural network. These gradients indicate how much each parameter contributes to the error, allowing us to update them using gradient descent to minimize the loss.

Simple Feedforward Network Setup

Architecture:

  • An input layer with $n$ features
  • One hidden layer with $h$ units
  • An output layer with $m$ units

For clarity, let's assume:

  • Both the hidden layer and the output layer use the sigmoid activation function
  • The loss is the squared-error (MSE) loss
  • We process a single training example at a time

Notation

Let's define our variables clearly:

  • $x \in \mathbb{R}^n$: the input vector
  • $y \in \mathbb{R}^m$: the target output vector
  • $W_1 \in \mathbb{R}^{h \times n}$, $b_1 \in \mathbb{R}^h$: the hidden layer's weights and biases
  • $z_1 = W_1 x + b_1$: the hidden layer's pre-activation
  • $a_1 = \sigma(z_1)$: the hidden layer's activation
  • $W_2 \in \mathbb{R}^{m \times h}$, $b_2 \in \mathbb{R}^m$: the output layer's weights and biases
  • $z_2 = W_2 a_1 + b_2$: the output layer's pre-activation
  • $a_2 = \sigma(z_2)$: the network's output (the prediction)
  • $\mathcal{L}$: the loss

Where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid activation function.
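
As a concrete reference, here is a minimal NumPy sketch of the sigmoid and its derivative (the names `sigmoid` and `sigmoid_prime` are our own; the later sketches reuse them):

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid: 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Sigmoid derivative: sigma'(z) = sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)
```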

Forward Pass (Step-by-Step)

  1. Input Layer:

    • Take the input vector $x \in \mathbb{R}^n$
  2. Hidden Layer Computation:

    • Compute the weighted sum plus bias: $z_1 = W_1 x + b_1$
      • $W_1 x$ is a matrix-vector multiplication
      • Each element: $(z_1)_i = \sum_{j=1}^{n} (W_1)_{ij} x_j + (b_1)_i$
    • Apply the activation function: $a_1 = \sigma(z_1)$
      • Applied element-wise: $(a_1)_i = \sigma((z_1)_i) = \frac{1}{1 + e^{-(z_1)_i}}$
  3. Output Layer Computation:

    • Compute the weighted sum plus bias: $z_2 = W_2 a_1 + b_2$
      • Each element: $(z_2)_i = \sum_{j=1}^{h} (W_2)_{ij} (a_1)_j + (b_2)_i$
    • Apply the activation function: $a_2 = \sigma(z_2)$
      • Applied element-wise: $(a_2)_i = \sigma((z_2)_i) = \frac{1}{1 + e^{-(z_2)_i}}$
  4. Loss Calculation:

    • Compute the MSE loss: $\mathcal{L} = \frac{1}{2} \| a_2 - y \|^2 = \frac{1}{2} \sum_{i=1}^{m} ((a_2)_i - y_i)^2$ (these four steps are sketched in code after this list)
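
A minimal NumPy sketch of this forward pass (assuming the `sigmoid` helper defined earlier; `forward` and `mse_loss` are our own names, and shapes follow the notation above, e.g. $W_1 \in \mathbb{R}^{h \times n}$):

```python
def forward(x, W1, b1, W2, b2):
    # Hidden layer: z1 = W1 x + b1, a1 = sigmoid(z1).
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    # Output layer: z2 = W2 a1 + b2, a2 = sigmoid(z2).
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    # Return the intermediates too: backpropagation reuses them.
    return z1, a1, z2, a2

def mse_loss(a2, y):
    # L = 1/2 * ||a2 - y||^2
    return 0.5 * np.sum((a2 - y) ** 2)
```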

Backpropagation Steps

Backpropagation uses the chain rule of calculus to compute gradients, working backward from the output to the input.

Step 1: Gradient w.r.t. Output Activation

We begin by computing how the loss changes with respect to the output activations:

$\frac{\partial \mathcal{L}}{\partial a_2} = a_2 - y$

Derivation: $\mathcal{L} = \frac{1}{2} \| a_2 - y \|^2 = \frac{1}{2} \sum_{i=1}^{m} ((a_2)_i - y_i)^2$

Taking the derivative with respect to $(a_2)_j$: $\frac{\partial \mathcal{L}}{\partial (a_2)_j} = \frac{\partial}{\partial (a_2)_j} \left( \frac{1}{2} \sum_{i=1}^{m} ((a_2)_i - y_i)^2 \right)$

Only the term where $i = j$ contributes to the derivative: $\frac{\partial \mathcal{L}}{\partial (a_2)_j} = \frac{1}{2} \cdot 2 \cdot ((a_2)_j - y_j) \cdot 1 = (a_2)_j - y_j$

Therefore, in vector form: $\frac{\partial \mathcal{L}}{\partial a_2} = a_2 - y$
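
As a quick sanity check, a finite-difference approximation should agree with $a_2 - y$. The values below are made up for illustration, and `mse_loss` is the helper from the forward-pass sketch:

```python
a2 = np.array([0.8, 0.2])
y  = np.array([1.0, 0.0])

analytic = a2 - y                      # dL/da2 from the formula above
eps = 1e-6
numeric = np.zeros_like(a2)
for j in range(a2.size):
    bumped = a2.copy()
    bumped[j] += eps
    numeric[j] = (mse_loss(bumped, y) - mse_loss(a2, y)) / eps
# analytic and numeric should agree to roughly 1e-6.
```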

Step 2: Gradient w.r.t. Output Pre-activation

Next, we compute how the loss changes with respect to the pre-activation values of the output layer:

$\frac{\partial \mathcal{L}}{\partial z_2} = \frac{\partial \mathcal{L}}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} = (a_2 - y) \odot \sigma'(z_2)$

Derivation: Using the chain rule, we have: $\frac{\partial \mathcal{L}}{\partial z_2} = \frac{\partial \mathcal{L}}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2}$

We already computed $\frac{\partial \mathcal{L}}{\partial a_2} = a_2 - y$.

For $\frac{\partial a_2}{\partial z_2}$, recall that $a_2 = \sigma(z_2)$. The derivative of the sigmoid function is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

Therefore: $\frac{\partial a_2}{\partial z_2} = \sigma'(z_2) = \sigma(z_2) \odot (1 - \sigma(z_2)) = a_2 \odot (1 - a_2)$

Combining these, we get: $\frac{\partial \mathcal{L}}{\partial z_2} = (a_2 - y) \odot \sigma'(z_2) = (a_2 - y) \odot a_2 \odot (1 - a_2)$

Where $\odot$ denotes element-wise multiplication.
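
Continuing the running NumPy sketch (with `a2` and `y` from the forward pass), this is a single element-wise line; `delta2` is our name for $\frac{\partial \mathcal{L}}{\partial z_2}$:

```python
# dL/dz2 = (a2 - y) * sigma'(z2), using sigma'(z2) = a2 * (1 - a2).
delta2 = (a2 - y) * a2 * (1.0 - a2)
```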

Step 3: Gradients for Output Weights and Bias

Now we compute the gradients with respect to the weights and biases of the output layer:

$\frac{\partial \mathcal{L}}{\partial W_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot \frac{\partial z_2}{\partial W_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot a_1^T$

$\frac{\partial \mathcal{L}}{\partial b_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot \frac{\partial z_2}{\partial b_2} = \frac{\partial \mathcal{L}}{\partial z_2}$

Derivation:

For weights: $\frac{\partial \mathcal{L}}{\partial W_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot \frac{\partial z_2}{\partial W_2}$

Recall that $z_2 = W_2 a_1 + b_2$. For a specific weight $(W_2)_{ij}$: $\frac{\partial (z_2)_i}{\partial (W_2)_{ij}} = (a_1)_j$

This can be written in matrix form as: $\frac{\partial z_2}{\partial W_2} = a_1^T$

Therefore: $\frac{\partial \mathcal{L}}{\partial W_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot a_1^T$

This means each element is $\frac{\partial \mathcal{L}}{\partial (W_2)_{ij}} = \frac{\partial \mathcal{L}}{\partial (z_2)_i} \cdot (a_1)_j$

For biases: $\frac{\partial \mathcal{L}}{\partial b_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot \frac{\partial z_2}{\partial b_2}$

Since $\frac{\partial z_2}{\partial b_2} = 1$ (element-wise), we have: $\frac{\partial \mathcal{L}}{\partial b_2} = \frac{\partial \mathcal{L}}{\partial z_2}$
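
In the running sketch this is an outer product of `delta2` with `a1` (names continue from the earlier snippets; a sketch, not a full implementation):

```python
# dL/dW2[i, j] = delta2[i] * a1[j]  ->  outer product of delta2 and a1.
dW2 = np.outer(delta2, a1)
# dL/db2 = delta2 (the bias gradient is just the error term).
db2 = delta2
```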

Step 4: Backprop to Hidden Layer

Next, we compute how the loss changes with respect to the hidden layer activations and pre-activations:

$\frac{\partial \mathcal{L}}{\partial a_1} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} = W_2^T \cdot \frac{\partial \mathcal{L}}{\partial z_2}$

$\frac{\partial \mathcal{L}}{\partial z_1} = \frac{\partial \mathcal{L}}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} = \left( \frac{\partial \mathcal{L}}{\partial a_1} \right) \odot \sigma'(z_1)$

Derivation:

For hidden activations: $\frac{\partial \mathcal{L}}{\partial a_1} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1}$

Recall that $z_2 = W_2 a_1 + b_2$. We have: $\frac{\partial (z_2)_i}{\partial (a_1)_j} = (W_2)_{ij}$

In matrix form: $\frac{\partial z_2}{\partial a_1} = W_2$

However, each component $(a_1)_j$ feeds into every output unit, so its gradient must sum the contributions from all of them: $\frac{\partial \mathcal{L}}{\partial (a_1)_j} = \sum_{i} (W_2)_{ij} \frac{\partial \mathcal{L}}{\partial (z_2)_i}$. In matrix form this summation is exactly multiplication by the transpose: $\frac{\partial \mathcal{L}}{\partial a_1} = W_2^T \cdot \frac{\partial \mathcal{L}}{\partial z_2}$

For hidden pre-activations: $\frac{\partial \mathcal{L}}{\partial z_1} = \frac{\partial \mathcal{L}}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1}$

Since $a_1 = \sigma(z_1)$, we have: $\frac{\partial a_1}{\partial z_1} = \sigma'(z_1) = \sigma(z_1) \odot (1 - \sigma(z_1)) = a_1 \odot (1 - a_1)$

Therefore: $\frac{\partial \mathcal{L}}{\partial z_1} = \left( \frac{\partial \mathcal{L}}{\partial a_1} \right) \odot \sigma'(z_1) = \left( W_2^T \cdot \frac{\partial \mathcal{L}}{\partial z_2} \right) \odot \sigma'(z_1)$
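
In the running sketch, the error is pushed back through $W_2^T$ and then gated element-wise by the sigmoid derivative, reusing the stored activation `a1` rather than recomputing $\sigma'(z_1)$:

```python
# dL/da1 = W2^T @ delta2; then dL/dz1 = dL/da1 * sigma'(z1) = dL/da1 * a1 * (1 - a1).
dA1    = W2.T @ delta2
delta1 = dA1 * a1 * (1.0 - a1)
```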

Step 5: Gradients for Hidden Weights and Bias

Finally, we compute the gradients for the weights and biases of the hidden layer:

$\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot x^T$

$\frac{\partial \mathcal{L}}{\partial b_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1} = \frac{\partial \mathcal{L}}{\partial z_1}$

Derivation:

For weights: $\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1}$

Recall that $z_1 = W_1 x + b_1$. For a specific weight $(W_1)_{ij}$: $\frac{\partial (z_1)_i}{\partial (W_1)_{ij}} = x_j$

In matrix form: $\frac{\partial z_1}{\partial W_1} = x^T$

Therefore: $\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot x^T$

This means each element is $\frac{\partial \mathcal{L}}{\partial (W_1)_{ij}} = \frac{\partial \mathcal{L}}{\partial (z_1)_i} \cdot x_j$

For biases: $\frac{\partial \mathcal{L}}{\partial b_1} = \frac{\partial \mathcal{L}}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1}$

Since $\frac{\partial z_1}{\partial b_1} = 1$ (element-wise), we have: $\frac{\partial \mathcal{L}}{\partial b_1} = \frac{\partial \mathcal{L}}{\partial z_1}$
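
The final gradients mirror Step 3, with the input `x` taking the place of `a1` in the running sketch:

```python
# dL/dW1[i, j] = delta1[i] * x[j]  ->  outer product of delta1 and x.
dW1 = np.outer(delta1, x)
# dL/db1 = delta1.
db1 = delta1
```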

Gradient Descent Update Rule

Once we have computed all the gradients, we update the weights and biases using gradient descent:

$W \leftarrow W - \eta \cdot \frac{\partial \mathcal{L}}{\partial W}$

$b \leftarrow b - \eta \cdot \frac{\partial \mathcal{L}}{\partial b}$

Where $\eta$ is the learning rate, a hyperparameter that controls the step size.

Specifically:

  • $W_1 \leftarrow W_1 - \eta \cdot \frac{\partial \mathcal{L}}{\partial W_1}$
  • $b_1 \leftarrow b_1 - \eta \cdot \frac{\partial \mathcal{L}}{\partial b_1}$
  • $W_2 \leftarrow W_2 - \eta \cdot \frac{\partial \mathcal{L}}{\partial W_2}$
  • $b_2 \leftarrow b_2 - \eta \cdot \frac{\partial \mathcal{L}}{\partial b_2}$

The negative sign is there because we step opposite to the gradient, in the direction of decreasing loss.
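
A single gradient-descent step on the running sketch might look like this (the learning-rate value is arbitrary, chosen only for illustration):

```python
eta = 0.1  # learning rate (hypothetical value)

W1 -= eta * dW1
b1 -= eta * db1
W2 -= eta * dW2
b2 -= eta * db2
```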

Computational Efficiency

Backpropagation is computationally efficient because:

  1. It reuses intermediate results: each layer's error term $\frac{\partial \mathcal{L}}{\partial z}$ is computed once and then reused for that layer's weight, bias, and upstream gradients
  2. It computes all gradients in a single backward sweep, layer by layer, rather than differentiating the loss with respect to each parameter independently
  3. Most operations are vectorized matrix operations, which map efficiently onto optimized linear-algebra routines

Extension to Deep Networks

The same principles extend to networks with more layers:

  1. Perform a forward pass, storing all intermediate activations
  2. Compute the output error
  3. Propagate the error backward layer by layer
  4. Update all weights and biases using their respective gradients (see the sketch after this list)
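
The loop below is a minimal sketch of these four steps for an arbitrary number of fully connected sigmoid layers with the MSE loss used above. It assumes NumPy, the `sigmoid` helper from earlier, and lists `weights`/`biases` of per-layer parameters; it is illustrative, not an optimized implementation.

```python
def backprop(x, y, weights, biases):
    # Forward pass, storing every activation (activations[l] is the input to layer l).
    activations = [x]
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
        activations.append(a)

    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)

    # Output error: (a_L - y) * sigma'(z_L), using sigma'(z) = a * (1 - a).
    delta = (activations[-1] - y) * activations[-1] * (1.0 - activations[-1])
    grads_W[-1] = np.outer(delta, activations[-2])
    grads_b[-1] = delta

    # Propagate the error backward through the remaining layers.
    for l in range(len(weights) - 2, -1, -1):
        a_l = activations[l + 1]                     # activation of layer l
        delta = (weights[l + 1].T @ delta) * a_l * (1.0 - a_l)
        grads_W[l] = np.outer(delta, activations[l])
        grads_b[l] = delta

    return grads_W, grads_b
```

For the two-layer network used throughout this section, this loop reproduces exactly the quantities derived in Steps 1 through 5.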