Backpropagation
Backpropagation is the fundamental algorithm that enables neural networks to learn. It efficiently computes the gradient of the loss function with respect to every weight and bias parameter in the network. These gradients tell us how a small change in each parameter would change the loss, allowing us to update the parameters with gradient descent to minimize it.
Simple Feedforward Network Setup
Architecture:
- Input layer → Hidden layer (sigmoid activation) → Output layer (sigmoid activation)
- Loss function: Mean Squared Error (MSE)
For clarity, let's assume:
- Input dimension: n
- Hidden layer dimension: h
- Output dimension: m
Notation
Let's define our variables clearly:
- x ∈ R^n: Input vector (n-dimensional)
- W1 ∈ R^{h×n}: Weight matrix connecting input to hidden layer
- b1 ∈ R^h: Bias vector for hidden layer
- z1 = W1 x + b1 ∈ R^h: Pre-activation values for hidden layer
- a1 = σ(z1) ∈ R^h: Activation values for hidden layer
- W2 ∈ R^{m×h}: Weight matrix connecting hidden to output layer
- b2 ∈ R^m: Bias vector for output layer
- z2 = W2 a1 + b2 ∈ R^m: Pre-activation values for output layer
- a2 = σ(z2) ∈ R^m: Activation values for output layer (final output)
- y ∈ R^m: True output (target)
- L = (1/2) ‖a2 − y‖^2: Loss function (Mean Squared Error)
Where σ(z) = 1 / (1 + e^{−z}) is the sigmoid activation function.
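For reference, here is a minimal NumPy sketch of this setup. The concrete dimensions (n = 3, h = 4, m = 2), the function names, and the random initialization scale are illustrative assumptions, not something fixed by the text:

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative dimensions: n inputs, h hidden units, m outputs.
n, h, m = 3, 4, 2
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(h, n))   # W1 in R^{h x n}
b1 = np.zeros(h)                          # b1 in R^h
W2 = rng.normal(scale=0.1, size=(m, h))   # W2 in R^{m x h}
b2 = np.zeros(m)                          # b2 in R^m
```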
Forward Pass (Step-by-Step)
- Input Layer:
  - Take input vector x ∈ R^n
- Hidden Layer Computation:
  - Compute weighted sum plus bias: z1 = W1 x + b1
    - W1 x is a matrix-vector multiplication
    - Each element: (z1)_i = Σ_{j=1}^{n} (W1)_{ij} x_j + (b1)_i
  - Apply activation function: a1 = σ(z1)
    - Applied element-wise: (a1)_i = σ((z1)_i) = 1 / (1 + e^{−(z1)_i})
- Output Layer Computation:
  - Compute weighted sum plus bias: z2 = W2 a1 + b2
    - Each element: (z2)_i = Σ_{j=1}^{h} (W2)_{ij} (a1)_j + (b2)_i
  - Apply activation function: a2 = σ(z2)
    - Applied element-wise: (a2)_i = σ((z2)_i) = 1 / (1 + e^{−(z2)_i})
- Loss Calculation:
  - Compute the MSE loss: L = (1/2) ‖a2 − y‖^2 = (1/2) Σ_{i=1}^{m} ((a2)_i − y_i)^2
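A compact NumPy sketch of this forward pass follows; the function names (`forward`, `mse_loss`) are illustrative, and the parameter shapes are those defined in the Notation section:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Run the two-layer forward pass and return every intermediate value,
    since backpropagation will need z1, a1, z2, and a2."""
    z1 = W1 @ x + b1      # hidden pre-activation
    a1 = sigmoid(z1)      # hidden activation
    z2 = W2 @ a1 + b2     # output pre-activation
    a2 = sigmoid(z2)      # network output
    return z1, a1, z2, a2

def mse_loss(a2, y):
    """L = 1/2 * ||a2 - y||^2"""
    return 0.5 * np.sum((a2 - y) ** 2)
```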
Backpropagation Steps
Backpropagation uses the chain rule of calculus to compute gradients, working backward from the output to the input.
Step 1: Gradient w.r.t. Output Activation
We begin by computing how the loss changes with respect to the output activations:
∂L/∂a2 = a2 − y
Derivation:
L = (1/2) ‖a2 − y‖^2 = (1/2) Σ_{i=1}^{m} ((a2)_i − y_i)^2
Taking the derivative with respect to (a2)_j:
∂L/∂(a2)_j = ∂/∂(a2)_j [ (1/2) Σ_{i=1}^{m} ((a2)_i − y_i)^2 ]
Only the term where i=j contributes to the derivative:
∂L/∂(a2)_j = (1/2) ⋅ 2 ⋅ ((a2)_j − y_j) ⋅ 1 = (a2)_j − y_j
Therefore, in vector form: ∂L/∂a2 = a2 − y
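In code this step is a single vector subtraction; the toy values below are purely illustrative:

```python
import numpy as np

# Toy example with m = 2 outputs (values chosen only for illustration).
a2 = np.array([0.7, 0.2])   # network output from the forward pass
y  = np.array([1.0, 0.0])   # target

dL_da2 = a2 - y             # gradient of the MSE loss w.r.t. the output activations
# dL_da2 is approximately [-0.3, 0.2]
```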
Step 2: Gradient w.r.t. Output Pre-activation
Next, we compute how the loss changes with respect to the pre-activation values of the output layer:
∂L/∂z2 = ∂L/∂a2 ⋅ ∂a2/∂z2 = (a2 − y) ⊙ σ′(z2)
Derivation:
Using the chain rule, we have:
∂L/∂z2 = ∂L/∂a2 ⋅ ∂a2/∂z2
We already computed ∂L/∂a2 = a2 − y.
For ∂a2/∂z2, recall that a2 = σ(z2). The derivative of the sigmoid function is:
σ′(z) = σ(z)(1 − σ(z))
Therefore:
∂a2/∂z2 = σ′(z2) = σ(z2) ⊙ (1 − σ(z2)) = a2 ⊙ (1 − a2)
Combining these, we get:
∂L/∂z2 = (a2 − y) ⊙ σ′(z2) = (a2 − y) ⊙ a2 ⊙ (1 − a2)
Where ⊙ denotes element-wise multiplication.
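Continuing the toy example from Step 1, this is one element-wise multiplication (variable names are illustrative):

```python
import numpy as np

# Same toy values as in Step 1 (m = 2).
a2 = np.array([0.7, 0.2])
y  = np.array([1.0, 0.0])

dL_da2 = a2 - y
dL_dz2 = dL_da2 * a2 * (1.0 - a2)   # (a2 - y) ⊙ a2 ⊙ (1 - a2), using sigma'(z2) = a2(1 - a2)
```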
Step 3: Gradients for Output Weights and Bias
Now we compute the gradients with respect to the weights and biases of the output layer:
∂L/∂W2 = ∂L/∂z2 ⋅ ∂z2/∂W2 = (∂L/∂z2) ⋅ a1^T
∂L/∂b2 = ∂L/∂z2 ⋅ ∂z2/∂b2 = ∂L/∂z2
Derivation:
For weights:
∂L/∂W2 = ∂L/∂z2 ⋅ ∂z2/∂W2
Recall that z2 = W2 a1 + b2. For a specific weight (W2)_ij:
∂(z2)_i/∂(W2)_ij = (a1)_j
This can be written in matrix form as:
∂z2/∂W2 = a1^T
Therefore:
∂L/∂W2 = (∂L/∂z2) ⋅ a1^T
This means each element ∂L/∂(W2)_ij = ∂L/∂(z2)_i ⋅ (a1)_j
For biases:
∂L/∂b2 = ∂L/∂z2 ⋅ ∂z2/∂b2
Since ∂z2/∂b2 = 1 (element-wise), we have:
∂L/∂b2 = ∂L/∂z2
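The weight gradient is an outer product; a minimal sketch with illustrative shapes (m = 2, h = 3) and made-up values:

```python
import numpy as np

# Illustrative values: dL_dz2 from Step 2, a1 from the forward pass.
dL_dz2 = np.array([-0.063, 0.032])   # shape (m,) with m = 2
a1 = np.array([0.5, 0.8, 0.1])       # shape (h,) with h = 3

dL_dW2 = np.outer(dL_dz2, a1)        # shape (m, h): each element is dL/d(z2)_i * (a1)_j
dL_db2 = dL_dz2                      # bias gradient equals the pre-activation gradient
```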
Step 4: Backprop to Hidden Layer
Next, we compute how the loss changes with respect to the hidden layer activations and pre-activations:
∂L/∂a1 = ∂L/∂z2 ⋅ ∂z2/∂a1 = W2^T ⋅ ∂L/∂z2
∂L/∂z1 = ∂L/∂a1 ⋅ ∂a1/∂z1 = (∂L/∂a1) ⊙ σ′(z1)
Derivation:
For hidden activations:
∂L/∂a1 = ∂L/∂z2 ⋅ ∂z2/∂a1
Recall that z2 = W2 a1 + b2. We have:
∂(z2)_i/∂(a1)_j = (W2)_ij
In matrix form:
∂z2/∂a1 = W2
Each hidden activation (a1)_j feeds into every output component, so its gradient sums the contributions from all of them: ∂L/∂(a1)_j = Σ_{i=1}^{m} ∂L/∂(z2)_i ⋅ (W2)_ij. Written as a matrix-vector product, this requires the transpose:
∂L/∂a1 = W2^T ⋅ ∂L/∂z2
For hidden pre-activations:
∂L/∂z1 = ∂L/∂a1 ⋅ ∂a1/∂z1
Since a1 = σ(z1), we have:
∂a1/∂z1 = σ′(z1) = σ(z1) ⊙ (1 − σ(z1)) = a1 ⊙ (1 − a1)
Therefore:
∂L/∂z1 = (∂L/∂a1) ⊙ σ′(z1) = (W2^T ⋅ ∂L/∂z2) ⊙ σ′(z1)
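A sketch of this step with illustrative shapes (m = 2, h = 3); `W2` and the gradient values are made up for the example:

```python
import numpy as np

# Illustrative values: W2 and dL_dz2 as they would exist after the forward pass and Step 2.
W2 = np.array([[0.1, -0.2, 0.3],
               [0.4,  0.0, -0.1]])      # shape (m, h) = (2, 3)
dL_dz2 = np.array([-0.063, 0.032])
a1 = np.array([0.5, 0.8, 0.1])          # hidden activations, a1 = sigmoid(z1)

dL_da1 = W2.T @ dL_dz2                  # backpropagate through z2 = W2 a1 + b2
dL_dz1 = dL_da1 * a1 * (1.0 - a1)       # multiply by sigma'(z1) = a1 ⊙ (1 - a1)
```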
Step 5: Gradients for Hidden Weights and Bias
Finally, we compute the gradients for the weights and biases of the hidden layer:
∂L/∂W1 = ∂L/∂z1 ⋅ ∂z1/∂W1 = (∂L/∂z1) ⋅ x^T
∂L/∂b1 = ∂L/∂z1 ⋅ ∂z1/∂b1 = ∂L/∂z1
Derivation:
For weights:
∂L/∂W1 = ∂L/∂z1 ⋅ ∂z1/∂W1
Recall that z1 = W1 x + b1. For a specific weight (W1)_ij:
∂(z1)_i/∂(W1)_ij = x_j
In matrix form:
∂z1/∂W1 = x^T
Therefore:
∂L/∂W1 = (∂L/∂z1) ⋅ x^T
This means each element ∂L/∂(W1)_ij = ∂L/∂(z1)_i ⋅ x_j
For biases:
∂L/∂b1 = ∂L/∂z1 ⋅ ∂z1/∂b1
Since ∂z1/∂b1 = 1 (element-wise), we have:
∂L/∂b1 = ∂L/∂z1
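As in Step 3, this is an outer product against the layer's input, here the input vector x (illustrative shapes: h = 3, n = 2):

```python
import numpy as np

# Illustrative values: dL_dz1 from Step 4, x is the original input.
dL_dz1 = np.array([0.01, -0.02, 0.005])   # shape (h,) with h = 3
x = np.array([1.0, -1.0])                 # shape (n,) with n = 2

dL_dW1 = np.outer(dL_dz1, x)              # shape (h, n): each element is dL/d(z1)_i * x_j
dL_db1 = dL_dz1
```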
Gradient Descent Update Rule
Once we have computed all the gradients, we update the weights and biases using gradient descent:
W ← W − η ⋅ ∂L/∂W
b ← b − η ⋅ ∂L/∂b
Where η is the learning rate, a hyperparameter that controls the step size.
Specifically:
- W2 ← W2 − η ⋅ ∂L/∂W2
- b2 ← b2 − η ⋅ ∂L/∂b2
- W1 ← W1 − η ⋅ ∂L/∂W1
- b1 ← b1 − η ⋅ ∂L/∂b1
We subtract the gradient because it points in the direction of steepest increase of the loss; stepping in the opposite direction decreases it.
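A small helper that applies this rule; the dictionary-based interface and the learning rate of 0.1 are illustrative choices:

```python
import numpy as np

def sgd_update(params, grads, eta=0.1):
    """One gradient-descent step: p <- p - eta * dL/dp for every parameter."""
    for name in params:
        params[name] -= eta * grads[name]
    return params

# Hypothetical usage, with the gradients computed in Steps 3 and 5:
# params = {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
# grads  = {"W1": dL_dW1, "b1": dL_db1, "W2": dL_dW2, "b2": dL_db2}
# sgd_update(params, grads, eta=0.1)
```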
Computational Efficiency
Backpropagation is computationally efficient because:
- It reuses calculations (intermediate derivatives) as it propagates backward
- It avoids redundant calculations by computing gradients in a specific order
- Most operations are vectorized matrix operations, which can be efficiently implemented
Extension to Deep Networks
The same principles extend to networks with more layers:
- Perform a forward pass, storing all intermediate activations
- Compute the output error
- Propagate the error backward layer by layer
- Update all weights and biases using their respective gradients
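To make the generalization concrete, here is a hedged sketch of one training step for an arbitrary-depth sigmoid network with MSE loss; the function name `train_step` and the list-of-arrays parameter representation are assumptions made for this example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, Ws, bs, eta=0.1):
    """One forward pass, backward pass, and parameter update for a sigmoid
    network with any number of layers. Ws and bs are lists of weight matrices
    and bias vectors with compatible shapes."""
    # 1. Forward pass, storing every activation (activations[l] is the input to layer l).
    activations = [x]
    for W, b in zip(Ws, bs):
        activations.append(sigmoid(W @ activations[-1] + b))

    # 2. Output error for MSE + sigmoid: delta = (a_out - y) ⊙ a_out ⊙ (1 - a_out).
    a_out = activations[-1]
    delta = (a_out - y) * a_out * (1.0 - a_out)

    # 3. Propagate the error backward layer by layer, updating as we go.
    for l in reversed(range(len(Ws))):
        grad_W = np.outer(delta, activations[l])   # dL/dW_l
        grad_b = delta                             # dL/db_l
        if l > 0:
            a_prev = activations[l]                # activation feeding layer l
            delta = (Ws[l].T @ delta) * a_prev * (1.0 - a_prev)
        Ws[l] -= eta * grad_W
        bs[l] -= eta * grad_b
    return Ws, bs
```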