As an educational project, we implemented a simple CNN in PyTorch and visualized the output of each layer to build an intuitive understanding of what's happening. Our Python code trains the CNN on the FashionMNIST dataset, which consists of 60,000 training images, 10,000 test images, and 10 classes of clothing items. Each image is a 28×28-pixel grayscale image, an example of which is shown in Figure 1.
Figure 1: An example image from the FashionMNIST dataset.
For this visualization, we're using a pre-trained network and focusing on how each layer transforms the input data. The first layer we send the image through is a convolutional layer with 16 filters, each with a 3×3 kernel, applied with padding=1. This layer takes in one grayscale image and produces 16 feature maps, where each map is the result of applying a different learned filter to the input.
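Using the standard PyTorch `nn.Conv2d` API, this first layer and its effect on tensor shapes can be sketched as follows (a random tensor stands in for a real FashionMNIST image):

```python
import torch
import torch.nn as nn

# First layer as described above: 16 filters, 3x3 kernel, padding=1.
conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)

# A dummy batch containing one 28x28 grayscale image.
x = torch.randn(1, 1, 28, 28)
out = conv1(x)
print(out.shape)  # torch.Size([1, 16, 28, 28])
```

Because padding=1 adds a one-pixel border before the 3×3 kernel slides over the image, the spatial size stays at 28×28; only the channel count changes from 1 to 16.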
Each convolution filter slides over the image and, at each output position (i, j), computes a weighted sum of the underlying 3×3 input patch plus a learned bias:

output(i, j) = bias + Σₘ Σₙ kernel(m, n) · input(i + m, j + n)

where m and n range over the 3×3 kernel. (Strictly speaking, PyTorch computes a cross-correlation — the kernel is not flipped — but in deep learning this operation is conventionally called a convolution.)
This produces 16 feature maps, which you can browse through in Figure 2 by clicking the left and right arrow buttons.
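The per-position weighted sum can be verified numerically against PyTorch's own `F.conv2d` on a tiny example (the image and filter values here are random, purely for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
img = torch.randn(1, 1, 5, 5)   # one small single-channel "image"
w = torch.randn(1, 1, 3, 3)     # one 3x3 filter
b = torch.zeros(1)              # bias

out = F.conv2d(img, w, b)       # no padding: output is 3x3

# The top-left output value is the filter multiplied element-wise with the
# top-left 3x3 patch of the image, summed, plus the bias.
manual = (img[0, 0, 0:3, 0:3] * w[0, 0]).sum() + b[0]
print(torch.allclose(out[0, 0, 0, 0], manual))  # True
```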
Figure 2: The 16 feature maps from the first convolutional layer. Click the arrows to browse through them.
The next layer is a ReLU activation function, which applies the following function to each element of the feature map:

ReLU(x) = max(0, x)
This introduces non-linearity into the model, allowing it to learn more complex patterns; without it, a stack of convolutions would collapse into a single linear operation. The ReLU is applied element-wise, zeroing out every negative value while leaving positive values unchanged, and the output is shown in Figure 3.
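In PyTorch this element-wise behavior is a one-liner; a small made-up tensor makes the effect obvious:

```python
import torch

x = torch.tensor([[-2.0, -0.5, 0.0],
                  [ 0.5,  1.0, 3.0]])

out = torch.relu(x)  # equivalent to x.clamp(min=0)
print(out)           # negatives become 0.0; positives pass through unchanged
```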
Figure 3: The output of the ReLU activation function applied to the first convolutional layer.
The next layer is a max pooling layer with a kernel size of 2×2, which reduces the spatial dimensions of the feature maps from 28×28 to 14×14. This downsampling reduces the number of parameters and computations in the network while making the feature maps more invariant to small translations. The max pooling operation takes the maximum value in each 2×2 patch and produces a smaller feature map, as shown in Figure 4.
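A tiny hand-picked 4×4 feature map makes the 2×2 max pooling concrete — each non-overlapping 2×2 patch collapses to its largest value (`nn.MaxPool2d`'s stride defaults to the kernel size, so the patches don't overlap):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)

fmap = torch.tensor([[[[1., 2., 5., 6.],
                       [3., 4., 7., 8.],
                       [0., 1., 1., 0.],
                       [9., 2., 3., 4.]]]])

pooled = pool(fmap)
print(pooled)  # each 2x2 patch is replaced by its maximum: [[4, 8], [9, 4]]

# On the real network this halves each of the 16 maps from 28x28 to 14x14:
print(pool(torch.randn(1, 16, 28, 28)).shape)  # torch.Size([1, 16, 14, 14])
```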
Figure 4: The output of the max pooling layer applied to the first convolutional layer.
The next layer is another convolutional layer, which takes the output of the max pooling layer as input. This layer applies 32 convolution filters, each of which spans all 16 input feature maps (its weighted sum runs over every input channel), producing 32 output feature maps. The output of this layer is shown in Figure 5.
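Assuming this layer uses the same 3×3 kernel with padding=1 as the first one (consistent with the 14×14 output size implied later in the text), it can be sketched as:

```python
import torch
import torch.nn as nn

# 16 input channels in, 32 output channels out; 3x3/padding=1 is an
# assumption inferred from the stated feature-map sizes.
conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)

# Input: the 16 pooled 14x14 maps from the previous stage.
x = torch.randn(1, 16, 14, 14)
out = conv2(x)
print(out.shape)  # torch.Size([1, 32, 14, 14])
```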
Figure 5: The output of the second convolutional layer.
The next layer is another ReLU activation function, which is applied to the output of the second convolutional layer. The output of this layer is shown in Figure 6.
Figure 6: The output of the ReLU activation function applied to the second convolutional layer.
The next layer is another max pooling layer with a 2×2 kernel, which further reduces the spatial dimensions from 14×14 to 7×7 while preserving the most important features. The output of this layer is shown in Figure 7.
Figure 7: The output of the second max pooling layer.
The next layer is a flattening layer, which reshapes the output of the second max pooling layer into a one-dimensional vector of length 1568 (32×7×7). This is done to prepare the data for the fully connected layers that follow. The output of this layer is shown in Figure 8.
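With PyTorch's `nn.Flatten`, the reshape and the resulting vector length can be checked directly:

```python
import torch
import torch.nn as nn

flatten = nn.Flatten()  # flattens every dimension except the batch dimension

# The 32 pooled 7x7 feature maps from the previous stage.
x = torch.randn(1, 32, 7, 7)
flat = flatten(x)
print(flat.shape)  # torch.Size([1, 1568]), since 32 * 7 * 7 = 1568
```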
Figure 8: The output of the flattening layer.
The next layer is a fully connected layer, which takes the output of the flattening layer as input. This layer applies a linear transformation to the input data, reducing its dimensionality from 1568 to 128 neurons. The output of this layer is shown in Figure 9.
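In PyTorch this linear transformation (y = Wx + b, with W a 128×1568 weight matrix) is `nn.Linear`:

```python
import torch
import torch.nn as nn

fc1 = nn.Linear(in_features=1568, out_features=128)

x = torch.randn(1, 1568)   # the flattened feature vector
out = fc1(x)
print(out.shape)           # torch.Size([1, 128])
print(fc1.weight.shape)    # torch.Size([128, 1568]) -- the learned W
```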
Figure 9: The output of the first fully connected layer.
The next layer is another ReLU activation function, which is applied to the output of the first fully connected layer. The output of this layer is shown in Figure 10.
Figure 10: The output of the ReLU activation function applied to the first fully connected layer.
The final layer is another fully connected layer, which takes the output of the ReLU activation function as input. This layer applies a linear transformation to the input data, reducing the 128 neurons to 10 output logits, one for each class in the FashionMNIST dataset. The output of this layer is shown in Figure 11.
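The whole pipeline described above can be assembled as one `nn.Sequential` model; layer hyperparameters follow the text, except that the 3×3/padding=1 settings of the second convolution are an assumption inferred from the stated feature-map sizes. (This sketch has random weights — the real visualization uses the pre-trained network.)

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1x28x28  -> 16x28x28
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x28x28 -> 16x14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 16x14x14 -> 32x14x14
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x14x14 -> 32x7x7
    nn.Flatten(),                                 # 32x7x7   -> 1568
    nn.Linear(1568, 128),
    nn.ReLU(),
    nn.Linear(128, 10),                           # 10 class logits
)

logits = model(torch.randn(1, 1, 28, 28))
print(logits.shape)  # torch.Size([1, 10])
```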
Figure 11: The output of the second fully connected layer.
The final output of the network is a vector of 10 logits, one for each FashionMNIST class (T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot). The class with the highest logit value is the network's prediction. In this case, the predicted class is 0, which corresponds to "T-shirt/top".
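Turning the logits into a class label is just an `argmax` over the vector; the logit values below are made up for illustration, with class 0 deliberately the largest:

```python
import torch

classes = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
           "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Hypothetical logits from the final layer (illustrative values only).
logits = torch.tensor([4.2, -1.0, 0.3, 0.1, -0.5, -3.1, 2.0, -2.4, 0.0, -1.8])

pred = logits.argmax().item()
print(pred, classes[pred])  # 0 T-shirt/top
```

If probabilities are needed rather than a hard label, `torch.softmax(logits, dim=0)` converts the logits into a distribution over the 10 classes without changing which class wins.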