Famous Convolutional Neural Network Architectures – #1

In the last post, we went over the basics of a convolution layer. We discussed the smallest details of how it works and how we can stack multiple layers to define a complete neural network architecture.

Let’s go over some of the powerful Convolutional Neural Networks that laid the foundation of today’s Deep Learning based Computer Vision achievements.

INDEX

If you are here for a particular architecture, use the links below to jump to it.

  1. LeNet
  2. AlexNet
  3. VGGNet
  4. GoogLeNet/Inception
  5. ResNet

LeNet-5 – LeCun et al

LeNet-5, a 7 layer Convolutional Neural Network, was deployed in many banking systems to recognize hand-written numbers on cheques.

LeNet-5 – Architecture

The hand-written numbers were digitized into grayscale images of size 32×32 pixels. At that time, computational capacity was limited, and hence the technique wasn’t scalable to larger images.

Let’s understand the architecture of the model. The model contained 7 layers excluding the input layer. Since it is a relatively small architecture, let’s go layer by layer:

  1. Layer 1: A convolutional layer with a kernel size of 5×5, stride of 1×1 and 6 kernels in total. So the input image of size 32x32x1 gives an output of 28x28x6. Total params in layer = (5 * 5 * 1) * 6 + 6 (bias terms) = 156
  2. Layer 2: A pooling layer with a 2×2 kernel size, stride of 2×2 and 6 filters in total. This pooling layer acted a little differently from what we discussed in the previous post. The input values in each receptive field were summed up and then multiplied by a trainable coefficient (1 per filter), and the result was added to a trainable bias (1 per filter). Finally, a sigmoid activation was applied to the output. So, the input from the previous layer of size 28x28x6 gets sub-sampled to 14x14x6. Total params in layer = [1 (trainable coefficient) + 1 (trainable bias)] * 6 = 12
  3. Layer 3: Similar to Layer 1, this layer is a convolutional layer with the same configuration, except it has 16 filters instead of 6. So the input from the previous layer of size 14x14x6 gives an output of 10x10x16. Since each filter spans all 6 input channels, total params in layer = 5 * 5 * 6 * 16 + 16 = 2,416. (In the original paper this layer was only partially connected to the input maps, which reduces the count further, but treating it as a standard convolution gives 2,416.)
  4. Layer 4: Again, similar to Layer 2, this layer is a pooling layer, with 16 filters this time around. Remember, the outputs are passed through a sigmoid activation function. The input of size 10x10x16 from the previous layer gets sub-sampled to 5x5x16. Total params in layer = (1 + 1) * 16 = 32
  5. Layer 5: This time around we have a convolutional layer with a 5×5 kernel size and 120 filters. Since the input size is 5x5x16, there is no need to even consider strides: we get an output of 1x1x120. Total params in layer = 5 * 5 * 16 * 120 + 120 = 48,120
  6. Layer 6: This is a dense layer with 84 units. So, the input of 120 units is converted to 84 units. Total params = 120 * 84 + 84 = 10,164. The activation function used here (a scaled hyperbolic tangent) was rather a unique one. I’d say you can just try out any activation of your choice here, as the task is a pretty simple one by today’s standards.
  7. Output Layer: Finally, a dense layer with 10 units is used. Total params = 84 * 10 + 10 = 850.

Skipping over the details of the loss function used in the original paper and why it was chosen, I would suggest using cross-entropy loss with a softmax activation in the last layer. Try out different training schedules and learning rates.

LeNet-5 – CODE
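
Below is a minimal tf.keras sketch of the architecture described above. It swaps in ReLU, average pooling and a softmax output for the original sigmoid sub-sampling and RBF units, as suggested; treat it as an illustrative sketch, not the original implementation.

```python
# A minimal tf.keras sketch of LeNet-5 as described above.
# ReLU, plain average pooling and softmax replace the original activations.
from tensorflow import keras
from tensorflow.keras import layers

def build_lenet5(input_shape=(32, 32, 1), num_classes=10):
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(6, kernel_size=5, strides=1, activation="relu"),    # 32x32x1 -> 28x28x6
        layers.AveragePooling2D(pool_size=2, strides=2),                  # 28x28x6 -> 14x14x6
        layers.Conv2D(16, kernel_size=5, strides=1, activation="relu"),   # 14x14x6 -> 10x10x16
        layers.AveragePooling2D(pool_size=2, strides=2),                  # 10x10x16 -> 5x5x16
        layers.Conv2D(120, kernel_size=5, strides=1, activation="relu"),  # 5x5x16 -> 1x1x120
        layers.Flatten(),
        layers.Dense(84, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_lenet5()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```
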

AlexNet – Krizhevsky et al

In 2012, a jaw-dropping moment occurred when the deep neural network from Hinton’s group reduced the top-5 error rate from 26% to 15.3% in the world’s most significant computer vision challenge – ImageNet.

The network was very similar to LeNet but was much deeper and had around 60 million parameters.

AlexNet – Architecture

Well, that figure certainly looks scary. This is because the network was split into two halves, each trained simultaneously on a separate GPU. Let’s make this a little bit easier for us and bring a simpler version into the picture:

The architecture consists of 5 Convolutional Layers and 3 Fully Connected Layers. These 8 layers, combined with MaxPooling and the ReLU activation (design choices that were still uncommon at the time), gave the model its edge.

You can see the various layers and their configuration in the figure above. The layers are described in the table below:

Layer No | Layer Type | Configuration | Output Shape
1 | Convolution | kernel size=11x11, strides=4x4, filters=96, padding='valid' | 55x55x96
2 | MaxPooling | size=3x3, strides=2x2 | 27x27x96
3 | Convolution | kernel size=5x5, strides=1x1, filters=256, padding='same' | 27x27x256
4 | MaxPooling | size=3x3, strides=2x2 | 13x13x256
5 | Convolution | kernel size=3x3, strides=1x1, filters=384, padding='same' | 13x13x384
6 | Convolution | kernel size=3x3, strides=1x1, filters=384, padding='same' | 13x13x384
7 | Convolution | kernel size=3x3, strides=1x1, filters=256, padding='same' | 13x13x256
8 | MaxPooling | size=3x3, strides=2x2 | 6x6x256 = 9216
9 | Fully Connected | units=4096 | 4096
10 | Fully Connected | units=4096 | 4096
11 | Fully Connected | units=1000, softmax activation | 1000

Note: ReLU activation is applied to the output of every Convolution and Fully Connected layer except the last softmax layer.

Various other techniques were used by the authors (a few of which will be discussed in upcoming posts) – dropout, data augmentation, and Stochastic Gradient Descent with momentum.

AlexNet – CODE
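
Here is a minimal tf.keras sketch that follows the table above. It is a single-stream version (no two-GPU split, no local response normalization) and assumes the commonly used 227x227x3 input size.

```python
# A minimal single-stream tf.keras sketch of AlexNet as described in the table.
from tensorflow import keras
from tensorflow.keras import layers

def build_alexnet(input_shape=(227, 227, 3), num_classes=1000):
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(96, 11, strides=4, padding="valid", activation="relu"),  # -> 55x55x96
        layers.MaxPooling2D(pool_size=3, strides=2),                           # -> 27x27x96
        layers.Conv2D(256, 5, strides=1, padding="same", activation="relu"),   # -> 27x27x256
        layers.MaxPooling2D(pool_size=3, strides=2),                           # -> 13x13x256
        layers.Conv2D(384, 3, strides=1, padding="same", activation="relu"),   # -> 13x13x384
        layers.Conv2D(384, 3, strides=1, padding="same", activation="relu"),   # -> 13x13x384
        layers.Conv2D(256, 3, strides=1, padding="same", activation="relu"),   # -> 13x13x256
        layers.MaxPooling2D(pool_size=3, strides=2),                           # -> 6x6x256
        layers.Flatten(),                                                      # -> 9216
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_alexnet()
model.summary()  # roughly 60 million parameters
```
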

VGGNet – Simonyan et al

The runner-up of the 2014 ImageNet challenge was VGGNet. Because of the simplicity of its uniform architecture, it appeals to newcomers as a simpler form of a deep convolutional neural network.

In future posts, we will see how this network became one of the most used choices for feature extraction from images (converting an image into a lower-dimensional array that retains the important information about the image).

VGGNet – Architecture

VGGNet has 2 simple rules of thumb to be followed:

  1. Each Convolutional layer has the configuration – kernel size = 3×3, stride = 1×1, padding = same. The only thing that differs is the number of filters.
  2. Each Max Pooling layer has the configuration – window size = 2×2 and stride = 2×2. Thus, we halve the spatial size of the feature map at every pooling layer (see the helper sketched right after this list).
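
These two rules make the network very easy to express in code. A small helper that builds one such stage (n convolutions followed by a max pool) might look like this, assuming the tf.keras API:

```python
from tensorflow import keras
from tensorflow.keras import layers

def vgg_stage(x, num_convs, filters):
    """num_convs 3x3 'same'-padded convolutions with ReLU, then a 2x2 max pool."""
    for _ in range(num_convs):
        x = layers.Conv2D(filters, 3, strides=1, padding="same", activation="relu")(x)
    return layers.MaxPooling2D(pool_size=2, strides=2)(x)

# e.g. the first stage: 224x224x3 -> 224x224x64 -> 112x112x64
inputs = keras.Input(shape=(224, 224, 3))
x = vgg_stage(inputs, num_convs=2, filters=64)
```
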

The input image was an RGB image of 224×224 pixels. So input size = 224x224x3

Stage | Layer No | Layer Type | Output
1 | 1 | Convolution (64 filters) | 224x224x64
1 | 2 | Convolution (64 filters) | 224x224x64
- | - | MaxPooling | 112x112x64
2 | 1 | Convolution (128 filters) | 112x112x128
2 | 2 | Convolution (128 filters) | 112x112x128
- | - | MaxPooling | 56x56x128
3 | 1 | Convolution (256 filters) | 56x56x256
3 | 2 | Convolution (256 filters) | 56x56x256
3 | 3 | Convolution (256 filters) | 56x56x256
- | - | MaxPooling | 28x28x256
4 | 1 | Convolution (512 filters) | 28x28x512
4 | 2 | Convolution (512 filters) | 28x28x512
4 | 3 | Convolution (512 filters) | 28x28x512
- | - | MaxPooling | 14x14x512
5 | 1 | Convolution (512 filters) | 14x14x512
5 | 2 | Convolution (512 filters) | 14x14x512
5 | 3 | Convolution (512 filters) | 14x14x512
- | - | MaxPooling | 7x7x512
- | - | Fully Connected (4096 units) | 4096
- | - | Fully Connected (4096 units) | 4096
- | - | Fully Connected (1000 units) | 1000
- | - | Softmax | 1000

Total Params = 138 million. Most of these parameters are contributed by fully connected layers.

  • The first FC layer contributes = 4096 * (7 * 7 * 512) + 4096 = 102,764,544
  • The second FC layer contributes = 4096 * 4096 + 4096 = 16,781,312
  • The third FC layer contributes = 4096 * 1000 + 1000 = 4,097,000

Total params contributed by FC layers = 123,642,856.
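
A quick sanity check of these numbers in Python:

```python
# Parameter count of a dense layer = inputs * outputs + outputs (biases)
fc1 = 4096 * (7 * 7 * 512) + 4096   # 102,764,544
fc2 = 4096 * 4096 + 4096            # 16,781,312
fc3 = 4096 * 1000 + 1000            # 4,097,000
print(fc1 + fc2 + fc3)              # 123,642,856
```
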

VGGNet – CODE
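
Putting the two rules together with the table above, a VGG-16 sketch in tf.keras (reusing the vgg_stage helper sketched earlier) could look like this. If you only need pretrained features, tf.keras also ships a pretrained model under keras.applications.VGG16.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_vgg16(input_shape=(224, 224, 3), num_classes=1000):
    inputs = keras.Input(shape=input_shape)
    x = inputs
    # (number of convolutions, filters) per stage, as in the table above
    for num_convs, filters in [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]:
        x = vgg_stage(x, num_convs, filters)      # helper defined earlier
    x = layers.Flatten()(x)                       # 7x7x512 -> 25088
    x = layers.Dense(4096, activation="relu")(x)
    x = layers.Dense(4096, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model = build_vgg16()
model.summary()  # ~138 million parameters
```
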

GoogLeNet/Inception – Szegedy et al

The winner of the 2014 ImageNet competition – GoogLeNet (a.k.a. Inception v1) – achieved a top-5 error rate of 6.67%. It used inception modules, a novel concept with smaller convolutions, which allowed the number of parameters to be reduced to a mere 4 million.

Inception Modules. Source: Going deeper with convolutions

Reasons for using these inception modules:

  1. Each layer type extracts different information from the input. The information gathered by a 3×3 convolution will differ from that gathered by a 5×5 convolution. How do we know which transformation will be best at a given layer? So we use them all!
  2. Dimensionality reduction using 1×1 convolutions! Consider a 128x128x256 input. If we pass it through 20 filters of size 1×1, we will get an output of 128x128x20. So, in the inception block, we apply 1×1 convolutions before the 3×3 and 5×5 convolutions to reduce the number of input channels going into those layers (a quick sketch follows this list).
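
A quick way to see the dimensionality reduction of point 2, assuming the tf.keras API:

```python
import numpy as np
from tensorflow.keras import layers

# A dummy 128x128 feature map with 256 channels (batch size 1)
x = np.random.rand(1, 128, 128, 256).astype("float32")
reduce_1x1 = layers.Conv2D(filters=20, kernel_size=1)   # 20 filters of size 1x1
print(reduce_1x1(x).shape)                               # (1, 128, 128, 20)
```
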

GoogLeNet/Inception – Architecture

The complete inception architecture:

You might see some “auxiliary classifiers” with softmax in this structure. Quoting the paper on this one – “By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages in the classifier, increase the gradient signal that gets propagated back, and provide additional regularization.”

But what does that mean? Here is what they meant by each phrase:

  1. discrimination in the lower stages: The lower layers of the network are also trained with gradients coming from a classifier attached at an earlier stage. This makes sure the network develops some ability to discriminate between different objects early on.
  2. increase the gradient signal that gets propagated back: In deep neural networks, the gradients flowing back (via backpropagation) often become so small that the earlier layers of the network hardly learn. The auxiliary classifiers help by propagating a strong gradient signal back to these earlier layers.
  3. provide additional regularization: Deep neural networks tend to overfit the data (high variance) while small neural networks tend to underfit (high bias). The auxiliary classifiers regularize the overfitting effect of the deeper layers!

Structure of Auxiliary classifiers:

Layer No | Layer Type | Configuration
1 | Average Pooling | kernel size=5x5, strides=3x3
2 | Convolution | kernel size=1x1, filters=128, activation=ReLU
3 | Fully Connected | units=1024, activation=ReLU
4 | Dropout | rate=0.7
5 | Fully Connected | units=1000, activation=softmax
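
A minimal tf.keras sketch of this auxiliary head (a Flatten is added before the dense layers; the example shapes assume it is attached to a 14x14 intermediate feature map, as in the full network):

```python
from tensorflow import keras
from tensorflow.keras import layers

def auxiliary_classifier(x, num_classes=1000):
    """Auxiliary head following the table above."""
    x = layers.AveragePooling2D(pool_size=5, strides=3)(x)    # e.g. 14x14 -> 4x4
    x = layers.Conv2D(128, kernel_size=1, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dropout(0.7)(x)
    return layers.Dense(num_classes, activation="softmax")(x)

# e.g. attached to a 14x14x512 intermediate feature map
inputs = keras.Input(shape=(14, 14, 512))
aux_out = auxiliary_classifier(inputs)
```
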
Auxiliary classifiers were attached to the outputs of inception blocks 4a and 4d.

GoogLeNet Architecture:
Note: Here,
  • #1×1 represents the number of filters in the 1×1 convolution branch of the inception module.
  • #3×3 reduce represents the number of filters in the 1×1 convolution applied before the 3×3 convolution in the inception module.
  • #5×5 reduce represents the number of filters in the 1×1 convolution applied before the 5×5 convolution in the inception module.
  • #3×3 represents the number of filters in the 3×3 convolution in the inception module.
  • #5×5 represents the number of filters in the 5×5 convolution in the inception module.
  • Pool Proj represents the number of filters in the 1×1 convolution applied after the max pooling inside the inception module.

Block | Stage | Layer Type | Configuration | Output Shape | #1x1 | #3x3 reduce | #3x3 | #5x5 reduce | #5x5 | pool proj
1 | - | Convolution | kernel size=7x7, strides=2x2, filters=64 | 112x112x64 | - | - | - | - | - | -
1 | - | Max Pool | size=3x3, strides=2x2 | 56x56x64 | - | - | - | - | - | -
2 | - | Convolution | kernel size=1x1, filters=64 | 56x56x64 | - | - | - | - | - | -
2 | - | Convolution | kernel size=3x3, strides=1x1, filters=192, padding=same | 56x56x192 | - | - | - | - | - | -
2 | - | Max Pool | size=3x3, strides=2x2 | 28x28x192 | - | - | - | - | - | -
3 | a | Inception | - | 28x28x256 | 64 | 96 | 128 | 16 | 32 | 32
3 | b | Inception | - | 28x28x480 | 128 | 128 | 192 | 32 | 96 | 64
3 | - | Max Pool | size=3x3, strides=2x2 | 14x14x480 | - | - | - | - | - | -
4 | a | Inception | - | 14x14x512 | 192 | 96 | 208 | 16 | 48 | 64
4 | b | Inception | - | 14x14x512 | 160 | 112 | 224 | 24 | 64 | 64
4 | c | Inception | - | 14x14x512 | 128 | 128 | 256 | 24 | 64 | 64
4 | d | Inception | - | 14x14x528 | 112 | 144 | 288 | 32 | 64 | 64
4 | e | Inception | - | 14x14x832 | 256 | 160 | 320 | 32 | 128 | 128
4 | - | Max Pool | size=3x3, strides=2x2 | 7x7x832 | - | - | - | - | - | -
5 | a | Inception | - | 7x7x832 | 256 | 160 | 320 | 32 | 128 | 128
5 | b | Inception | - | 7x7x1024 | 384 | 192 | 384 | 48 | 128 | 128
6 | - | Avg Pool | size=7x7, strides=1x1 | 1x1x1024 | - | - | - | - | - | -
6 | - | Dropout | p=0.4 | 1x1x1024 | - | - | - | - | - | -
7 | - | Fully Connected | units=1000, activation=softmax | 1000 | - | - | - | - | - | -

Training also relied on aggressive image distortions for augmentation; batch normalization and RMSprop (introduced in later Inception versions) are things we will discuss in future posts.

GoogLeNet/Inception – CODE
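
A tf.keras sketch of a single inception module, using the column names from the table above (so, for example, block 3a would be inception_module(x, 64, 96, 128, 16, 32, 32)). The full GoogLeNet can be assembled by stacking these modules as per the table.

```python
from tensorflow import keras
from tensorflow.keras import layers

def inception_module(x, f_1x1, f_3x3_reduce, f_3x3, f_5x5_reduce, f_5x5, f_pool_proj):
    """One inception module: four parallel branches concatenated along the channel axis."""
    branch1 = layers.Conv2D(f_1x1, 1, padding="same", activation="relu")(x)

    branch2 = layers.Conv2D(f_3x3_reduce, 1, padding="same", activation="relu")(x)
    branch2 = layers.Conv2D(f_3x3, 3, padding="same", activation="relu")(branch2)

    branch3 = layers.Conv2D(f_5x5_reduce, 1, padding="same", activation="relu")(x)
    branch3 = layers.Conv2D(f_5x5, 5, padding="same", activation="relu")(branch3)

    branch4 = layers.MaxPooling2D(pool_size=3, strides=1, padding="same")(x)
    branch4 = layers.Conv2D(f_pool_proj, 1, padding="same", activation="relu")(branch4)

    return layers.concatenate([branch1, branch2, branch3, branch4])

# Example: block 3a from the table, applied to a 28x28x192 input
inputs = keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs, 64, 96, 128, 16, 32, 32)   # -> 28x28x256
model = keras.Model(inputs, outputs)
```
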

ResNet – Kaiming He et al

The 2015 ImageNet competition was won with a top-5 error rate of 3.57%, which is lower than the typical human top-5 error. This was due to the ResNet (Residual Network) model used by Microsoft in the competition. The network introduced a novel approach – “skip connections”.

ResNet Block Displaying skip connection. Source: Deep Residual Learning for Image Recognition

The idea came out as a solution to an observation: deep neural networks perform worse as we keep adding layers. But intuitively speaking, this should not be the case. If a network with k layers performs as well as y, then a network with k+1 layers should perform at least as well as y.

The observation brought about a hypothesis: direct mappings are hard to learn. So instead of learning the mapping between the output of a stack of layers and its input, learn the difference between them – learn the residual.

Say x is the input and H(x) is the desired output. Then we need to learn F(x) = H(x) – x. We can do this by first having the layers learn F(x) and then adding x back to F(x), thereby recovering H(x). As a result, we send the same H(x) into the next layer as we were supposed to before! This gives rise to the residual block we saw above.

The results were amazing, as the vanishing gradient problem, which usually makes deep neural networks numb to learning, was greatly reduced. How? The skip connections, or shortcuts as we might call them, give the gradients a shortcut back to the earlier layers, skipping the bunch of layers in between.
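
A minimal tf.keras sketch of the basic residual block (two 3x3 convolutions plus an identity shortcut, with the batch normalization used in the paper). It assumes the input already has the same number of channels as `filters`, so the shortcut can be a plain identity.

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters):
    """Basic residual block: the layers learn F(x); x is added back to give H(x) = F(x) + x."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])      # the skip connection
    return layers.Activation("relu")(y)

# e.g. one block operating on a 56x56x64 feature map (channels must match for the identity shortcut)
inputs = keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, filters=64)
```
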

ResNet – Architecture

34 layer deep ResNet. Source: Deep Residual Learning for Image Recognition

The paper – Deep Residual Learning for Image Recognition has a well-defined table describing the architecture. Let’s use it here:

Layer Name | Output Size | 18-layer | 34-layer | 50-layer | 101-layer | 152-layer
conv1 | 112x112 | 7x7, 64, stride 2 (all variants)
- | 56x56 | 3x3 max pool, stride 2 (all variants)
conv2_x | 56x56 | [3x3, 64; 3x3, 64] x 2 | [3x3, 64; 3x3, 64] x 3 | [1x1, 64; 3x3, 64; 1x1, 256] x 3 | [1x1, 64; 3x3, 64; 1x1, 256] x 3 | [1x1, 64; 3x3, 64; 1x1, 256] x 3
conv3_x | 28x28 | [3x3, 128; 3x3, 128] x 2 | [3x3, 128; 3x3, 128] x 4 | [1x1, 128; 3x3, 128; 1x1, 512] x 4 | [1x1, 128; 3x3, 128; 1x1, 512] x 4 | [1x1, 128; 3x3, 128; 1x1, 512] x 8
conv4_x | 14x14 | [3x3, 256; 3x3, 256] x 2 | [3x3, 256; 3x3, 256] x 6 | [1x1, 256; 3x3, 256; 1x1, 1024] x 6 | [1x1, 256; 3x3, 256; 1x1, 1024] x 23 | [1x1, 256; 3x3, 256; 1x1, 1024] x 36
conv5_x | 7x7 | [3x3, 512; 3x3, 512] x 2 | [3x3, 512; 3x3, 512] x 3 | [1x1, 512; 3x3, 512; 1x1, 2048] x 3 | [1x1, 512; 3x3, 512; 1x1, 2048] x 3 | [1x1, 512; 3x3, 512; 1x1, 2048] x 3
- | 1x1 | Global Average Pooling, Fully Connected (1000 units), Softmax (all variants)

The paper mentions the usage of bottleneck blocks for the deeper ResNets – 50/101/152. Instead of the residual block shown above, these networks use a stack of 1×1, 3×3 and 1×1 convolutions, where the first 1×1 convolution reduces the number of channels (keeping the 3×3 convolution cheap) and the last one expands it back.

ResNet – CODE
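
Building on the basic block sketched earlier, a bottleneck block for the deeper variants (50/101/152) could look like this in tf.keras. A 1x1 projection is applied to the shortcut when the spatial size or channel count changes; this is a sketch of the idea, not the reference implementation.

```python
from tensorflow.keras import layers

def bottleneck_block(x, filters, stride=1):
    """Bottleneck residual block: 1x1 reduce -> 3x3 -> 1x1 expand (4x filters)."""
    shortcut = x
    y = layers.Conv2D(filters, 1, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, strides=1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(4 * filters, 1, strides=1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut when the spatial size or channel count changes
    if stride != 1 or shortcut.shape[-1] != 4 * filters:
        shortcut = layers.Conv2D(4 * filters, 1, strides=stride, padding="same")(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)
```

Stacking these blocks according to the table above (3, 4, 6, 3 bottleneck blocks in conv2_x through conv5_x, for example) gives ResNet-50.
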

This brings us to the end of this post. In the next post, we will discuss some newer architectures that have drawn quite a lot of attention in today’s deep learning world!

USEFUL DEEP LEARNING BOOKS

  1. Deep Learning with Python
  2. Deep Learning: A Practitioner’s Approach
  3. Deep Learning Book
  4. Hands-On Machine Learning with Scikit-Learn and TensorFlow

