Famous Convolutional Neural Network Architectures – #1
I'm Piyush Malhotra, a Delhilite who loves to dig Deep in the woods of Artificial Intelligence. I like to find new ways to solve not so new but interesting problems. Fitting new models to data and articulating new ways to manipulate and personify things is what I think my field is all about. When not working or playing with data, you'll find me in the gym or writing new blog posts.
November 9, 2018
In the last post, we went over the basics of a convolution layer. We discussed the smallest details of how it works and how we can stack multiple layers to define a complete neural network architecture.
Let’s go over some of the powerful Convolutional Neural Networks that laid the foundation for today’s Deep Learning based Computer Vision achievements.
INDEX
If you are here for a particular architecture, use the links below to jump to it.
LeNet5 – LeCun et al
LeNet5, a 7-layer Convolutional Neural Network, was deployed in many banking systems to recognize handwritten numbers on cheques.
LeNet5 – Architecture
The handwritten numbers were digitized into 32×32 grayscale images. At that time, computational capacity was limited, and hence the technique wasn’t scalable to large-scale images.
Let’s understand the architecture of the model. The model contained 7 layers excluding the input layer. Since it is a relatively small architecture, let’s go layer by layer:
 Layer 1: A convolutional layer with a kernel size of 5×5, a stride of 1×1 and 6 kernels in total. So the input image of size 32x32x1 gives an output of 28x28x6. Total params in layer = 5 * 5 * 1 * 6 + 6 (bias terms) = 156
 Layer 2: A pooling layer with a 2×2 kernel size, a stride of 2×2 and 6 kernels in total. This pooling layer acted a little differently from what we discussed in the previous post. The input values in the receptive field were summed up, multiplied by a trainable coefficient (1 per filter), and the result was then added to a trainable bias (1 per filter). Finally, sigmoid activation was applied to the output. So, the input from the previous layer of size 28x28x6 gets subsampled to 14x14x6. Total params in layer = [1 (trainable coefficient) + 1 (trainable bias)] * 6 = 12
 Layer 3: Similar to Layer 1, this layer is a convolutional layer with the same configuration except it has 16 filters instead of 6. Since the input now has 6 channels, the input from the previous layer of size 14x14x6 gives an output of 10x10x16. Total params in layer = 5 * 5 * 6 * 16 + 16 = 2,416 (the original paper connected each of the 16 filters to only a subset of the 6 input maps, which reduces this to 1,516).
 Layer 4: Again, similar to Layer 2, this layer is a pooling layer with 16 filters this time around. Remember, the outputs are passed through the sigmoid activation function. The input of size 10x10x16 from the previous layer gets subsampled to 5x5x16. Total params in layer = (1 + 1) * 16 = 32
 Layer 5: This time around we have a convolutional layer with a 5×5 kernel size and 120 filters. There is no need to even consider strides as the input size is 5x5x16, so we get an output of 1x1x120. Total params in layer = 5 * 5 * 16 * 120 + 120 = 48,120
 Layer 6: This is a dense layer with 84 units. So, the input of 120 units is converted to 84 units. Total params = 84 * 120 + 84 = 10,164. The activation function used here was a rather unique one (a scaled hyperbolic tangent). I’d say you can just try out any activation of your choice here, as the task is a pretty simple one by today’s standards.
 Output Layer: Finally, a dense layer with 10 units is used. Total params = 84 * 10 + 10 = 850.
Skipping over the details of the loss function the paper used and why, I would suggest using cross-entropy loss with softmax activation in the last layer. Try out different training schedules and learning rates.
LeNet5 – CODE
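Below is a minimal Keras (TensorFlow 2.x) sketch of the architecture described above – an illustrative reconstruction, not the original implementation. Standard average pooling and tanh stand in for the paper’s trainable subsampling layers, and a softmax output replaces the original RBF units:

```python
from tensorflow.keras import layers, models

def lenet5(input_shape=(32, 32, 1), num_classes=10):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(6, kernel_size=5, activation='tanh'),    # 28x28x6
        layers.AveragePooling2D(pool_size=2, strides=2),       # 14x14x6
        layers.Conv2D(16, kernel_size=5, activation='tanh'),   # 10x10x16
        layers.AveragePooling2D(pool_size=2, strides=2),       # 5x5x16
        layers.Conv2D(120, kernel_size=5, activation='tanh'),  # 1x1x120
        layers.Flatten(),
        layers.Dense(84, activation='tanh'),
        layers.Dense(num_classes, activation='softmax'),
    ])
    # Cross-entropy loss with a softmax output, as suggested above
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```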
AlexNet – Krizhevsky et al
In 2012, a jaw-dropping moment occurred when AlexNet, the deep neural network from Krizhevsky, Sutskever and Hinton, reduced the top-5 error rate from 26% to 15.3% in the world’s most significant computer vision challenge – ImageNet.
The network was very similar to LeNet but was much deeper and had around 60 million parameters.
AlexNet – Architecture
Well, that figure certainly looks scary. This is because the network was split into two halves, each half trained on a separate GPU. Let’s make this a little easier and bring a simpler version into the picture:
Source: Deep Learning Specialization
The architecture consists of 5 Convolutional Layers and 3 Fully Connected Layers. These 8 layers, combined with ReLU activation and overlapping MaxPooling – ideas that were far from standard in large networks at the time – gave the model its edge.
You can see the various layers and their configuration in the figure above. The layers are described in the table below:
| Layer No | Layer Type | Configuration | Output Shape |
|---|---|---|---|
| 1 | Convolution | kernel size=11x11, strides=4x4, filters=96, padding='valid' | 55x55x96 |
| 2 | MaxPooling | size=3x3, strides=2x2 | 27x27x96 |
| 3 | Convolution | kernel size=5x5, strides=1x1, filters=256, padding='same' | 27x27x256 |
| 4 | MaxPooling | size=3x3, strides=2x2 | 13x13x256 |
| 5 | Convolution | kernel size=3x3, strides=1x1, filters=384, padding='same' | 13x13x384 |
| 6 | Convolution | kernel size=3x3, strides=1x1, filters=384, padding='same' | 13x13x384 |
| 7 | Convolution | kernel size=3x3, strides=1x1, filters=256, padding='same' | 13x13x256 |
| 8 | MaxPooling | size=3x3, strides=2x2 | 6x6x256 = 9216 |
| 9 | Fully Connected | units=4096 | 4096 |
| 10 | Fully Connected | units=4096 | 4096 |
| 11 | Fully Connected | units=1000, softmax activation | 1000 |
Note: ReLU activation is applied to the output of every Convolution and Fully Connected layer except the last softmax layer.
Various other techniques were used by the authors (a few of them will be discussed in upcoming posts) – dropout, data augmentation and Stochastic Gradient Descent with momentum.
AlexNet – CODE
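Below is a minimal single-stream Keras (TensorFlow 2.x) sketch of the table above – an illustrative reconstruction, not the original two-GPU implementation. Local Response Normalization is omitted, and a 227×227 input is assumed so the 55×55 first-layer output works out with ‘valid’ padding:

```python
from tensorflow.keras import layers, models

def alexnet(input_shape=(227, 227, 3), num_classes=1000):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(96, 11, strides=4, activation='relu'),       # 55x55x96
        layers.MaxPooling2D(pool_size=3, strides=2),                # 27x27x96
        layers.Conv2D(256, 5, padding='same', activation='relu'),   # 27x27x256
        layers.MaxPooling2D(pool_size=3, strides=2),                # 13x13x256
        layers.Conv2D(384, 3, padding='same', activation='relu'),   # 13x13x384
        layers.Conv2D(384, 3, padding='same', activation='relu'),   # 13x13x384
        layers.Conv2D(256, 3, padding='same', activation='relu'),   # 13x13x256
        layers.MaxPooling2D(pool_size=3, strides=2),                # 6x6x256
        layers.Flatten(),                                           # 9216
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),                                        # dropout as in the paper
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax'),
    ])
```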
VGGNet – Simonyan et al
The runner-up of the 2014 ImageNet challenge was VGGNet. Because of the simplicity of its uniform architecture, it appeals to newcomers as a simpler form of a deep convolutional neural network.
In future posts, we will see how this network is one of the most used choices for feature extraction from images (taking images and converting them to a smaller dimensional array that contains important information regarding the image).
VGGNet – Architecture
VGGNet has 2 simple rules of thumb to be followed:
 Each Convolutional layer has the configuration – kernel size = 3×3, stride = 1×1, padding = same. The only thing that differs is the number of filters.
 Each Max Pooling layer has the configuration – window size = 2×2 and stride = 2×2. Thus, we halve the size of the image at every Pooling layer.
The input image was an RGB image of 224×224 pixels. So input size = 224x224x3
| Stage | Layer No | Layer Type | Output |
|---|---|---|---|
| 1 | 1 | Convolution (64 filters) | 224x224x64 |
| 1 | 2 | Convolution (64 filters) | 224x224x64 |
|  |  | MaxPooling | 112x112x64 |
| 2 | 1 | Convolution (128 filters) | 112x112x128 |
| 2 | 2 | Convolution (128 filters) | 112x112x128 |
|  |  | MaxPooling | 56x56x128 |
| 3 | 1 | Convolution (256 filters) | 56x56x256 |
| 3 | 2 | Convolution (256 filters) | 56x56x256 |
| 3 | 3 | Convolution (256 filters) | 56x56x256 |
|  |  | MaxPooling | 28x28x256 |
| 4 | 1 | Convolution (512 filters) | 28x28x512 |
| 4 | 2 | Convolution (512 filters) | 28x28x512 |
| 4 | 3 | Convolution (512 filters) | 28x28x512 |
|  |  | MaxPooling | 14x14x512 |
| 5 | 1 | Convolution (512 filters) | 14x14x512 |
| 5 | 2 | Convolution (512 filters) | 14x14x512 |
| 5 | 3 | Convolution (512 filters) | 14x14x512 |
|  |  | MaxPooling | 7x7x512 |
|  |  | Fully Connected (4096 units) | 4096 |
|  |  | Fully Connected (4096 units) | 4096 |
|  |  | Fully Connected (1000 units) | 1000 |
|  |  | Softmax | 1000 |
Total Params = 138 million. Most of these parameters are contributed by fully connected layers.
 The first FC layer contributes = 4096 * (7 * 7 * 512) + 4096 = 102,764,544
 The second FC layer contributes = 4096 * 4096 + 4096 = 16,781,312
 The third FC layer contributes = 1000 * 4096 + 1000 = 4,097,000
Total params contributed by FC layers = 123,642,856.
VGGNet – CODE
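Below is a minimal Keras (TensorFlow 2.x) sketch of VGG-16 built from the two rules of thumb above – an illustrative reconstruction, not the original implementation:

```python
from tensorflow.keras import layers, models

def vgg16(input_shape=(224, 224, 3), num_classes=1000):
    def conv_stage(x, filters, n_convs):
        # n_convs 3x3 'same' convolutions, then a 2x2 max pool that halves the size
        for _ in range(n_convs):
            x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        return layers.MaxPooling2D(pool_size=2, strides=2)(x)

    inputs = layers.Input(shape=input_shape)
    x = conv_stage(inputs, 64, 2)      # 112x112x64
    x = conv_stage(x, 128, 2)          # 56x56x128
    x = conv_stage(x, 256, 3)          # 28x28x256
    x = conv_stage(x, 512, 3)          # 14x14x512
    x = conv_stage(x, 512, 3)          # 7x7x512
    x = layers.Flatten()(x)
    x = layers.Dense(4096, activation='relu')(x)
    x = layers.Dense(4096, activation='relu')(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return models.Model(inputs, outputs)
```

Calling vgg16().summary() should report roughly 138 million parameters, matching the count above.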
GoogLeNet/Inception – Szegedy et al
The winner of the 2014 ImageNet competition, GoogLeNet (a.k.a. Inception v1), achieved a top-5 error rate of 6.67%. It used a novel building block, the inception module, whose smaller convolutions brought the number of parameters down to a mere 4 million.
Inception Modules. Source: Going deeper with convolutions
Reasons for using these inception modules:
 Each layer type extracts different information from the input. Information gathered from a 3×3 convolution will differ from that gathered by a 5×5 convolution. How do we know which transformation will be best at a given layer? We use them all!
 Dimensionality reduction using 1×1 convolutions! Consider a 128x128x256 input. If we pass it through 20 filters of size 1×1, we will get an output of 128x128x20. So, in the inception block, 1×1 convolutions are applied before the 3×3 and 5×5 convolutions to decrease the number of input channels to those layers (see the short shape check after this list).
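To make the reduction concrete, here is a quick shape check – a minimal sketch assuming TensorFlow 2.x / Keras, not code from the original post:

```python
import tensorflow as tf
from tensorflow.keras import layers

# A batch of one 128x128 feature map with 256 channels
x = tf.random.normal((1, 128, 128, 256))

# Twenty 1x1 filters act purely across channels: the spatial size is unchanged,
# but the channel dimension drops from 256 to 20.
reduce_1x1 = layers.Conv2D(filters=20, kernel_size=1, activation='relu')
print(reduce_1x1(x).shape)   # (1, 128, 128, 20)
```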
GoogLeNet/Inception – Architecture
The complete inception architecture:
Source: Going deeper with convolutions
You might see some “auxiliary classifiers” with softmax in this structure. Quoting paper here on this one – “By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages in the classifier, increase the gradient signal that gets propagated back, and provide additional regularization.”
But what does that mean? Here is what they meant by each phrase:
 discrimination in the lower stages: the lower layers of the network are trained with gradients coming from an earlier-stage classifier, so the network develops some ability to discriminate between different objects early on.
 increase the gradient signal that gets propagated back: in deep neural networks, the gradients flowing back (via backpropagation) often become so small that the earlier layers hardly learn. The earlier classification layers help by propagating a strong gradient signal back to those layers.
 provide additional regularization: deep neural networks tend to overfit (high variance) while small neural networks tend to underfit (high bias). The earlier classifiers regularize the overfitting effect of the deeper layers!
Structure of Auxiliary classifiers:
| Layer No | Layer Type | Configuration |
|---|---|---|
| 1 | Average Pooling | kernel size=5x5, strides=3x3 |
| 2 | Convolution | kernel size=1x1, filters=128, activation=ReLU |
| 3 | Fully Connected | units=1024, activation=ReLU |
| 4 | Dropout | dropout ratio=0.7 |
| 5 | Fully Connected | units=1000, activation=softmax |
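Putting that table into code, here is a minimal Keras (TensorFlow 2.x) sketch of one auxiliary classifier head – an illustrative reconstruction, not the original implementation:

```python
from tensorflow.keras import layers

def auxiliary_classifier(x, num_classes=1000):
    # x is an intermediate feature map, e.g. the 14x14x512 output of inception block 4a
    x = layers.AveragePooling2D(pool_size=5, strides=3)(x)        # 4x4 spatially
    x = layers.Conv2D(128, kernel_size=1, activation='relu')(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation='relu')(x)
    x = layers.Dropout(0.7)(x)
    return layers.Dense(num_classes, activation='softmax')(x)
```

During training, the losses of these heads are added (with a small weight) to the main loss; at inference time they are simply discarded.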
Note: Here,
 #1×1 represents the filters in 1×1 convolution in inception module.
 #3×3 reduce represents the filters in 1×1 convolution before 3×3 convolution in inception module.
 #5×5 reduce represents the filters in 1×1 convolution before 5×5 convolution in inception module.
 #3×3 represents the filters in 3×3 convolution in inception module.
 #5×5 represents the filters in 5×5 convolution in inception module.
 Pool Proj represents the filters in the 1×1 convolution after the Max Pool in the inception module.
| Block | Stage | Layer Type | Configuration | Output Shape | #1x1 | #3x3 reduce | #3x3 | #5x5 reduce | #5x5 | pool proj |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 |  | Convolution | kernel size=7x7, strides=2x2, filters=64 | 112x112x64 |  |  |  |  |  |  |
| 1 |  | Max Pool | size=3x3, strides=2x2 | 56x56x64 |  |  |  |  |  |  |
| 2 |  | Convolution | kernel size=1x1, filters=64 | 56x56x64 |  |  |  |  |  |  |
| 2 |  | Convolution | kernel size=3x3, strides=1x1, filters=192, padding=same | 56x56x192 |  |  |  |  |  |  |
| 2 |  | Max Pool | size=3x3, strides=2x2 | 28x28x192 |  |  |  |  |  |  |
| 3 | a | Inception |  | 28x28x256 | 64 | 96 | 128 | 16 | 32 | 32 |
| 3 | b | Inception |  | 28x28x480 | 128 | 128 | 192 | 32 | 96 | 64 |
| 3 |  | Max Pool | size=3x3, strides=2x2 | 14x14x480 |  |  |  |  |  |  |
| 4 | a | Inception |  | 14x14x512 | 192 | 96 | 208 | 16 | 48 | 64 |
| 4 | b | Inception |  | 14x14x512 | 160 | 112 | 224 | 24 | 64 | 64 |
| 4 | c | Inception |  | 14x14x512 | 128 | 128 | 256 | 24 | 64 | 64 |
| 4 | d | Inception |  | 14x14x528 | 112 | 144 | 288 | 32 | 64 | 64 |
| 4 | e | Inception |  | 14x14x832 | 256 | 160 | 320 | 32 | 128 | 128 |
| 4 |  | Max Pool | size=3x3, strides=2x2 | 7x7x832 |  |  |  |  |  |  |
| 5 | a | Inception |  | 7x7x832 | 256 | 160 | 320 | 32 | 128 | 128 |
| 5 | b | Inception |  | 7x7x1024 | 384 | 192 | 384 | 48 | 128 | 128 |
| 6 |  | Avg Pool | size=7x7, strides=1x1 | 1x1x1024 |  |  |  |  |  |  |
| 6 |  | Dropout | p=0.4 | 1x1x1024 |  |  |  |  |  |  |
| 7 |  | Fully Connected | units=1000, activation=softmax | 1x1x1000 |  |  |  |  |  |  |
The original network was trained with aggressive image distortions and SGD with momentum; later Inception versions added batch normalization and RMSprop – things we will discuss in future posts.
GoogLeNet/Inception – CODE
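Below is a minimal Keras (TensorFlow 2.x) sketch of a single inception module – an illustrative reconstruction, not the original implementation. The filter counts in the usage example are those of block 3a from the table above:

```python
from tensorflow.keras import layers

def inception_module(x, f1x1, f3x3_reduce, f3x3, f5x5_reduce, f5x5, pool_proj):
    # Branch 1: plain 1x1 convolution
    branch1 = layers.Conv2D(f1x1, 1, activation='relu')(x)

    # Branch 2: 1x1 "reduce" followed by a 3x3 convolution
    branch2 = layers.Conv2D(f3x3_reduce, 1, activation='relu')(x)
    branch2 = layers.Conv2D(f3x3, 3, padding='same', activation='relu')(branch2)

    # Branch 3: 1x1 "reduce" followed by a 5x5 convolution
    branch3 = layers.Conv2D(f5x5_reduce, 1, activation='relu')(x)
    branch3 = layers.Conv2D(f5x5, 5, padding='same', activation='relu')(branch3)

    # Branch 4: 3x3 max pool followed by a 1x1 projection
    branch4 = layers.MaxPooling2D(pool_size=3, strides=1, padding='same')(x)
    branch4 = layers.Conv2D(pool_proj, 1, activation='relu')(branch4)

    # Concatenate all branches along the channel axis
    return layers.Concatenate(axis=-1)([branch1, branch2, branch3, branch4])

# Block 3a: a 28x28x192 input becomes 28x28x(64 + 128 + 32 + 32) = 28x28x256
inputs = layers.Input(shape=(28, 28, 192))
out_3a = inception_module(inputs, 64, 96, 128, 16, 32, 32)
```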
ResNet – Kaiming He et al
The 2015 ImageNet competition brought a top-5 error rate of 3.57%, which is lower than the human top-5 error. This was due to the ResNet (Residual Network) model used by Microsoft at the competition. The network introduced a novel approach – “skip connections”.
ResNet block displaying a skip connection. Source: Deep Residual Learning for Image Recognition
The idea came out of an observation – deep neural networks perform worse as we keep adding layers. But intuitively speaking, this should not be the case. If a network with k layers achieves performance y, then a network with k+1 layers should perform at least as well as y.
The observation brought about a hypothesis: direct mappings are hard to learn. So instead of learning the mapping between the output of a layer and its input, learn the difference between them – learn the residual.
Say x is the input and H(x) is the output to be learnt. Then we need to learn F(x) = H(x) – x. We can do this by first making a layer learn F(x) and then adding x back to F(x), thereby recovering H(x). As a result, the same H(x) flows into the next layer as before! This gives rise to the residual block we saw above.
The results were amazing, because the vanishing gradients problem, which usually makes deep neural networks numb to learning, was removed. How? The skip connections, or shortcuts as we might call them, give the gradients a shortcut back to the earlier layers, skipping a bunch of layers in between.
ResNet – Architecture
34-layer deep ResNet. Source: Deep Residual Learning for Image Recognition
The paper – Deep Residual Learning for Image Recognition – has a well-defined table describing the architecture. Let’s reproduce it here (each bracketed stack is one residual block, repeated the indicated number of times):
| Layer Name | Output Size | 18-layer | 34-layer | 50-layer | 101-layer | 152-layer |
|---|---|---|---|---|---|---|
| conv1 | 112x112 | 7x7, 64, stride 2 | same | same | same | same |
| max pool | 56x56 | 3x3 max pool, stride 2 | same | same | same | same |
| conv2_x | 56x56 | [3x3, 64; 3x3, 64] x 2 | [3x3, 64; 3x3, 64] x 3 | [1x1, 64; 3x3, 64; 1x1, 256] x 3 | [1x1, 64; 3x3, 64; 1x1, 256] x 3 | [1x1, 64; 3x3, 64; 1x1, 256] x 3 |
| conv3_x | 28x28 | [3x3, 128; 3x3, 128] x 2 | [3x3, 128; 3x3, 128] x 4 | [1x1, 128; 3x3, 128; 1x1, 512] x 4 | [1x1, 128; 3x3, 128; 1x1, 512] x 4 | [1x1, 128; 3x3, 128; 1x1, 512] x 8 |
| conv4_x | 14x14 | [3x3, 256; 3x3, 256] x 2 | [3x3, 256; 3x3, 256] x 6 | [1x1, 256; 3x3, 256; 1x1, 1024] x 6 | [1x1, 256; 3x3, 256; 1x1, 1024] x 23 | [1x1, 256; 3x3, 256; 1x1, 1024] x 36 |
| conv5_x | 7x7 | [3x3, 512; 3x3, 512] x 2 | [3x3, 512; 3x3, 512] x 3 | [1x1, 512; 3x3, 512; 1x1, 2048] x 3 | [1x1, 512; 3x3, 512; 1x1, 2048] x 3 | [1x1, 512; 3x3, 512; 1x1, 2048] x 3 |
| output | 1x1 | Global Average Pooling, Fully Connected (units=1000), Softmax | same | same | same | same |
The paper mentions the use of bottleneck blocks for the deeper ResNets – 50/101/152. Instead of the basic residual block shown above, these networks use 1×1 convolutions to first decrease and then restore the number of channels around the 3×3 convolution.
ResNet – CODE
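Below is a minimal Keras (TensorFlow 2.x) sketch of the basic residual block described above – an illustrative reconstruction, not the original implementation. When the spatial size or channel count changes, a 1×1 projection convolution is applied on the shortcut so the two tensors can be added:

```python
from tensorflow.keras import layers

def residual_block(x, filters, strides=1):
    shortcut = x

    # F(x): two 3x3 convolutions with batch normalization
    y = layers.Conv2D(filters, 3, strides=strides, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)

    # Match shapes on the shortcut path when downsampling or changing channels
    if strides != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=strides)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)

    # H(x) = F(x) + x, followed by ReLU
    y = layers.Add()([y, shortcut])
    return layers.Activation('relu')(y)
```

The bottleneck variant used in ResNet-50/101/152 swaps the two 3×3 convolutions for a 1×1 → 3×3 → 1×1 stack, reducing and then restoring the channel count.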
This brings us to the end of this post. In the next post, we will discuss some newer architectures that have drawn quite a lot of attention in today’s deep learning world!
This Post Has 6 Comments
Bodhisatwa Mandal
10 Nov 2018: Keep it up, great work.
Would love to see you adding more networks like the DenseNet and SqueezeNet
Piyush
10 Nov 2018: Thanks Bodhisatwa. Your appreciation motivates me to work hard. 🙂
Models like DenseNet and SqueezeNet will be coming up in part 2 and/or 3 of the series – “Famous Convolutional Neural Network Architectures”. I’m working on structuring the order, with some simpler topics in between.
Amiya Mandal
13 Nov 2018: Nice! Great work – everyone should refer to this post before implementing any network, whether it’s a vanilla or a customized model.
Piyush
13 Nov 2018: Thank you Amiya. 🙂