Famous Convolutional Neural Network Architectures – #2
January 30, 2019
It has certainly been a while since I last posted. Anyway, let's get this thing rolling! In the last post, we went over some variants of convolution operations. That made us ready to get into some of the more advanced and efficient Convolutional Neural Net (CNN) architectures!
Let's go over some of the powerful Convolutional Neural Networks that are having a big impact on the computer vision industry today.
INDEX
If you are here for a particular architecture, jump directly to that topic:
- MobileNets – Howard et al
- ResNeXt – Xie et al
- SqueezeNet – Iandola et al
- DenseNet – Huang et al
MobileNets – Howard et al
MobileNets, a class of efficient models, are based on depthwise separable convolutions. This lets the model drastically reduce the number of parameters required for convolution operations, and hence the size of the model!
Google researchers created this class of CNN architectures to make deep learning models accessible to smaller, less powerful devices like your smartphone. Let's have a look at its architecture!
MobileNet – Architecture
The architecture follows an easy-to-replicate pattern:
- Every conv layer that is not a pointwise (1×1) convolution has a 3×3 filter size.
- A stride of 2 is used in some intermediate convolutions for downsampling (instead of pooling layers).
- A global average pooling layer flattens the output before the classifier.
So, the architecture is:
| Layer Type | Params | Input Size |
|---|---|---|
| Conv | 3×3 filter size, 32 filters, stride 2 | 224×224×3 |
| DW Conv | 3×3 filter size, 32 filters, stride 1 | 112×112×32 |
| Conv | 1×1 filter size, 64 filters, stride 1 | 112×112×32 |
| DW Conv | 3×3 filter size, 64 filters, stride 2 | 112×112×64 |
| Conv | 1×1 filter size, 128 filters, stride 1 | 56×56×64 |
| DW Conv | 3×3 filter size, 128 filters, stride 1 | 56×56×128 |
| Conv | 1×1 filter size, 128 filters, stride 1 | 56×56×128 |
| DW Conv | 3×3 filter size, 128 filters, stride 2 | 56×56×128 |
| Conv | 1×1 filter size, 256 filters, stride 1 | 28×28×128 |
| DW Conv | 3×3 filter size, 256 filters, stride 1 | 28×28×256 |
| Conv | 1×1 filter size, 256 filters, stride 1 | 28×28×256 |
| DW Conv | 3×3 filter size, 256 filters, stride 2 | 28×28×256 |
| Conv | 1×1 filter size, 512 filters, stride 1 | 14×14×256 |
| 5× DW Conv | 3×3 filter size, 512 filters, stride 1 | 14×14×512 |
| 5× Conv | 1×1 filter size, 512 filters, stride 1 | 14×14×512 |
| DW Conv | 3×3 filter size, 512 filters, stride 2 | 14×14×512 |
| Conv | 1×1 filter size, 1024 filters, stride 1 | 7×7×512 |
| DW Conv | 3×3 filter size, 1024 filters, stride 1 | 7×7×1024 |
| Conv | 1×1 filter size, 1024 filters, stride 1 | 7×7×1024 |
| Avg Pool | 7×7 filter size | 7×7×1024 |
| FC | 1024×1000 | 1×1×1024 |
| Softmax | Classifier | 1×1×1000 |

(The two "5×" rows denote a depthwise/pointwise pair that repeats five times back-to-back.)
MobileNet – CODE
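The original snippet for this section isn't reproduced here, so below is a minimal tf.keras sketch of the MobileNet body from the table above. The helper name `dw_separable` is something I'm introducing for illustration, not part of any library.

```python
# Minimal sketch of the MobileNet body (illustrative, not the post's original code).
from tensorflow.keras import layers, models

def dw_separable(x, pointwise_filters, stride):
    """Depthwise 3x3 conv followed by a pointwise 1x1 conv, BN + ReLU after each."""
    x = layers.DepthwiseConv2D(3, strides=stride, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(pointwise_filters, 1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inputs = layers.Input((224, 224, 3))
# Initial standard convolution: 3x3, 32 filters, stride 2.
x = layers.Conv2D(32, 3, strides=2, padding='same', use_bias=False)(inputs)
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)

# (pointwise filters, stride of the depthwise conv) for each block in the table.
for filters, stride in ([(64, 1), (128, 2), (128, 1), (256, 2), (256, 1), (512, 2)]
                        + [(512, 1)] * 5 + [(1024, 2), (1024, 1)]):
    x = dw_separable(x, filters, stride)

x = layers.GlobalAveragePooling2D()(x)          # 7x7x1024 -> 1024
outputs = layers.Dense(1000, activation='softmax')(x)
model = models.Model(inputs, outputs)
```

Each depthwise separable block replaces a standard 3×3 convolution at roughly 8–9 times less computation, which is where MobileNet's efficiency comes from.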
ResNeXt – Xie et al
An extension of Deep Residual Networks inspired by the split-transform-merge strategy used in Inception blocks. Rather than applying a single residual block to the incoming feature map, ResNeXt uses several branches of operations (the number of branches is the cardinality of the network), merges their outputs into one, and then applies the residual connection!
Four things were discovered:
- Grouped convolutions led to specialization: each group focused on a different attribute of the input image, so each branch in a block ended up with its own specialized view of the input.
- The experiments showed that increasing cardinality is more effective at improving performance than increasing the depth or width of the network!
- Residual connections are a crucial part of optimization!
- Aggregated transformations give strong representations!
ResNeXt – Architecture
The ResNeXt architecture can be considered ResNet in disguise. A ResNeXt block has several branches, each of which is a simple feature extractor made of three stacked convolution layers that act as a bottleneck. The second convolution layer is a grouped convolution, i.e., the input channels are split into groups and each group gets its own set of filters (depthwise convolution is the extreme case with one channel per group). The other thing to keep in mind is that every convolution operation is followed by batch normalization and a ReLU. The complete architecture is shown in the table below.
| Layer Name | Output Size | ResNeXt-50 (32×4d) |
|---|---|---|
| conv1 | 112×112 | 7×7, 64 filters, stride 2 |
| maxpool | 56×56 | 3×3 max pool, stride 2 |
| conv2_x | 56×56 | [1×1, 128 → 3×3, 128, C=32 → 1×1, 256] × 3 |
| conv3_x | 28×28 | [1×1, 256 → 3×3, 256, C=32 → 1×1, 512] × 4 |
| conv4_x | 14×14 | [1×1, 512 → 3×3, 512, C=32 → 1×1, 1024] × 6 |
| conv5_x | 7×7 | [1×1, 1024 → 3×3, 1024, C=32 → 1×1, 2048] × 3 |
| | 1×1 | Global Average Pooling, Fully Connected (units=1000), Softmax |

(C=32 is the cardinality, i.e., the number of groups in the grouped 3×3 convolution.)
ResNeXt – CODE
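Again, in place of the missing original snippet, here is a hedged tf.keras sketch of a single ResNeXt bottleneck block. The name `resnext_block` and its parameters are illustrative, and it assumes a TensorFlow version whose `Conv2D` supports the `groups` argument (2.3+).

```python
# Sketch of one ResNeXt bottleneck block (illustrative, not the post's original code).
from tensorflow.keras import layers

def resnext_block(x, bottleneck_width=128, out_channels=256, cardinality=32, stride=1):
    """1x1 reduce -> grouped 3x3 conv (cardinality groups) -> 1x1 expand, plus residual."""
    shortcut = x
    y = layers.Conv2D(bottleneck_width, 1, use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    # Grouped 3x3 convolution: the channels are split into `cardinality` groups.
    y = layers.Conv2D(bottleneck_width, 3, strides=stride, padding='same',
                      groups=cardinality, use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(out_channels, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut with a 1x1 conv when the output shape changes.
    if stride != 1 or x.shape[-1] != out_channels:
        shortcut = layers.Conv2D(out_channels, 1, strides=stride, use_bias=False)(x)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))
```

For conv2_x of ResNeXt-50 (32×4d) you would apply `resnext_block(x, 128, 256, 32)` three times, using stride 2 in the first block of each later stage for downsampling.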
SqueezeNet – Iandola et al
Want a memory-efficient deep neural network that can work on embedded devices? This might be your go-to network! Let me tell you a secret: SqueezeNet achieves AlexNet-like accuracy on ImageNet with 50 times fewer parameters. Interesting, isn't it?
The following three main strategies were employed to do so (see the quick parameter check after this list):
- Replace 3×3 filters with 1×1 filters. A 1×1 conv has 9 times fewer params than a 3×3 conv.
- Decrease the number of input channels to the 3×3 filters. The number of params in a 3×3 conv layer = (no. of input channels) × (no. of output channels) × 3 × 3, so reducing the input channels reduces the params!
- Downsample late in the network so that convolution layers have large activation maps. The paper "Convolutional neural networks at constrained time cost" by He and Sun shows that delayed downsampling leads to higher accuracy, which gave the authors the intuition to downsample late and work with large activation maps.
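A quick back-of-the-envelope check of strategies 1 and 2 (a tiny illustrative snippet, not from the paper):

```python
# Conv parameter count (ignoring biases): in_channels * out_channels * k * k.
def conv_params(in_ch, out_ch, k):
    return in_ch * out_ch * k * k

print(conv_params(64, 64, 3))  # 36864 params for a 3x3 conv
print(conv_params(64, 64, 1))  # 4096 params for a 1x1 conv -> 9x fewer (strategy 1)
print(conv_params(16, 64, 3))  # 9216: squeezing inputs 64 -> 16 cuts 3x3 params 4x (strategy 2)
```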
SqueezeNet – Architecture
The SqueezeNet architecture is made up of what the authors call a "Fire module". Sure is a godly name, "Lord of Light". (Okay, okay, no more Game of Thrones references, Piyush!) Let's move on.
So, what is a Fire module? It is essentially a fancy name for a bottleneck layer.
The Fire module contains two things:
- A squeeze layer that, as the name suggests, squeezes the input channels!
- An expand layer that, again as the name suggests, expands the input channels!
The expand layer of the Fire module contains two types of convolutions:
- A 1×1 convolution, inspired by strategy 1 discussed above.
- A 3×3 convolution.
The number of filters in the squeeze layer is kept smaller than the total number of filters in the 1×1 and 3×3 expand convolutions combined. This is inspired by strategy 2 discussed above.
Now, let's have a look at the complete architecture, which comes in three forms:
- A simple SqueezeNet with no shortcut connections.
- A SqueezeNet with "simple bypass" connections, placed between layers that have the same number of output channels.
- A SqueezeNet with "complex bypass" connections, which use 1×1 convolutions to add shortcuts between layers with different numbers of output channels.
Finally, let's have a look at the parameters of these architectures.
| Layer Name | Output Size | Filter Size / Stride | Squeeze Layer Filters (1×1) | Expand Layer Filters (1×1) | Expand Layer Filters (3×3) |
|---|---|---|---|---|---|
| input | 224×224×3 | | | | |
| conv1 | 111×111×96 | 7×7, stride 2, 96 filters | | | |
| maxpool1 | 55×55×96 | 3×3, stride 2 | | | |
| fire2 | 55×55×128 | | 16 | 64 | 64 |
| fire3 | 55×55×128 | | 16 | 64 | 64 |
| fire4 | 55×55×256 | | 32 | 128 | 128 |
| maxpool2 | 27×27×256 | 3×3, stride 2 | | | |
| fire5 | 27×27×256 | | 32 | 128 | 128 |
| fire6 | 27×27×384 | | 48 | 192 | 192 |
| fire7 | 27×27×384 | | 48 | 192 | 192 |
| fire8 | 27×27×512 | | 64 | 256 | 256 |
| maxpool3 | 13×13×512 | 3×3, stride 2 | | | |
| fire9 | 13×13×512 | | 64 | 256 | 256 |
| conv10 | 13×13×1000 | 1×1, stride 1, 1000 filters | | | |
| avgpool10 | 1×1×1000 | 13×13, stride 1 | | | |
SqueezeNet – CODE
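Here is a minimal tf.keras sketch of the Fire module and the plain (no-bypass) SqueezeNet from the table above. The helper name `fire` is illustrative, and exact spatial sizes depend on padding choices (the paper reports 111×111 after conv1).

```python
# Sketch of SqueezeNet v1.0 with Fire modules (illustrative, not the post's original code).
from tensorflow.keras import layers, models

def fire(x, squeeze_filters, expand_filters):
    """Squeeze with 1x1 convs, then expand with parallel 1x1 and 3x3 convs, concatenated."""
    s = layers.Conv2D(squeeze_filters, 1, activation='relu', padding='same')(x)
    e1 = layers.Conv2D(expand_filters, 1, activation='relu', padding='same')(s)
    e3 = layers.Conv2D(expand_filters, 3, activation='relu', padding='same')(s)
    return layers.Concatenate()([e1, e3])  # output channels = 2 * expand_filters

inputs = layers.Input((224, 224, 3))
x = layers.Conv2D(96, 7, strides=2, activation='relu')(inputs)   # conv1
x = layers.MaxPooling2D(3, strides=2)(x)                         # maxpool1
for s, e in [(16, 64), (16, 64), (32, 128)]:                     # fire2-fire4
    x = fire(x, s, e)
x = layers.MaxPooling2D(3, strides=2)(x)                         # maxpool2
for s, e in [(32, 128), (48, 192), (48, 192), (64, 256)]:        # fire5-fire8
    x = fire(x, s, e)
x = layers.MaxPooling2D(3, strides=2)(x)                         # maxpool3
x = fire(x, 64, 256)                                             # fire9
x = layers.Conv2D(1000, 1, activation='relu')(x)                 # conv10
x = layers.GlobalAveragePooling2D()(x)                           # avgpool10
outputs = layers.Softmax()(x)
model = models.Model(inputs, outputs)
```

Note how there is no fully connected layer at all: the 1×1 conv10 plus global average pooling produce the 1000 class scores directly, which is part of why the model is so small.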
DenseNet – Huang et al
This paper won the Best Paper Award at CVPR 2017. It leveraged the idea of ResNets and built on it for a better flow of information between layers. This was achieved by proposing a connectivity pattern in which each layer is connected to all subsequent layers; that means the l-th layer receives the feature maps of all preceding layers (merged with a concatenation operation). Let's look at the architecture now!
DenseNet – Architecture
The DenseNet architecture comprises Dense blocks and Transition blocks:
Dense Block: A block of convolution layers such that every layer is connected (read concatenated) to every subsequent layer in the block.
Transition Block: A block where we downsample the information as we move from one Dense Block to another.
Quite a few things went into deciding the structure of the model:
- Composite functions: Every conv layer shown is a combination of three consecutive operations: Batch Normalization, ReLU, and Convolution.
- Growth Rate: If each layer produces k feature maps, then the l-th layer receives k_0 + k(l-1) feature maps as input (k_0 being the number of feature maps entering the block). For example, with k = 32 and k_0 = 64, the 5th layer of a block sees 64 + 32×4 = 192 input maps. An important difference here is that DenseNet can have very narrow layers; according to the authors, one explanation is that every layer already has access to the information of all preceding layers.
- Bottleneck layers: As we know, 1×1 convs can be used as a bottleneck to reduce the number of params going into the 3×3 convs. This is done inside the dense blocks to achieve computational efficiency. The authors used 4k filters for the 1×1 convs.
- Compression: To further improve the efficiency of the model, the authors reduced the number of feature maps at transition layers by a factor Θ (the reduction hyperparameter). This means that if a dense block outputs m channels, the transition block outputs Θm channels. To do this, the authors used a 1×1 conv layer before the pooling layer in the transition block.
Note: the authors used 2k filters in the initial convolution layer!
Let’s have a look at the complete architecture now:
| Layer Name | Output Size | DenseNet-121 | DenseNet-169 | DenseNet-201 |
|---|---|---|---|---|
| conv1 | 112×112 | 7×7 conv, stride 2 | 7×7 conv, stride 2 | 7×7 conv, stride 2 |
| maxpool | 56×56 | 3×3 max pool, stride 2 | 3×3 max pool, stride 2 | 3×3 max pool, stride 2 |
| Dense Block 1 | 56×56 | [1×1 conv → 3×3 conv] × 6 | [1×1 conv → 3×3 conv] × 6 | [1×1 conv → 3×3 conv] × 6 |
| Transition Layer 1 | 28×28 | 1×1 conv, then 2×2 avg pool, stride 2 | 1×1 conv, then 2×2 avg pool, stride 2 | 1×1 conv, then 2×2 avg pool, stride 2 |
| Dense Block 2 | 28×28 | [1×1 conv → 3×3 conv] × 12 | [1×1 conv → 3×3 conv] × 12 | [1×1 conv → 3×3 conv] × 12 |
| Transition Layer 2 | 14×14 | 1×1 conv, then 2×2 avg pool, stride 2 | 1×1 conv, then 2×2 avg pool, stride 2 | 1×1 conv, then 2×2 avg pool, stride 2 |
| Dense Block 3 | 14×14 | [1×1 conv → 3×3 conv] × 24 | [1×1 conv → 3×3 conv] × 32 | [1×1 conv → 3×3 conv] × 48 |
| Transition Layer 3 | 7×7 | 1×1 conv, then 2×2 avg pool, stride 2 | 1×1 conv, then 2×2 avg pool, stride 2 | 1×1 conv, then 2×2 avg pool, stride 2 |
| Dense Block 4 | 7×7 | [1×1 conv → 3×3 conv] × 16 | [1×1 conv → 3×3 conv] × 32 | [1×1 conv → 3×3 conv] × 32 |
| Classification | 1×1 | Global Average Pooling, Fully Connected (units=1000), Softmax | Global Average Pooling, Fully Connected (units=1000), Softmax | Global Average Pooling, Fully Connected (units=1000), Softmax |
DenseNet – CODE
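To close the section, here is a minimal tf.keras sketch of a dense block and a transition layer as described above. The names `conv_block`, `dense_block`, and `transition` are ones I'm introducing for illustration.

```python
# Sketch of DenseNet's dense block and transition layer (illustrative, not the post's original code).
from tensorflow.keras import layers

def conv_block(x, growth_rate):
    """Composite function twice: BN -> ReLU -> 1x1 conv (4k bottleneck) -> BN -> ReLU -> 3x3 conv (k)."""
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(4 * growth_rate, 1, use_bias=False)(y)  # bottleneck: 4k filters
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(growth_rate, 3, padding='same', use_bias=False)(y)
    return layers.Concatenate()([x, y])  # dense connectivity: concatenate, don't add

def dense_block(x, num_layers, growth_rate=32):
    """Each layer adds k feature maps, so the l-th layer sees k_0 + k(l-1) inputs."""
    for _ in range(num_layers):
        x = conv_block(x, growth_rate)
    return x

def transition(x, theta=0.5):
    """Compress channels by the factor theta with a 1x1 conv, then downsample."""
    channels = int(x.shape[-1] * theta)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(channels, 1, use_bias=False)(x)
    return layers.AveragePooling2D(2, strides=2)(x)
```

Stacking `dense_block` calls with 6, 12, 24, and 16 layers, separated by `transition`, would give you the DenseNet-121 column of the table above.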
So, we have reached the end of this post. It certainly was fun to write this one. I wanted to include one more CNN architecture but decided not to; that architecture is so special that it will get a post of its own. The NASNets are coming soon. 😉