Famous Convolutional Neural Network Architectures – #2

It has certainly been a while since I last posted. Anyway, let's get this thing rolling! In the last post, we went over some variants of the convolution operation. That prepared us to dig into some of the more advanced and efficient Convolutional Neural Network (CNN) architectures!

Let's go over some of the powerful convolutional neural networks that are having a big impact on the computer vision industry today.

INDEX

If you are here for a particular architecture, click its link to jump straight to that topic.

  1. MobileNets
  2. ResNeXt
  3. SqueezeNet
  4. DenseNet

MobileNets – Howard et al

MobileNets are a class of efficient models built on depthwise separable convolutions. Factorizing a standard convolution into a depthwise step and a pointwise step cuts the number of parameters needed for the convolution operation, which in turn shrinks the size of the model!
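To make the savings concrete, here is a quick back-of-the-envelope parameter count (a minimal sketch; the 32-in / 64-out channel sizes are purely illustrative, not taken from the post):

```python
# Parameter count of a standard 3x3 conv vs. a depthwise separable conv.
# Illustrative sizes (assumption): 32 input channels, 64 output channels, 3x3 kernel.
in_ch, out_ch, k = 32, 64, 3

standard = k * k * in_ch * out_ch   # 3*3*32*64 = 18,432 weights
depthwise = k * k * in_ch           # one 3x3 filter per input channel = 288
pointwise = in_ch * out_ch          # 1x1 conv that mixes channels = 2,048
separable = depthwise + pointwise   # 2,336 weights, roughly 8x fewer

print(standard, separable, round(standard / separable, 1))
```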

Google researchers created this class of CNN architectures to make deep learning models accessible to smaller, less powerful devices like your smartphone. Let's have a look at the architecture!

MobileNet – Architecture

MobileNet

The architecture follows an easy-to-replicate pattern!

  • Every conv layer that is not a pointwise (1×1) convolution uses a 3×3 filter.
  • A stride of 2 is used in intermediate depthwise convolutions for downsampling.
  • A global average pooling layer flattens the output before the classifier.

So, the architecture is:

| Layer Type | Params | Input Size |
|---|---|---|
| Conv | 3×3 filter size, 32 filters, stride 2 | 224×224×3 |
| DW Conv | 3×3 filter size, 32 filters, stride 1 | 112×112×32 |
| Conv | 1×1 filter size, 64 filters, stride 1 | 112×112×32 |
| DW Conv | 3×3 filter size, 64 filters, stride 2 | 112×112×64 |
| Conv | 1×1 filter size, 128 filters, stride 1 | 56×56×64 |
| DW Conv | 3×3 filter size, 128 filters, stride 1 | 56×56×128 |
| Conv | 1×1 filter size, 128 filters, stride 1 | 56×56×128 |
| DW Conv | 3×3 filter size, 128 filters, stride 2 | 56×56×128 |
| Conv | 1×1 filter size, 256 filters, stride 1 | 28×28×128 |
| DW Conv | 3×3 filter size, 256 filters, stride 1 | 28×28×256 |
| Conv | 1×1 filter size, 256 filters, stride 1 | 28×28×256 |
| DW Conv | 3×3 filter size, 256 filters, stride 2 | 28×28×256 |
| Conv | 1×1 filter size, 512 filters, stride 1 | 14×14×256 |
| 5× DW Conv | 3×3 filter size, 512 filters, stride 1 | 14×14×512 |
| 5× Conv | 1×1 filter size, 512 filters, stride 1 | 14×14×512 |
| DW Conv | 3×3 filter size, 512 filters, stride 2 | 14×14×512 |
| Conv | 1×1 filter size, 1024 filters, stride 1 | 7×7×512 |
| DW Conv | 3×3 filter size, 1024 filters, stride 1 | 7×7×1024 |
| Conv | 1×1 filter size, 1024 filters, stride 1 | 7×7×1024 |
| Avg Pool | 7×7 filter size | 7×7×1024 |
| FC | 1024×1000 | 1×1×1024 |
| Softmax | Classifier | 1×1×1000 |

MobileNet – CODE
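In place of the original code listing, here is a minimal PyTorch sketch of the depthwise separable building block used throughout the table above (a depthwise 3×3 conv followed by a pointwise 1×1 conv, each with BatchNorm and ReLU). The class name and the shape check at the end are my own illustration, not the post's reference implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    """Depthwise 3x3 conv followed by pointwise 1x1 conv, each with BN + ReLU."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            # depthwise: one 3x3 filter per input channel (groups=in_ch)
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                      groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            # pointwise: 1x1 conv that mixes channels and sets the output width
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Quick shape check on one stage of the table: 112x112x32 -> 112x112x64
x = torch.randn(1, 32, 112, 112)
print(DepthwiseSeparable(32, 64)(x).shape)  # torch.Size([1, 64, 112, 112])
```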

ResNeXt – Xie et al

ResNeXt is an extension of Deep Residual Networks inspired by the split-transform-merge strategy used in Inception blocks. Rather than applying a single transformation to the incoming feature map inside a residual block, the block splits into several parallel branches (the number of branches is the cardinality of the network), applies the same kind of transformation in each branch, aggregates the branch outputs, and then adds the residual (shortcut) connection on top!

Four things were discovered:

  1. Grouped convolutions lead to specialization. Each group focuses on a different attribute of the input image, so each branch in a block ends up attending to its own aspect of the image.
  2. The experimental results showed that increasing cardinality is more effective at improving model performance than increasing the depth or width of the network!
  3. Residual connections are a crucial part of optimization!
  4. Aggregated transformations give strong representations!

ResNeXt – Architecture

The ResNeXt architecture can be thought of as ResNet in disguise. A ResNeXt block has several branches, each of which is a simple feature extractor made of three stacked convolution layers that act as a bottleneck. In the equivalent grouped formulation, the second convolution is a grouped convolution: the input channels are split into groups and each group gets its own set of filters (depthwise convolution is the special case where the number of groups equals the number of channels). Also keep in mind that every convolution is followed by batch normalization and a ReLU. The complete architecture is shown in the figure and table below.

ResNeXt-block
| Layer Name | Output Size | ResNeXt-50 (32×4d) |
|---|---|---|
| conv1 | 112×112 | 7×7, 64, stride 2 |
| Maxpool | 56×56 | 3×3 max pool, stride 2 |
| conv2_x | 56×56 | [1×1, 128; 3×3, 128, C=32; 1×1, 256] × 3 |
| conv3_x | 28×28 | [1×1, 256; 3×3, 256, C=32; 1×1, 512] × 4 |
| conv4_x | 14×14 | [1×1, 512; 3×3, 512, C=32; 1×1, 1024] × 6 |
| conv5_x | 7×7 | [1×1, 1024; 3×3, 1024, C=32; 1×1, 2048] × 3 |
| Classifier | 1×1 | Global average pooling, fully connected (units=1000), softmax |

ResNeXt – CODE
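The post's own code isn't reproduced here, so below is a minimal PyTorch sketch of a ResNeXt bottleneck block in its grouped-convolution form (1×1 reduce, 3×3 grouped conv with cardinality C, 1×1 expand, plus the residual connection). The channel widths in the test follow the conv2_x row of the table; the class name and the projection-shortcut details are my own assumptions.

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Bottleneck: 1x1 -> grouped 3x3 (cardinality C) -> 1x1, plus a residual add."""
    def __init__(self, in_ch, mid_ch, out_ch, cardinality=32, stride=1):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            # grouped conv: channels are split into `cardinality` groups,
            # each with its own filters; these act as the parallel branches
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # projection shortcut when the shape changes, identity otherwise
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.transform(x) + self.shortcut(x))

# One conv2_x block from the table: 64 -> [1x1, 128; 3x3, 128, C=32; 1x1, 256]
x = torch.randn(1, 64, 56, 56)
print(ResNeXtBlock(64, 128, 256)(x).shape)  # torch.Size([1, 256, 56, 56])
```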

SqueezeNet – Iandola et al

Want a memory-efficient deep neural network that can run on embedded devices? This might be your go-to network! Let me tell you a secret: SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50 times fewer parameters. Interesting, isn't it?

The following three main strategies were employed to do so (a quick parameter count after this list makes strategies 1 and 2 concrete):

  1. Replace 3×3 filters with 1×1 filters. A 1×1 conv has 9 times fewer parameters than a 3×3 conv.
  2. Decrease the number of input channels to the 3×3 filters. The number of params in a 3×3 conv layer = (no. of input channels) × (no. of output channels) × 3 × 3, so reducing the number of input channels reduces the number of params!
  3. Downsample late in the network so that the convolution layers work on large activation maps. The paper "Convolutional neural networks at constrained time cost" by He and Sun showed that delayed downsampling leads to higher accuracy, which gave the authors the intuition to downsample late and keep the activation maps large.
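Here is that quick count (a small sketch; the 64- and 16-channel sizes are illustrative only):

```python
# Strategy 1: a 1x1 filter has 9x fewer weights than a 3x3 filter.
# Strategy 2: params of a 3x3 conv scale linearly with the input channel count.
def conv_params(in_ch, out_ch, k):
    return in_ch * out_ch * k * k

print(conv_params(64, 64, 3))  # 36,864 weights for a 3x3 conv
print(conv_params(64, 64, 1))  #  4,096 weights for a 1x1 conv (9x fewer)
# Squeezing the input from 64 down to 16 channels before the 3x3 conv:
print(conv_params(16, 64, 3))  #  9,216 weights (4x fewer than the original 3x3)
```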

SqueezeNet – Architecture

The SqueezeNet architecture is built from what the authors call a "Fire module". Sure is a godly name – "Lord of Light". (Okay, okay, no more Game of Thrones references, Piyush!) Let's move on.

So, what is a fire module? It is a fancy name for a bottleneck layer.

The fire module contains two things:

  1. A squeeze layer that, as the name suggests, squeezes (reduces) the number of channels!
  2. An expand layer that, again as the name suggests, expands (increases) the number of channels!

The expand layer of the fire module contains two types of convolutions:

  • A 1×1 convolution inspired from strategy 1 discussed above.
  • A 3×3 convolution.

The number of filters in the squeeze layer is kept smaller than the sum of the number of filters in the 1×1 and 3×3 expand layers. This is inspired by strategy 2 discussed above.

Now, let’s have a look at the complete architecture. The complete architecture has 3 forms:

  • A simple SqueezeNet with no shortcut connections. (The leftmost one)
  • A SqueezeNet with "simple bypass" connections. These connections are placed between layers that have the same number of output channels. (The middle one)
  • A SqueezeNet with "complex bypass" connections. These connections use 1×1 convolutions to add shortcuts between layers with different numbers of output channels. (The rightmost one)
squeezenet

Finally, let's have a look at the parameters of these architectures.

| Layer Name | Output Size | Filter Size / Stride | Squeeze layer (1×1) filters | Expand layer (1×1) filters | Expand layer (3×3) filters |
|---|---|---|---|---|---|
| Input | 224×224×3 | | | | |
| conv1 | 111×111×96 | 7×7, stride 2, 96 filters | | | |
| maxpool1 | 55×55×96 | 3×3, stride 2 | | | |
| fire2 | 55×55×128 | | 16 | 64 | 64 |
| fire3 | 55×55×128 | | 16 | 64 | 64 |
| fire4 | 55×55×256 | | 32 | 128 | 128 |
| maxpool2 | 27×27×256 | 3×3, stride 2 | | | |
| fire5 | 27×27×256 | | 32 | 128 | 128 |
| fire6 | 27×27×384 | | 48 | 192 | 192 |
| fire7 | 27×27×384 | | 48 | 192 | 192 |
| fire8 | 27×27×512 | | 64 | 256 | 256 |
| maxpool3 | 13×13×512 | 3×3, stride 2 | | | |
| fire9 | 13×13×512 | | 64 | 256 | 256 |
| conv10 | 13×13×1000 | 1×1, stride 1, 1000 filters | | | |
| avgpool10 | 1×1×1000 | 13×13, stride 1 | | | |

SqueezeNet – CODE
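In place of the original code listing, here is a minimal PyTorch sketch of the Fire module (a 1×1 squeeze conv followed by parallel 1×1 and 3×3 expand convs whose outputs are concatenated along the channel axis). The filter counts in the test match the fire2 row of the table; the class name is my own.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Fire module: 1x1 squeeze conv, then parallel 1x1 and 3x3 expand convs, concatenated."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Sequential(
            nn.Conv2d(in_ch, squeeze_ch, 1), nn.ReLU(inplace=True))
        self.expand1x1 = nn.Sequential(
            nn.Conv2d(squeeze_ch, expand1x1_ch, 1), nn.ReLU(inplace=True))
        self.expand3x3 = nn.Sequential(
            nn.Conv2d(squeeze_ch, expand3x3_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        s = self.squeeze(x)
        # concatenate the two expand branches along the channel dimension
        return torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)

# fire2 from the table: 96 input channels -> squeeze to 16 -> expand to 64 + 64 = 128
x = torch.randn(1, 96, 55, 55)
print(Fire(96, 16, 64, 64)(x).shape)  # torch.Size([1, 128, 55, 55])
```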

DenseNet – Huang et al

This paper won the Best Paper Award at CVPR 2017. It leveraged the idea behind ResNets and built on it for better flow of information between layers. This was achieved by proposing a connectivity pattern in which each layer is connected to all subsequent layers, which means the l-th layer receives the feature maps of all preceding layers (merged with a concatenation operation). Let's look at the architecture now!

DenseNet – Architecture

The DenseNet architecture comprises Dense blocks and Transition blocks:

Dense Block: A block of convolution layers such that every layer is connected (read concatenated) to every subsequent layer in the block.

Transition Block: A block where we downsample the information as we move from one Dense Block to another.

Quite a few things went into deciding the structure of the model:

  1. Composite functions: Every conv layer shown is a combination of three consecutive operations: Batch Normalization, ReLU and Convolution.
  2. Growth rate: If each layer produces k feature maps, then the l-th layer receives k0 + k(l-1) feature maps as input (k0 being the number of input feature maps of the block). For example, with k0 = 64 and k = 32, the 6th layer of a block sees 64 + 32×5 = 224 input maps while producing only 32 new ones. An important difference here is that DenseNet can have very narrow layers; according to the authors, one explanation is that every layer already has access to the feature maps of all preceding layers.
  3. Bottleneck layers: As we know, 1×1 convs can be used as a bottleneck to reduce the number of inputs to (and hence the parameters of) the 3×3 convs. This is done inside the dense blocks for computational efficiency; the authors used 4k filters for each 1×1 bottleneck conv.
  4. Compression: To further improve the efficiency of the model, the authors reduce the number of feature maps at the transition layers by a factor Θ (a compression hyperparameter with 0 < Θ ≤ 1). If a dense block outputs m channels, the following transition block outputs Θm channels. This is done with a 1×1 conv layer placed before the pooling layer in the transition block.

Note: Authors used 2k filters in the initial convolution layer!

Let’s have a look at the complete architecture now:

| Layer Name | Output Size | DenseNet-121 | DenseNet-169 | DenseNet-201 |
|---|---|---|---|---|
| conv1 | 112×112 | 7×7 conv, 64, stride 2 | 7×7 conv, 64, stride 2 | 7×7 conv, 64, stride 2 |
| Maxpool | 56×56 | 3×3 max pool, stride 2 | 3×3 max pool, stride 2 | 3×3 max pool, stride 2 |
| Dense Block 1 | 56×56 | [1×1 conv; 3×3 conv] × 6 | [1×1 conv; 3×3 conv] × 6 | [1×1 conv; 3×3 conv] × 6 |
| Transition layer 1 | 56×56 | 1×1 conv | 1×1 conv | 1×1 conv |
| | 28×28 | 2×2 average pool, stride 2 | 2×2 average pool, stride 2 | 2×2 average pool, stride 2 |
| Dense Block 2 | 28×28 | [1×1 conv; 3×3 conv] × 12 | [1×1 conv; 3×3 conv] × 12 | [1×1 conv; 3×3 conv] × 12 |
| Transition layer 2 | 28×28 | 1×1 conv | 1×1 conv | 1×1 conv |
| | 14×14 | 2×2 average pool, stride 2 | 2×2 average pool, stride 2 | 2×2 average pool, stride 2 |
| Dense Block 3 | 14×14 | [1×1 conv; 3×3 conv] × 24 | [1×1 conv; 3×3 conv] × 32 | [1×1 conv; 3×3 conv] × 48 |
| Transition layer 3 | 14×14 | 1×1 conv | 1×1 conv | 1×1 conv |
| | 7×7 | 2×2 average pool, stride 2 | 2×2 average pool, stride 2 | 2×2 average pool, stride 2 |
| Dense Block 4 | 7×7 | [1×1 conv; 3×3 conv] × 16 | [1×1 conv; 3×3 conv] × 32 | [1×1 conv; 3×3 conv] × 32 |
| Classification | 1×1 | Global average pooling, fully connected (units=1000), softmax | Global average pooling, fully connected (units=1000), softmax | Global average pooling, fully connected (units=1000), softmax |

DenseNet – CODE
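In place of the original code listing, here is a minimal PyTorch sketch of a bottlenecked dense layer (BN, ReLU, 1×1 conv with 4k filters, BN, ReLU, 3×3 conv producing k maps), the concatenation pattern of a dense block, and a transition layer with compression Θ. The helper names and the shape check are my own; the structure follows the points above.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """BN-ReLU-Conv(1x1, 4k) bottleneck followed by BN-ReLU-Conv(3x3, k)."""
    def __init__(self, in_ch, k):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 4 * k, 1, bias=False),
            nn.BatchNorm2d(4 * k), nn.ReLU(inplace=True),
            nn.Conv2d(4 * k, k, 3, padding=1, bias=False),
        )

    def forward(self, x):
        # concatenate the k new feature maps onto everything produced so far
        return torch.cat([x, self.net(x)], dim=1)

def dense_block(in_ch, k, num_layers):
    """The l-th layer sees in_ch + k*(l-1) maps; the block outputs in_ch + k*num_layers."""
    return nn.Sequential(*[DenseLayer(in_ch + i * k, k) for i in range(num_layers)])

def transition(in_ch, theta=0.5):
    """Compress channels by a factor theta with a 1x1 conv, then 2x2 average pool."""
    out_ch = int(in_ch * theta)
    return nn.Sequential(
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.AvgPool2d(2, stride=2),
    )

# Dense Block 1 of DenseNet-121: 64 input maps, k=32, 6 layers -> 64 + 6*32 = 256 maps
x = torch.randn(1, 64, 56, 56)
y = dense_block(64, 32, 6)(x)
print(y.shape)                   # torch.Size([1, 256, 56, 56])
print(transition(256)(y).shape)  # torch.Size([1, 128, 28, 28])
```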

So, we've reached the end of this post. It certainly was fun to write this one. I still wanted to include one more CNN architecture but decided not to; it is so special that it will get a post of its own. The NASNets are coming soon. 😉
