Convolution and Variants
In the last post, we discussed some famous Convolutional Neural Networks. But that was Part 1 of the Famous Convolutional Neural Network Architectures series. Before going to Part 2, we need to go over some interesting variants of Convolution Operations! So let’s start.
All the animations used here are present in high resolution in this repo.
In one of my previous posts, we studied the basics of convolutions in detail. Let’s recap the gist here! We take a small volume of learnable numbers (called a kernel) whose height and width are smaller than (sometimes equal to) those of the input volume, and whose depth matches the number of input channels. When applied to the input volume, this kernel performs the following operations:
- Start from the top left corner of the input volume.
- Take a slice of the input volume at the current position, matching the kernel’s field of view.
- Multiply the slice and the kernel element-wise.
- Sum the result over every axis, including the one representing the channels (in case of an image, the axis holding the RGB channels), to get a single output value.
- (if possible) Stride `s` steps to the right from the current position.
- (else) Stride `s` steps down from the current position and restart from the leftmost column.
- Repeat steps 2 to 6 until the kernel cannot move any further.
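The steps above can be sketched in a few lines of NumPy. This is a minimal single-kernel sketch for illustration (the function name `conv2d` and the loop-based layout are mine, not from any library):

```python
import numpy as np

def conv2d(x, k, s=1):
    """Naive convolution of one kernel over an input volume.
    x: input volume (H, W, C); k: kernel (kh, kw, C); s: stride."""
    H, W, C = x.shape
    kh, kw, _ = k.shape
    out_h = (H - kh) // s + 1
    out_w = (W - kw) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # take a slice matching the kernel's field of view
            patch = x[i * s:i * s + kh, j * s:j * s + kw, :]
            # element-wise multiply, then sum over all axes (including channels)
            out[i, j] = np.sum(patch * k)
    return out

x = np.random.rand(6, 6, 3)
k = np.random.rand(3, 3, 3)
print(conv2d(x, k).shape)  # (4, 4)
```

Note that a real convolution layer holds many such kernels, and their outputs are stacked along the channel axis.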
The animation below shows how one of the many kernels in a convolution layer works: a 3×3 kernel with 3 channels convolving over the input volume.
1×1 CONVOLUTIONS (a.k.a Network in Network)
Imagine having a multi-layer perceptron embedded inside your convolutional network without the need to flatten the image! But how does this help us? Let’s consider the benefits of a 1×1 conv:
- Reducing or increasing the number of channels in the input volume. This is helpful when we need to branch the network or create a depth-based bottleneck. Bottlenecks usually force models to find more meaningful representations.
- Can be used to reduce the number of weights in a network block:
- Consider having an input volume of 128×128 with 256 channels. If we apply 512 kernels of 3×3 size to this input volume, then the number of kernel weights required equals 3x3x256x512 = 1,179,648 (not considering bias).
- Now let’s add a 1×1 convolution with 64 kernels between the input volume and the convolution layer with 512 kernels of 3×3. The weights in the 1×1 conv layer come to 1×1×256×64 = 16,384 and the weights in the 3×3 conv layer become 3×3×64×512 = 294,912. Hence, the total number of weights in this new setting is 311,296. We just cut the number of weights by almost a factor of 4.
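The arithmetic from the example above can be checked directly (biases ignored, as in the text):

```python
# Direct 3×3 conv: 256 input channels -> 512 output channels.
direct = 3 * 3 * 256 * 512                # 1,179,648 weights

# Bottleneck: 1×1 conv (256 -> 64), then 3×3 conv (64 -> 512).
bottleneck = (1 * 1 * 256 * 64            # 16,384 weights
              + 3 * 3 * 64 * 512)         # 294,912 weights

print(direct, bottleneck, direct / bottleneck)  # ratio ≈ 3.79
```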
The animation below illustrates the working of 1×1 convs aptly.
DEPTH-WISE SEPARABLE CONVOLUTION
Depth-wise Separable Convolutions are two-step convolutions that came into existence to address two main drawbacks of standard convolutions:
- their computational complexity
- the number of parameters they require
The two steps involved in separable convolutions are:
- Depth-wise Convolution
- Point-wise Convolution
DEPTH-WISE CONVOLUTION
As the name suggests, we apply the kernels along the depth of the input volume, one kernel per input channel. The steps followed in this convolution are:
- Take as many kernels as there are input channels, each kernel having depth 1. For example, with a 3×3 kernel size and a 6×6 input with 16 channels, there will be 16 kernels of size 3×3×1.
- Every channel thus has one kernel associated with it. Each kernel is convolved over its channel separately, resulting in 16 feature maps.
- Stack all these feature maps to get the output volume, here of size 4×4 with 16 channels.
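The depth-wise step can be sketched as follows. This is a minimal NumPy illustration (the function name `depthwise_conv` is mine), using the 6×6×16 example from the steps above:

```python
import numpy as np

def depthwise_conv(x, kernels):
    """Depth-wise step: one single-channel kernel per input channel.
    x: (H, W, C); kernels: (kh, kw, C) where kernel c convolves channel c only."""
    H, W, C = x.shape
    kh, kw, _ = kernels.shape
    out = np.zeros((H - kh + 1, W - kw + 1, C))
    for c in range(C):                        # each channel handled separately
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, c] = np.sum(x[i:i + kh, j:j + kw, c] * kernels[:, :, c])
    return out

x = np.random.rand(6, 6, 16)
k = np.random.rand(3, 3, 16)
print(depthwise_conv(x, k).shape)  # (4, 4, 16)
```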
POINT-WISE CONVOLUTION
Again, as the name suggests, this type of convolution is applied at every single point of the feature map separately (remember 1×1 convs?). So how does this work?
- Take a 1×1 conv with number of filters equal to number of channels you want as output.
- Apply this 1×1 convolution, exactly as described earlier, to the output of the depth-wise convolution.
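The point-wise step is just a channel-mixing matrix multiplication at every spatial position. A minimal sketch (the function name `pointwise_conv` is mine), continuing the 4×4×16 output from the depth-wise step and producing 32 output channels:

```python
import numpy as np

def pointwise_conv(x, w):
    """Point-wise step: a 1×1 convolution mixing channels at each position.
    x: (H, W, C_in); w: (C_in, C_out)."""
    # contract the channel axis of x against the first axis of w
    return np.tensordot(x, w, axes=([2], [0]))  # shape (H, W, C_out)

x = np.random.rand(4, 4, 16)
w = np.random.rand(16, 32)
print(pointwise_conv(x, w).shape)  # (4, 4, 32)
```

For this 16-in, 32-out example, the separable pair needs 3×3×16 + 16×32 = 656 weights, versus 3×3×16×32 = 4,608 for a standard convolution, which is where the parameter savings come from.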
Well, too fond of animation? See the 1×1 conv animation again \_( ^ _ ^ )_/
Try Depth-wise Separable Conv yourself using this excel sheet I curated.
TRANSPOSED CONVOLUTION
In upcoming posts, we will need to upsample tensors that were downsampled by convolutional or pooling layers, for instance when defining image segmentation or generative models. Many well-known models of these kinds use transposed convolutions extensively.
Side note: some materials use the name deconvolution where transposed convolution is meant, and I may do so in future posts as well. In mathematical terms, transposed convolution and deconvolution are two different operations, but we will occasionally refer to the transposed convolution (operation) as deconvolution.
Moving on, let’s understand how transposed convolutions work:
- Take the input volume and insert zeros between its elements (one zero at each alternate position, for a stride of 2). This gives us a new, larger volume.
- Apply a basic convolution with the kernels on this new volume.
The below animation describes it pretty well:
Try Transposed Convolution yourself using this excel sheet I curated.
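The two steps can be sketched in NumPy. This is a minimal single-channel illustration for stride 2 (the function name `transposed_conv` and the "full" border padding are my choices, not from any library):

```python
import numpy as np

def transposed_conv(x, k):
    """Transposed-convolution sketch for stride 2: insert zeros between
    input elements, then run an ordinary convolution over the result.
    x: input (H, W); k: kernel (kh, kw)."""
    H, W = x.shape
    kh, kw = k.shape
    # step 1: dilate the input with zeros at each alternate position
    up = np.zeros((2 * H - 1, 2 * W - 1))
    up[::2, ::2] = x
    # pad the border so the kernel can cover the edge elements
    up = np.pad(up, kh - 1)
    # step 2: apply a basic convolution on the new volume
    out = np.zeros((up.shape[0] - kh + 1, up.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(up[i:i + kh, j:j + kw] * k)
    return out

x = np.ones((2, 2))
k = np.ones((3, 3))
print(transposed_conv(x, k).shape)  # (5, 5): a 2×2 input is upsampled to 5×5
```

Note how a 3×3 kernel with stride 2 maps a 2×2 input to a 5×5 output, matching the usual transposed-convolution size formula (H − 1) × stride + kernel.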
There are many more variants of convolution, but these basic ones are generally used much more often than the others!
Let’s clarify each other’s doubts in the comments below 🙂