Saturday, March 26, 2022

Ruminating on Convolutional Neural Networks

Convolutional Neural Nets (CNNs) have made Computer Vision a reality. But to understand CNNs, we need to get the basics right - what exactly is a convolution? What is a kernel/filter?

A kernel or filter is a small matrix of weights that is multiplied with patches of the source image matrix to extract features. So you can have a kernel that picks out the edges or corners of a photo, and other kernels that detect higher-level patterns - e.g. eyes or stripes.
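To make this concrete, here is a tiny NumPy sketch (my own illustration, not taken from any particular library) of two classic hand-designed kernels - a Sobel kernel that responds to vertical edges, and a box-blur kernel of the kind photo editors use for smoothing:

import numpy as np

# A 3x3 Sobel kernel: responds strongly where pixel intensity
# changes sharply from left to right, i.e. at vertical edges.
sobel_vertical = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
], dtype=np.float32)

# A 3x3 box-blur kernel: replaces each pixel with the average of
# its 3x3 neighbourhood, smoothing out fine detail.
box_blur = np.full((3, 3), 1.0 / 9.0, dtype=np.float32)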

A convolution is a mathematical operation in which a kernel (aka filter) slides across the input image and, at each position, takes the dot product of the kernel and the image patch underneath it. These dot products are collected into a new matrix called the feature map. An excellent video visualizing this operation is available here - https://youtu.be/pj9-rr1wDhM
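As a rough sketch of what "slide and take a dot product" means, here is a naive NumPy implementation (stride 1, no padding); real frameworks compute the same thing with heavily optimized code:

import numpy as np

def convolve2d(image, kernel):
    """Naive 2D convolution: slide the kernel over the image and take a
    dot product at each position. (Strictly this is cross-correlation,
    which is what deep learning frameworks call convolution.)"""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return feature_map

# A tiny 6x6 "image" with a bright vertical stripe in the middle.
image = np.zeros((6, 6), dtype=np.float32)
image[:, 2:4] = 1.0
sobel_vertical = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
print(convolve2d(image, sobel_vertical))  # strong responses at the stripe's edges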

Image manipulation software such as Photoshop also uses kernels for effects such as 'blur background'.

One fundamental advantage of the convolution operation is that if a particular filter is designed to detect a specific type of feature in the input, then applying that filter systematically across the entire input image allows us to discover that feature anywhere in the image. Also note that a single convolutional layer can have multiple kernels/filters. So if the input layer is a single matrix and the convolutional layer has 6 filters, it will produce 6 output matrices - one feature map per filter. A cool application to visualize this is here - https://www.cs.ryerson.ca/~aharley/vis/conv/flat.html
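Here is a quick PyTorch sketch of that "1 input matrix in, 6 feature maps out" behaviour (the 28x28 input size and 3x3 kernel size are just illustrative choices):

import torch
import torch.nn as nn

# One grayscale 28x28 image: batch of 1, 1 channel - as in MNIST.
x = torch.randn(1, 1, 28, 28)

# A convolutional layer with 6 filters (out_channels=6), each 3x3.
conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=3)

feature_maps = conv(x)
print(feature_maps.shape)  # torch.Size([1, 6, 26, 26]) - one 26x26 feature map per filter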

A suite of tens or even hundreds of other small filters can be designed to detect other features in the image. After a convolutional layer, we also typically add a pooling layer. Pooling layers downsize the feature maps - keeping the important parts and discarding the rest - so the output matrices are smaller and faster to process.
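Continuing the sketch, a 2x2 max-pooling layer keeps the strongest activation in each 2x2 block and so halves the height and width of every feature map (the shapes below assume the 6 feature maps from the previous example):

import torch
import torch.nn as nn

# Six 26x26 feature maps from the convolutional layer above.
feature_maps = torch.randn(1, 6, 26, 26)

pool = nn.MaxPool2d(kernel_size=2)
pooled = pool(feature_maps)
print(pooled.shape)  # torch.Size([1, 6, 13, 13]) - half the height and width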

So as you can see, the fundamental difference between a densely connected layer and a convolutional layer is that dense, fully connected layers learn global patterns (involving all pixels), whereas convolutional layers learn local features (edges, corners, textures, etc.).
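One way to see this difference is to compare parameter counts: a dense layer needs a weight for every pixel, while a convolutional layer reuses a small set of weights at every position (the layer sizes below are arbitrary, chosen only for illustration):

import torch.nn as nn

def num_params(layer):
    return sum(p.numel() for p in layer.parameters())

# A dense layer connecting every pixel of a 28x28 image to 64 units
# learns global patterns over all 784 pixels at once.
dense = nn.Linear(28 * 28, 64)

# A convolutional layer with 64 filters of size 3x3 learns local
# patterns and shares the same 3x3 weights across the whole image.
conv = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3)

print(num_params(dense))  # 50240 (784*64 weights + 64 biases)
print(num_params(conv))   # 640 (3*3*64 weights + 64 biases)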

Using CNNs, we can create a hierarchy of patterns - i.e. the second layer learns larger patterns from the features detected by the first layer. CNNs are also reusable: we can take an image classification model trained on the https://www.image-net.org/ dataset and add additional layers to customize it for our own purpose.
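As a sketch of that reuse idea, one common pattern with torchvision (assuming a recent version; the ResNet-18 backbone and the 10-class head are just placeholder choices) looks like this:

import torch.nn as nn
from torchvision import models

# Load a model pre-trained on the ImageNet dataset.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained layers so their learned features stay fixed.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a new one sized for
# our own task - here, an assumed 10-class classification problem.
model.fc = nn.Linear(model.fc.in_features, 10)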

A good introduction to CNN models is given in this article - https://towardsdatascience.com/convolution-neural-networks-a-beginners-guide-implementing-a-mnist-hand-written-digit-8aa60330d022 - with a good PyTorch implementation for the MNIST dataset here - https://towardsdatascience.com/mnist-handwritten-digits-classification-using-a-convolutional-neural-network-cnn-af5fafbc35e9
