AI, Data Science, Machine Learning, Neural Networks, Deep Learning … buzzword, buzzword & even more buzzwords.
But do you know what a convolutional neural network (CNN) is? Yes, no, kinda?
Well, if someone told you that they used an API or created an algorithm that can recognise cute little kitties with very high accuracy, then there is a big chance that a convolutional neural network was involved.
But what is a convolution neural network?
A convolutional neural network is exactly what you would expect. It is a deep neural network containing convolutional layers, pooling layers and several fully connected layers on top, which …
(too technical, so let’s try this explanation again)
A convolutional neural network is an algorithm that can be used to detect objects in images.
Basically, there are 2 or more 🙂 methods to detect objects in images.
A simple approach
In the figure below, you can see a very simple (but sometimes effective) approach to detecting objects in images. The goal is to detect an object by sliding an object-image over your image and searching for the position where the pixel subtraction of the two windows is minimal.
But this is not a very robust approach. E.g. what will happen if the object you want to detect is a little bit smaller, bigger or even rotated? And what about the image quality? What if the image colours do not match well (shadows, lighting) or what happens when you have a blurry image …
Also, how do you calculate the distance between your object and the potential objects in your image? E.g. in the above figure we've calculated the number of pixels that don't match and, as you will notice, we have 3 possible locations in our image. But if we take into account that this image contains some distorted pixels, then we might conclude that our algorithm is not able to find the correct position of the object. And what about the threshold?
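The simple sliding-window approach above can be sketched in a few lines of plain NumPy. This is a minimal toy version with made-up values: at every position we count the pixels of the image window that differ from the object-image, and the position with the fewest mismatches is our best candidate.

```python
import numpy as np

def match_template(image, template):
    """Slide the template over the image and, at every position, count how
    many pixels of the window differ from the template. The position with
    the lowest score is the best candidate for the object."""
    ih, iw = image.shape
    th, tw = template.shape
    scores = np.empty((ih - th + 1, iw - tw + 1), dtype=int)
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            window = image[y:y + th, x:x + tw]
            scores[y, x] = np.sum(window != template)  # mismatching pixels
    return scores

# Toy binary image containing a 2x2 "object" of ones.
image = np.array([[0, 0, 0, 0],
                  [0, 1, 1, 0],
                  [0, 1, 1, 0],
                  [0, 0, 0, 0]])
template = np.ones((2, 2), dtype=int)

scores = match_template(image, template)
best = tuple(int(v) for v in np.unravel_index(np.argmin(scores), scores.shape))
print(best)  # (1, 1): the top-left corner of the best match
```

Note how fragile this is: a single distorted pixel changes the score, and a slightly larger, smaller or rotated object would not match the template at all.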
Instead of looking at the whole image and trying to find out whether the object is in that image or not, we can look at smaller parts of the image and object-images. And that is exactly what a convolutional neural network does.
In general, a convolutional neural network consists of 3 building blocks which can be stacked:
- convolutional layer
- normalization step
- pooling layer
and it ends with one or more fully connected layers.
[ Normally, the convolution, normalization and pooling layers are repeated 4-8 times, but in Google-like applications this can easily go up to 20+ times … ]
If you still don’t have a clue what I’m talking about, hopefully the next example will make everything a bit clearer.
In this next example we walk through a convolutional neural network which has already been trained. But before we start, we first have to talk about features and filters. When you start working on a CNN, you have to define some parameters, and two of these parameters are the number of filters you want to use and the size of those filters.
But what are these filters?
In fact, these filters help us to detect patterns in the image. Each filter starts out as a set of randomly generated values, but as soon as the convolutional neural network starts to learn, each value adapts itself in such a way that the final error becomes increasingly smaller. In other words, by feeding more images to the network (feed forward) and calculating the error, we can adapt each and every filter in such a way that the next time, the error is just a little bit smaller (back-propagation with gradient descent).
However, for our example we will only use 2 filters that have already been trained.
Now that we have some filters (remember that they might contain random values at the beginning), we can use them as our sliding window. In a convolutional layer, we slide (just like before) filters over our images and calculate how well each filter matches the underlying content. But instead of only counting all the matching pixels (+1), we will also penalise all the mismatching pixels (-1). The result of this is called a convolution map or feature map.
[Note: when a CNN calculates these values on a coloured image, it might use the dot product.]
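The convolution step can be sketched the same way as the sliding window before, except the score now rewards matches (+1) and penalises mismatches (-1). A minimal NumPy sketch with toy values: on images whose pixels are +1/-1, this match/mismatch score is exactly the mean of the element-wise product of window and filter.

```python
import numpy as np

def feature_map(image, filt):
    """Slide a filter over the image; at each position, score +1 for every
    matching pixel and -1 for every mismatch, then average. With pixel
    values of +1/-1 this equals the mean of the element-wise product."""
    ih, iw = image.shape
    fh, fw = filt.shape
    out = np.empty((ih - fh + 1, iw - fw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            window = image[y:y + fh, x:x + fw]
            out[y, x] = np.mean(window * filt)  # +1 match, -1 mismatch
    return out

# A toy 3x3 "slash" filter (diagonal of +1s) on a +1/-1 image.
slash = np.array([[-1, -1,  1],
                  [-1,  1, -1],
                  [ 1, -1, -1]])
# 5x5 image: the slash pattern padded with background (-1) pixels.
image = np.pad(slash, 1, constant_values=-1)

fmap = feature_map(image, slash)
print(fmap[1, 1])  # 1.0: a perfect match in the centre of the image
```

The resulting feature map is high where the filter pattern occurs and low elsewhere; one such map is produced per filter.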
If you’ve played with the above figure, you’ll notice how well each filter is represented at a certain position. You’ll also see that the slash and backslash filters are able to create diagonal shapes in the convolution map. And last but not least, if you combine the two convolution maps or feature maps (e.g. as with the cross-image), you’ll notice that you might end up with a cross-like image.
Some extra info:
Convolution: the act of convolving an image with multiple filters/features
A layer: this operation can be stacked on top of other layers, e.g. normalization layers or pooling layers, …
#Outputs: the number of filters determines the number of convolution maps. There are as many filtered images as there are filters used.
Normalization / Activation function layer
In general, when we use a neural network, we use activation functions to create non-linearity. If we don’t use them, the network acts as a basic linear transformation (linear regression) and that is not very useful for computing complex tasks. Therefore, when we do add non-linear activation functions (like sigmoid, tanh, ReLU, …), a multi-layer feed-forward network becomes a universal function approximator. This means that it should be able to approximate every thinkable mathematical function.
That being said, in our example we will use the ReLU activation function, which is the same as replacing all the negative values by zero.
Using activation layers is actually a must. As said before, if you skip these layers, no magic will happen.
Also, when you chain multiple convolutional layers together without any activation layer in between, they behave the same as one single convolution.
In the last step of our convolution cycle, we’re going to execute some pooling.
Pooling is nothing more than shrinking the convolution map via another sliding window.
To be concrete, when we slide our window over a convolution map, we store the maximum value of that window in a new image. As a result, we have a similar image, just a little bit smaller. Instead of a 5×5 image, we now have a 2×2 image.
You might wonder why this is useful. For a small image like ours it doesn’t make much of a difference, but imagine a photo of 4128×3096. In that case, shrinking is really recommended.
Also, pooling does not take into account at which position the maximum value was found, which makes it less sensitive to distorted pixels.
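Max pooling as described above fits in one small function. A minimal sketch with a made-up 4×4 feature map, a 2×2 window and a stride of 2:

```python
import numpy as np

def max_pool(image, size=2, stride=2):
    """Slide a window over the image and keep only the maximum value of
    each window, producing a smaller image."""
    h = (image.shape[0] - size) // stride + 1
    w = (image.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = image[y * stride:y * stride + size,
                              x * stride:x * stride + size].max()
    return out

# Toy 4x4 feature map; pooling halves it to 2x2.
fmap = np.array([[0.1, 0.4, 0.3, 0.2],
                 [0.9, 0.2, 0.1, 0.8],
                 [0.3, 0.1, 0.5, 0.1],
                 [0.2, 0.6, 0.2, 0.4]])
print(max_pool(fmap))  # each 2x2 block is reduced to its maximum
```

Notice that 0.9 survives pooling no matter where exactly it sat inside its 2×2 block, which is why pooling gives some tolerance to small shifts and distorted pixels.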
Fully connected layers
As you should know by now, the power of a CNN lies in the combination of all these layers, especially when you stack multiple cycles on top of each other. Also, as we discussed before, it is strongly advised to use an activation layer after each convolutional layer. But this does not mean that you have to add a pooling layer after each activation layer. (In the next figure, you can see a possible architecture.)
Last but not least, we have to create some fully connected layers, or a (normal) neural network, which tells us whether there is an object in the image. To do this, we have to stretch the values of the last layer into one big vector and feed this vector into the network.
And that is the way you do it. 🙂
You’ve basically seen how a CNN works. If you still have some questions or remarks, feel free to leave a message or to send us an e-mail.
Nevertheless, I do realize that not everything is covered in depth, but the purpose of this blog is mainly to strike a good balance between exploring a few technical aspects and the key concepts of a CNN.