Convolutional Neural Network With Tensorflow and Keras

In this guide we will learn how to perform image classification and object detection/recognition using a convolutional neural network, a core technique in computer vision.

The goal of our convolutional neural networks will be to classify and detect images or specific objects from within the image. We will be using the image data as our features and the class of each image as our label or output.

We already know how neural networks work so we can skip through the basics and move right into explaining the following concepts.

The major differences we are about to see in these types of neural networks are the layers that make them up.

Now we are about to deal with image data, which is usually made up of 3 dimensions: image height, image width, and color channels.

The only one of these dimensions you may not understand is color channels. The number of color channels represents the depth of an image and correlates to the colors used in it. For example, an image with three channels is likely made up of RGB (red, green, blue) pixels. So, for each pixel we have three numeric values in the range 0–255 that define its color. For an image of color depth 1 we would likely have a greyscale image with one value defining each pixel, again in the range of 0–255.

Keep this in mind as we discuss how our network works and the input/output of each layer.

Note: I will use the terms convnet and convolutional neural network interchangeably.

Each convolutional neural network is made up of one or many convolutional layers. These layers are different from the dense layers we have seen previously. Their goal is to find patterns within images that can be used to classify the image or parts of it. This may sound similar to what our densely connected neural network in the previous section was doing, and that's because it is.

The fundamental difference between a dense layer and a convolutional layer is that dense layers detect patterns globally while convolutional layers detect patterns locally. When we have a densely connected layer, each node in that layer sees all the data from the previous layer. This means that this layer is looking at all the information and is only capable of analyzing the data in a global capacity. Our convolutional layer, however, will not be densely connected; it can detect local patterns using just part of the input data to that layer.

Let’s have a look at how a densely connected layer would look at an image vs how a convolutional layer would.

This is our image; the goal of our network will be to determine whether this image is a cat or not.

Dense Layer: A dense layer will consider the ENTIRE image. It will look at all the pixels and use that information to generate some output.

Convolutional Layer: The convolutional layer will look at specific parts of the image. In this example let’s say it analyzes the highlighted parts below and detects patterns there.

Can you see why this might make these networks more useful?

A dense neural network learns patterns that are present in one specific area of an image. This means that if a pattern the network knows appears in a different area of the image, the network has to learn the pattern again in that new area to be able to detect it.

Let’s use an example to better illustrate this.

We’ll consider that we have a dense neural network that has learned what an eye looks like from a sample of dog images.

Let’s say it’s determined that an image is likely to be a dog if an eye is present in the boxed off locations of the image above.

Now let’s flip the image.

Since our densely connected network has only recognized the pattern in one location, it will look where it thinks the eyes should be present. Clearly it does not find them there, so it would likely determine that this image is not a dog, even though the pattern of the eyes is present, just in a different location.

Since convolutional layers learn and detect patterns from different areas of the image, they don’t have problems with the example we just illustrated. They know what an eye looks like and by analyzing different parts of the image can find where it is present.

In our models it is quite common to have more than one convolutional layer. Even the basic example we will use in this guide will be made up of 3 convolutional layers. These layers work together by increasing complexity and abstraction at each subsequent layer. The first layer might be responsible for picking up edges and short lines, while the second layer will take as input these lines and start forming shapes or polygons. Finally, the last layer might take these shapes and determine which combinations make up a specific image.

You may see me use the term feature map throughout this tutorial. This term simply stands for a 3D tensor with two spatial axes (width and height) and one depth axis. Our convolutional layers take feature maps as their input and return a new feature map that represents the presence of specific filters from the previous feature map. These are what we call response maps.

A convolutional layer is defined by two key parameters: the number of filters and the sample size (the size of each filter).

A filter is an m x n pattern of pixels that we are looking for in an image. The number of filters in a convolutional layer represents how many patterns each layer is looking for and what the depth of our response map will be. If we are looking for 32 different patterns/filters, then our output feature map (aka the response map) will have a depth of 32. Each one of the 32 layers of depth will be a matrix of some size containing values indicating whether the filter was present at that location or not.
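
Here is a minimal sketch of that idea using a Keras Conv2D layer: passing one dummy image through a layer with 32 filters of size 3x3 produces a response map with a depth of 32. The random input and the exact sizes are just for illustration.

import tensorflow as tf

# One random "image": batch of 1, 32x32 pixels, 3 color channels.
x = tf.random.normal((1, 32, 32, 3))

# A convolutional layer looking for 32 different 3x3 patterns/filters.
conv = tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu')

y = conv(x)
print(y.shape)  # (1, 30, 30, 32) -> the response map has a depth of 32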

Here’s a great illustration from the book “Deep Learning with Python” by Francois Chollet (pg 124).

"Sample size" isn't really the best term to describe this, but each convolutional layer is going to examine n x m blocks of pixels in each image. Typically, we'll consider 3x3 or 5x5 blocks. In the example above we use a 3x3 sample size. This size will be the same as the size of our filter.

Our layers work by sliding these filters of n x m pixels over every possible position in our image and populating a new feature map/response map indicating whether the filter is present at each location.

The more mathematical of you may have realized that if we slide a filter of, let's say, size 3x3 over our image, we'll consider fewer positions for our filter than there are pixels in our input. Look at the example below.

Image from “Deep Learning with Python” by Francois Chollet (pg 126).

This means our response map will have a slightly smaller width and height than our original image. This is fine but sometimes we want our response map to have the same dimensions. We can accomplish this by using something called padding.

Padding is simply the addition of the appropriate number of rows and/or columns to your input data such that the filter can be centered on every pixel.

In the previous sections we assumed that the filter would be slid continuously through the image such that it covered every possible position. This is common, but sometimes we introduce the idea of a stride to our convolutional layer. The stride size represents how many rows/columns we move the filter by each time. Strides are not used very frequently, so we'll move on.
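
As a quick illustrative sketch (the input size here is arbitrary), this is how padding and stride settings change the shape of the response map in Keras:

import tensorflow as tf

x = tf.random.normal((1, 28, 28, 3))  # one 28x28 RGB image

# 'valid' means no padding: the 3x3 filter fits in fewer positions than there are pixels.
print(tf.keras.layers.Conv2D(32, (3, 3), padding='valid')(x).shape)  # (1, 26, 26, 32)

# 'same' pads the borders so the response map keeps the input's width and height.
print(tf.keras.layers.Conv2D(32, (3, 3), padding='same')(x).shape)   # (1, 28, 28, 32)

# A stride of 2 moves the filter two pixels at a time, halving the spatial size.
print(tf.keras.layers.Conv2D(32, (3, 3), strides=2, padding='same')(x).shape)  # (1, 14, 14, 32)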

You may recall that our convnets are made up of a stack of convolution and pooling layers.

The idea behind a pooling layer is to downsample our feature maps and reduce their dimensions. Pooling layers work in a similar way to convolutional layers: they extract windows from the feature map and return a response map of the max, min or average value of each window per channel. Pooling is usually done using windows of size 2x2 and a stride of 2, which reduces the width and height of the feature map by a factor of two.
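
A small sketch of max pooling in Keras (the feature-map size is just an example):

import tensorflow as tf

x = tf.random.normal((1, 26, 26, 32))  # a feature map from a previous conv layer

# 2x2 windows with a stride of 2: keep the largest value in each window,
# halving the width and height while leaving the depth (32) untouched.
pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)
print(pooled.shape)  # (1, 13, 13, 32)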

Please refer to the video by the legendary Andrew Ng to learn how all of this happens at a lower level!

Now it is time to create our first convnet! This example is for the purpose of getting familiar with CNN architectures; we will talk about how to improve its performance later.

The labels in this dataset are the following:

We’ll load the dataset and have a look at some of the images below.
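
The text does not name the dataset explicitly; given the 32x32x3 input shape and the 10 output classes used below, the sketch here assumes CIFAR-10, loaded straight from Keras:

import tensorflow as tf
import matplotlib.pyplot as plt

# Load the dataset and scale pixel values from 0-255 down to 0-1.
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.cifar10.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0

# Display a few of the training images.
for i in range(4):
    plt.subplot(1, 4, i + 1)
    plt.imshow(train_images[i])
    plt.axis('off')
plt.show()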

A common architecture for a CNN is a stack of Conv2D and MaxPooling2D layers followed by a few densely connected layers. The idea is that the stack of convolutional and max pooling layers extracts the features from the image. These features are then flattened and fed to densely connected layers that determine the class of an image based on the presence of features.

We will start by building the Convolutional Base.
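
Here is a sketch of that convolutional base, matching the layer-by-layer description that follows:

from tensorflow.keras import layers, models

model = models.Sequential()
# Layer 1: 32 filters of size 3x3 over the 32x32x3 input, with relu activation.
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
# Layer 2: max pooling with 2x2 samples and a stride of 2.
model.add(layers.MaxPooling2D((2, 2)))
# The remaining layers repeat the pattern, increasing the number of filters to 64.
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

model.summary()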

Layer 1

The input shape of our data will be (32, 32, 3) and we will apply 32 filters of size 3x3 over our input data. We will also apply the activation function relu to the output of each convolution operation.

Layer 2

This layer will perform the max pooling operation using 2x2 samples and a stride of 2.

Other Layers

The next set of layers do very similar things but take as input the feature map from the previous layer. They also increase the number of filters from 32 to 64. We can do this because our data shrinks in its spatial dimensions as it passes through the layers, meaning we can afford (computationally) to add more depth.

After looking at the summary you should notice that the depth of our feature maps increases while the spatial dimensions shrink drastically.

So far, we have just completed the convolutional base. Now we need to take these extracted features and add a way to classify them. This is why we add the following layers to our model.
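
A sketch of that classifier, added on top of the convolutional base from before (the 64-node hidden layer and the 10-neuron output match the description below):

from tensorflow.keras import layers

# Flatten the final feature map and classify it with densely connected layers.
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))  # one output neuron per class

model.summary()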

We can see that the flatten layer changes the shape of our data so that we can feed it to the 64-node dense layer, followed by the final output layer of 10 neurons (one for each class).

Now we will compile and train the model using the recommended hyperparameters from TensorFlow.
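
A sketch of the compile and fit calls; the optimizer, loss, and epoch count follow the standard TensorFlow CNN tutorial settings and should be treated as assumptions, and train_images/train_labels come from the loading sketch earlier:

import tensorflow as tf

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train on the training images and validate on the test images.
history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))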

Note: This will take much longer than previous models!

We can determine how well the model performed by looking at its performance on the test dataset.
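
For example:

test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print(test_acc)  # roughly 0.70 for this small model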

You should be getting an accuracy of about 70%. This isn’t bad for a simple model like this, but we’ll dive into some better approaches for computer vision below.

In the situation where you don't have millions of images, it is difficult to train a CNN from scratch that performs very well. This is why we will learn about a few techniques we can use to train CNNs on small datasets of just a few thousand images.

To avoid overfitting and create a larger dataset from a smaller one we can use a technique called data augmentation. This is simply performing random transformations on our images so that our model can generalize better. These transformations can be things like compressions, rotations, stretches and even color changes.

Fortunately, Keras can help us do this. Look at the code below for an example of data augmentation.
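
A sketch using Keras' ImageDataGenerator; the specific transformation ranges here are illustrative choices, not values from the original text:

from tensorflow.keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt

# Each original image can be turned into many randomly transformed copies.
datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.15,
    height_shift_range=0.15,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

# Show a few augmented versions of one training image.
img = train_images[0].reshape((1,) + train_images[0].shape)
for i, batch in enumerate(datagen.flow(img, batch_size=1)):
    plt.imshow(batch[0])
    plt.show()
    if i >= 3:
        break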

You may have noticed that the model above takes a few minutes to train and only gives an accuracy of ~70%. This is okay, but surely there is a way to improve on this.

In this section we will talk about using a pretrained CNN as part of our own custom network to improve the accuracy of our model. We know that CNNs alone (with no dense layers) don't do anything other than map the presence of features from our input. This means we can use a pretrained CNN, one trained on millions of images, as the start of our model. This will allow us to have a very good convolutional base before adding our own dense layered classifier at the end. In fact, by using this technique we can train a very good classifier for a relatively small dataset (< 10,000 images). This is because the convnet already has a very good idea of what features to look for in an image and can find them very effectively. So, if we can determine the presence of features, all the rest of the model needs to do is determine which combination of features makes up a specific image.

When we employ the technique defined above, we will often want to tweak the final layers in our convolutional base to work better for our specific problem. This involves not touching or retraining the earlier layers in our convolutional base, only adjusting the final few. We do this because the first layers in our base are very good at extracting low-level features like lines and edges, things that are similar for any kind of image, whereas the later layers are better at picking up very specific features like shapes or even eyes. If we adjust the final layers, then we can look for only the features relevant to our very specific problem.

In this section we will combine the techniques we learned above and use a pretrained model and fine tuning to classify images of dogs and cats using a small dataset.

This dataset contains (image, label) pairs where images have different dimensions and 3 color channels.
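
The text does not say how the dataset is loaded; a common choice, assumed in this sketch, is the cats_vs_dogs dataset from TensorFlow Datasets:

import tensorflow as tf
import tensorflow_datasets as tfds

# Split the training data into train/validation/test (image, label) pairs.
(raw_train, raw_validation, raw_test), metadata = tfds.load(
    'cats_vs_dogs',
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True)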

Since the sizes of our images are all different, we need to convert them all to the same size. We can create a function that will do that for us below.
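
A sketch of such a function; the 160x160 target size and the [-1, 1] scaling (which MobileNet V2 expects) are assumptions:

IMG_SIZE = 160  # every image will be resized to IMG_SIZE x IMG_SIZE

def format_example(image, label):
    image = tf.cast(image, tf.float32)
    image = (image / 127.5) - 1          # scale pixel values to the range [-1, 1]
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    return image, label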

Now we can apply this function to all our images using .map().
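
Continuing the sketch above (raw_train, raw_validation, and raw_test come from the assumed loading code earlier):

train = raw_train.map(format_example)
validation = raw_validation.map(format_example)
test = raw_test.map(format_example)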

Let’s have a look at our images now.


for image, label in train.take(2):
    plt.imshow((image + 1) / 2)  # shift values from [-1, 1] back to [0, 1] for display

Now if we look at the shape of an original image vs the new image we will see it has been changed.
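
Using the names from the sketches above:

for original_image, _ in raw_train.take(1):
    print("Original shape:", original_image.shape)

for resized_image, _ in train.take(1):
    print("New shape:", resized_image.shape)  # (160, 160, 3)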


The model we are going to use as the convolutional base for our model is the MobileNet V2 developed at Google. This model is trained on 1.4 million images and has 1000 different classes.

We want to use this model but only its convolutional base. So, when we load in the model, we'll specify that we don't want to load the top (classification) layer. We'll tell the model what input shape to expect and to use the predetermined weights learned on ImageNet.
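
A sketch of loading the base (IMG_SIZE comes from the resizing step above):

IMG_SHAPE = (IMG_SIZE, IMG_SIZE, 3)

# Load MobileNet V2 without its top classification layer, using ImageNet weights.
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')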

The term freezing refers to disabling the training property of a layer. It simply means we won’t make any changes to the weights of any layers that are frozen during training. This is important as we don’t want to change the convolutional base that already has learned weights.
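
In Keras, freezing the whole base is a single flag, continuing the sketch above:

base_model.trainable = False  # the base's learned weights will not be updated during training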

Now that we have our base layer set up, we can add the classifier. Instead of flattening the feature map of the base layer, we will use a global average pooling layer that averages over the entire 5x5 spatial area of each 2D feature map and returns a single 1280-element vector per image (one value per filter).

Finally, we will add the prediction layer, which will be a single dense neuron. We can do this because we only have two classes to predict.

Now we will combine these layers together in a model.
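
Putting those pieces together as a sketch:

# Average each 2D feature map down to one value per channel, then predict
# cat vs dog with a single dense neuron.
global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
prediction_layer = tf.keras.layers.Dense(1)

model = tf.keras.Sequential([
    base_model,
    global_average_layer,
    prediction_layer
])
model.summary()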

Now we will compile and train the model. We will use a very small learning rate to ensure that the pretrained base does not have any major changes made to it.
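
A sketch of the final compile and fit; the optimizer, learning rate, batch size, and epoch count are assumptions, not values from the original text:

base_learning_rate = 0.0001  # small, so the pretrained weights barely move

model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=base_learning_rate),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Shuffle and batch the (image, label) pairs before fitting.
train_batches = train.shuffle(1000).batch(32)
validation_batches = validation.batch(32)

history = model.fit(train_batches, epochs=3, validation_data=validation_batches)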

And that’s it for this section on computer vision!

If you'd like to learn how you can perform object detection and recognition with TensorFlow, check out the guide below.
