How do Computers See?

(This is part 3 in a series of posts on artificial intelligence and deep learning/neural networks. You can check out part 1 and part 2 if you haven’t yet read them and are new to AI)

There was a time when artificial intelligence lived only in our most creative imaginations. Yet, isn’t that where technology is born? In our imaginative minds? Though it is tempting to jump right into the technological advances that drive AI, we must first take a trip back in time and look at how far we have come since Samuel Butler wrote in his 1872 novel Erewhon,

“There is no security against the ultimate development of mechanical consciousness, in the fact of machines possessing little consciousness now. A jellyfish has not much consciousness. Reflect upon the extraordinary advance which machines have made during the last few hundred years, and note how slowly the animal and vegetable kingdoms are advancing. The more highly organized machines are creatures not so much of yesterday, as of the last five minutes, so to speak, in comparison with past time.”

From Karel Čapek’s 1920 play R.U.R., which depicted a race of mass-produced robot slaves who rose up and revolted against their human masters, to the Star Trek android Data, humans have long imagined the day machines would become intelligent.

Today, not only is AI a reality, but it is changing the very way we live and work. From the AI in autonomous vehicles, which lets them detect one another on the road, to Google’s AI voice assistant, we are surrounded by artificial intelligence, often without realizing it. The question most people ask is, “How does it all work?”

I could not answer that in one article. I will, however, try to cover a small subset of AI today that has given computers an ability most humans take for granted, but would greatly miss if it were taken away…the power of sight!

The Problem

Why has recognizing an image been so hard for computers and so easy for humans? The answer boils down to the algorithms used for both. Algorithms? Wait, our brains don’t have algorithms, do they??

I, and many others, do believe our brains have algorithms…a set of laws (physics) that are followed, which allow our brain to take data from our senses and transform it into something our consciousness can classify and understand.

Computer algorithms for vision have been nowhere near as sophisticated as our biological algorithms. That is, until now.

Artificial Neural Networks Applied to Vision

(If you haven’t been introduced to neural networks yet, please check out this post first to get a quick introduction to the amazing world of ANNs)

Artificial neural networks (ANNs) have been around for a while now, but recently a particular type of ANN has broken records in computer vision competitions and changed what we thought was possible in this problem space. We call this type of ANN a convolutional neural network.

Convolutional Neural Networks

Convolutional neural networks, also known as ConvNets or CNNs, are among the most effective computational models for performing certain tasks, such as pattern recognition. Yet, despite their importance to aspiring developers, many struggle to understand just what CNNs are and how they work. To penetrate the mystery, we will work with the common application of CNNs to computer vision, which begins with a matrix of pixels. Then we’ll go layer by layer, and operation by operation, through the CNN’s deep structure, finally arriving at its output: the identification of a cloud, cat, tree, or whatever the CNN’s best guess is about what it’s looking at.

High-Level Architecture of a CNN

CNNArchitecture

source: ResearchGate.com

Here you can see the conceptual architecture of a typical (simple) CNN. To come up with a reasonable interpretation of what it’s witnessing, a CNN performs four essential operations, each corresponding to a type of layer found in its network.

These four essential operations (illustrated above) in a CNN are:

The Convolution Layer
The ReLU activation function
Pooling/subsampling Layer
Fully Connected ANN (Classification Layer)

The input is passed through each of these layers and will be classified in the output. Now let’s dig a little bit deeper into how each of these layers works.

The Input: A Matrix of Pixels

To keep things simple, we’ll only concern ourselves with the most common task CNNs perform: pattern or image recognition. Technically, a computer doesn’t see an image, but a matrix of pixels, each of which has three components: red, green and blue. Therefore, a 1,000-pixel image for us is 3,000 numbers for a computer, one intensity value for each color component of each pixel. The result is a matrix of 3,000 precise intensities, which the computer must somehow interpret as one or more objects.
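To make the counting concrete, here is a minimal sketch in plain Python. The tiny 2 x 2 image and its color values are made up for illustration, and I am assuming the common 8-bit convention where each color component runs from 0 to 255.

```python
# A tiny 2 x 2 RGB image as nested lists: rows -> pixels -> (red, green, blue).
# The values are invented for illustration; each component is assumed to be 0..255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def count_pixels(img):
    """Number of pixels a human would count (height * width)."""
    return sum(len(row) for row in img)


def flatten_components(img):
    """List every color intensity in row-major order, as the computer stores it."""
    return [value for row in img for pixel in row for value in pixel]


num_pixels = count_pixels(image)
values = flatten_components(image)
print(num_pixels, "pixels become", len(values), "numbers")
```

The same ratio holds at any size: three numbers for every pixel.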

The Convolution Layer

The first key point to remember about the convolutional layer is that all of its units, or artificial neurons, are looking at distinct, but slightly overlapping, areas of the pixel matrix. Teachers and introductory texts often use the metaphor of parallel flashlight beams to help explain this idea. Suppose you have a parallel arrangement of flashlights with each of the narrow beams fixated on a different area of a large image, such as a billboard. The disk of light created by each beam on the billboard will overlap slightly with the disks immediately adjacent to it. The overall result is a grid of slightly overlapping disks of light.
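If the flashlight picture feels abstract, the following sketch lists where each "beam" lands. The 5 x 5 image, the 3 x 3 beam size, and the step of 2 pixels between beams (usually called the stride) are all assumptions I picked so that neighbors overlap only slightly.

```python
def window_positions(image_size, window, stride):
    """Top-left corner of every window that fits in a square image (row-major)."""
    starts = range(0, image_size - window + 1, stride)
    return [(row, col) for row in starts for col in starts]


def covered_columns(left, window):
    """Set of column indices lit up by a window starting at column `left`."""
    return set(range(left, left + window))


# Hypothetical numbers: a 5 x 5 image, 3 x 3 windows, moving 2 pixels at a time.
positions = window_positions(image_size=5, window=3, stride=2)

# Two horizontally adjacent windows start at columns 0 and 2.
overlap = covered_columns(0, 3) & covered_columns(2, 3)
print(positions)
print("shared columns:", overlap)
```

With these numbers, each window shares exactly one column (or row) with its neighbor, which is the slight overlap of the disks of light on the billboard.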

featureMap

source: i.stack.imgur.com/GvsBA.jpg

The second point to remember about the convolution layer is that those units, or flashlights if you prefer, are all looking for the same pattern in their respective areas of the image. Collectively, the set of units searching for that one pattern is called a filter, and in practice a convolutional layer usually holds many filters, each searching for a different pattern. The method a filter uses to search for its pattern is convolution.

The complete process of convolution involves some rather heavy mathematics. However, we can still understand it from a conceptual point of view, while only touching on the math in passing. To begin, every unit in a convolutional layer shares the same set of weights that it uses to recognize a specific pattern. This set of weights is generally pictured as a small, square matrix of values. The small matrix interacts with the larger pixel matrix that makes up the original image. For example, if the small matrix, technically called a convolution kernel, is a 3 x 3 matrix of weights, then it will cover a 3 x 3 array of pixels in the image. Naturally, there is a one-to-one relationship, in terms of size, between the 3 x 3 convolution kernel and the 3 x 3 section of the image it covers. With this in mind, you can easily multiply each weight in the kernel by the corresponding pixel value in the section of the image at hand. The sum of those products, technically called the dot product, generates a single value that the system assigns to that section of the new, filtered version of the image. The kernel then moves on to the next section and the calculation repeats until the whole image has been covered. This filtered image, known as the feature map, then serves as the input for the next layer in the ConvNet described below.
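The multiply-and-sum step is easier to see in code than in words. Below is a minimal sketch in plain Python, not how a real library implements it. It assumes a single-channel (grayscale) image, a square kernel, a step of one pixel at a time, and no padding around the edges. The image and kernel values are made up, and like most deep learning libraries the sketch does not flip the kernel before multiplying.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image and return the feature map."""
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    feature_map = []
    for top in range(out_height):
        row = []
        for left in range(out_width):
            # Dot product of the kernel with the k x k patch it currently covers.
            total = 0
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * image[top + i][left + j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Made-up 4 x 5 image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
]

# Made-up kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 27, 27], [0, 27, 27]]
```

The zeros mark patches where nothing changes, and the large values mark patches that contain the edge. That is all "searching for a pattern" means here: a high number where the patch resembles the kernel.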

It’s important to note at this point that units in a convolutional layer of a ConvNet, unlike units in a layer of a fully-connected network, are not connected to every unit in their adjacent layers. Rather, a unit in a convolutional layer is only connected to the small set of input units it is focused on. Here, the flashlight analogy is again useful. You can think of a unit in a convolutional layer as a flashlight that knows nothing about the rest of the billboard. It is only connected to the section of the image that it lights up.

The ReLU Activation Function

The rectified linear unit, or ReLU, performs the rectification operation on the feature map, which is the output of the convolution layer. Rectification simply replaces every negative value with zero and leaves positive values unchanged. This introduces non-linearity into the CNN; without it, a stack of layers would collapse into a single linear operation, no matter how deep. The network’s weights are then trained and tuned using a feedback process known as back-propagation. Non-linearity matters because most real-world problems, image recognition included, are inherently nonlinear.

relufamily

source: datasciencecentral.com

Above you can see three different implementations of a ReLU activation function (the most basic being just the ReLU). Different variants are used for different problems, depending on which one works best for the data at hand.
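As a rough sketch, the basic ReLU and one common variant look like this in plain Python. The leaky ReLU and its slope of 0.01 are my own choice of example and are not necessarily one of the three variants pictured. The feature map values are made up.

```python
def relu(x):
    """Basic ReLU: negative values become zero, positive values pass through."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """A common variant: negative values are shrunk instead of zeroed."""
    return x if x > 0 else slope * x


def apply_activation(feature_map, activation):
    """Apply an activation function to every value in a 2D feature map."""
    return [[activation(value) for value in row] for row in feature_map]


# Made-up feature map with a mix of positive and negative values.
feature_map = [[-3.0, 2.0], [0.5, -1.0]]

rectified = apply_activation(feature_map, relu)
leaky = apply_activation(feature_map, leaky_relu)
print(rectified)  # [[0.0, 2.0], [0.5, 0.0]]
print(leaky)
```

Notice that both functions are just a bend in a straight line at zero. That small bend is enough to make the network as a whole nonlinear.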

The Pooling Layer

The more intricate the patterns the CNN searches for, the more convolution and ReLU layers are necessary. However, as layer after layer is progressively added, the computational complexity quickly becomes unwieldy.

source: wiki.tum.de

Another layer, called the pooling or subsampling layer, is now needed to keep the computational complexity from getting out of control. The pooling layer shrinks each feature map by summarizing every small neighborhood of values with a single number, for example the largest value in each 2 x 2 block. This reduces the amount of data the later layers must process while keeping the strongest, most relevant responses.
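Here is a minimal sketch of one popular form of pooling, max pooling, in plain Python. I am assuming non-overlapping 2 x 2 blocks and simply dropping any leftover row or column. Other pooling schemes exist, and the numbers in the feature map are made up.

```python
def max_pool2d(feature_map, size=2):
    """Keep only the largest value in each non-overlapping size x size block."""
    out_height = len(feature_map) // size
    out_width = len(feature_map[0]) // size
    pooled = []
    for r in range(out_height):
        row = []
        for c in range(out_width):
            block = [
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(block))
        pooled.append(row)
    return pooled


# Made-up 4 x 4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 2], [2, 8]]
```

Sixteen values went in and four came out, so every layer after this one has a quarter of the data to deal with.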

The Classification Layer

Finally, the CNN requires one or more layers to classify the output of all previous layers into categories, such as cloud, cat, or tree.

The most obvious characteristic that distinguishes a classification layer from other layers in a CNN is that a classification layer is fully-connected. This means that it resembles a classic neural network (which we discussed in part 2), with the units in each layer connected to all of the units in their adjacent layers. Accordingly, classification layers often go by the name fully-connected layers, or FCs.
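To show what "fully-connected" means in practice, here is a small sketch in plain Python. Every output unit computes a weighted sum over all of the inputs. The feature values, weights, and biases are invented for illustration (a real network learns them during training), and the softmax step that turns scores into probabilities is a common choice that I am assuming here.

```python
import math


def dense(inputs, weights, biases):
    """Fully-connected layer: every output unit sees every input value."""
    return [
        sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        for unit_weights, bias in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["cloud", "cat", "tree"]

# Hypothetical flattened output of the earlier layers.
features = [0.0, 2.0, 1.0, 0.5]

# Hypothetical weights: one row per category, one weight per input feature.
weights = [
    [0.1, 0.2, 0.0, 0.4],  # cloud
    [0.3, 1.0, 0.5, 0.0],  # cat
    [0.0, 0.1, 0.2, 0.1],  # tree
]
biases = [0.0, 0.0, 0.0]

scores = dense(features, weights, biases)
probabilities = softmax(scores)
best_guess = labels[probabilities.index(max(probabilities))]
print(best_guess)  # cat
```

The category with the highest probability is the network’s best guess.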

Depth and Complexity

Most CNNs are deep neural networks, meaning their architecture is quite complex, with dozens of layers. You might have, for example, a series of four alternating convolution and ReLU layers, followed by a pooling layer. Then this entire series of layers might, in turn, repeat several times before introducing a final series of fully-connected layers to classify the output.
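To get a feel for how such a stack behaves, this sketch builds the layer sequence described above and tracks how the feature map shrinks along the way. The 64 x 64 input, the 3 x 3 kernels with no padding, the 2 x 2 pooling, and the choice of two repeats are all assumptions I picked for the example.

```python
def build_stack(convs_per_block=4, blocks=2):
    """Layer names: repeated (conv, relu) pairs, a pool per block, then one fc."""
    layers = []
    for _ in range(blocks):
        for _ in range(convs_per_block):
            layers += ["conv", "relu"]
        layers.append("pool")
    layers.append("fc")
    return layers


def trace_sizes(size, layers, kernel=3, pool=2):
    """Side length of a square feature map after each layer before the fc layers."""
    sizes = []
    for layer in layers:
        if layer == "fc":
            break  # fully-connected layers work on a flattened list, not a grid
        if layer == "conv":
            size = size - kernel + 1  # no padding, moving one pixel at a time
        elif layer == "pool":
            size = size // pool
        sizes.append(size)  # relu leaves the size unchanged
    return sizes


layers = build_stack()
sizes = trace_sizes(64, layers)
print(len(layers), "layers")
print("feature map side after each layer:", sizes)
```

Under these assumptions the 64 x 64 input is down to 10 x 10 by the time it reaches the fully-connected layers, with most of the shrinking done by the two pooling layers.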

Unraveling the Mystery of CNNs

Convolutional neural networks are deep, complex computational models that are ideal for performing certain tasks, such as image recognition.

source: computervisionblog.com

To understand how a CNN recognizes a pattern in an image, it’s valuable to go step by step through its operations and layers, beginning with its input: a matrix of pixel values. The first layer is the convolution layer, which uses the convolution operation to multiply a specific set of weights, the convolution kernel, by various sections of the image in order to filter for a particular pattern. The next layer is the ReLU layer, which introduces nonlinearity into the system so that the CNN can learn patterns a purely linear model could not. There may be a series of several alternations between convolution and ReLU layers before we reach the next layer, the pooling layer, which restricts the output to the most relevant patterns. The entire series of convolution, ReLU and pooling layers may, in turn, repeat several times before we reach the final classification layers. These are fully-connected layers that classify the CNN’s output into likely categories, such as cloud, cat, or tree.

architectureEmergent

source: mdpi.com

This is just a high-level look at how a typical CNN is architected. Experts use many variations in practice to tune their networks for their particular use cases. This is where the expertise comes into play. You may need to “tune” your network if the initial training does not produce results as accurate as you had hoped. This process is called “hyperparameter tuning,” and I will have to write another whole article just to cover it. For now, familiarize yourself with the basics of ANNs and CNNs, and come back soon to read about hyperparameter tuning!

As always, thanks so much for reading! Please tell me what you think or would like me to write about next in the comments. I’m open to criticism as well!

If you take the time to “like” or “share” the article, that would mean a lot to me. I write for free on my own time because I enjoy talking about technology, and the more people who read my articles, the more individuals I get to geek out with!

Thanks and have a great day!