The Curious Programmer

Software, Gadgets, Books, and All Things Geek

Understanding Recurrent Neural Networks: The Preferred Neural Network for Time-Series Data — June 26, 2017

Understanding Recurrent Neural Networks: The Preferred Neural Network for Time-Series Data

Artificial intelligence has been in the background for decades, kicking up dust in the distance, but never quite arriving. Well, that era is over. In 2017, AI has broken through the dust cloud and arrived in a big way. But why? What’s the big deal all of a sudden? And what do recurrent neural networks have to do with it? Well, a lot, actually. Thanks to an ingenious form of short-term memory that is unheard of in conventional neural networks, today’s recurrent neural networks (RNNs) have been proving themselves as powerful predictive engines. When it comes to certain sequential machine learning tasks, such as speech recognition, RNNs are reaching levels of predictive accuracy, time and time again, that no other algorithm can match. However, the first generation of RNNs was not so hot. They suffered from a serious setback in their error-tweaking process that held up their progress for decades. Finally, a major breakthrough came in the late 90s that led to a new generation of far more accurate RNNs. Building on that breakthrough for nearly twenty years, developers refined and perfected their new RNNs until all-star apps such as Google Voice Search and Apple’s Siri started snatching them up to power key processes. Now recurrent networks are showing up everywhere, and are helping to ignite the AI renaissance that’s unfolding right now.

Neural Networks That Cling to the Past

Most artificial neural networks, such as feedforward neural networks, have no memory of the input they received just one moment ago. For example, if you provide a feedforward neural network with the sequence of letters “WISDOM,” when it gets to “D,” it has already forgotten that it just read “S.” That’s a big problem. No matter how hard you train it, it will always struggle to guess the most likely next character: “O.” This makes it a rather crappy candidate for certain tasks, such as speech recognition, that greatly benefit from the capacity to predict what’s coming next. Recurrent networks, on the other hand, do remember what they’ve just encountered, and at a remarkably sophisticated level.

Let’s take the example of the input “WISDOM” again and apply it to a recurrent network. The unit, or artificial neuron, of the RNN, upon receiving the “D” also takes as its input the character it received one moment ago, the “S.” In other words, it adds the immediate past to the present. This gives it the advantage of a limited short-term memory that, along with its training, provides enough context for guessing what the next character is most likely to be: “O.”
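To make that concrete, here is a minimal sketch of a single recurrent step in Python with NumPy. The sizes, the random weights, and the tanh activation are illustrative assumptions, not a particular production implementation:

```python
import numpy as np

# Illustrative sizes: 27 possible characters in, 16 hidden units.
input_size, hidden_size = 27, 16
W_x = np.random.randn(hidden_size, input_size) * 0.01   # weights on the present input
W_h = np.random.randn(hidden_size, hidden_size) * 0.01  # weights on the previous state (the "memory")
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One tick of a vanilla RNN: mix the current input with the previous state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def one_hot(ch):
    """Encode an uppercase letter as a one-hot vector."""
    v = np.zeros(input_size)
    v[ord(ch) - ord('A')] = 1.0
    return v

# Feed "WISDOM" one character at a time; the hidden state carries the past along.
h = np.zeros(hidden_size)
for ch in "WISDOM":
    h = rnn_step(one_hot(ch), h)  # by the time we see "D", h still "remembers" the "S"
```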

Tweaking and Re-tweaking

If you like to get into the weeds, this is where you get excited. Otherwise, get ready for a rough patch. But hang in there, it’s worth it. Like all artificial neural networks, the units of an RNN assign a matrix of weights to their multiple inputs, then apply a function to those weighted inputs to determine a single output. However, recurrent networks apply weights not only to their present inputs, but also to their inputs from a moment ago. Then they adjust the weights assigned to their present and past inputs through a process that involves two key concepts that you’ll definitely want to know if you really want to get into AI: gradient descent and backpropagation through time (BPTT).

Gradient Descent

One of the most famous algorithms in machine learning is known as gradient descent. Its primary virtue is its remarkable capacity to sidestep the dreaded “curse of dimensionality.” This issue plagues systems, such as neural networks, that have far too many variables to make a brute-force calculation of their optimal values possible. Gradient descent, however, breaks the curse of dimensionality by following the slope of the multi-dimensional error, or cost, function downhill toward a local low-point, or local minimum. This helps the system determine the tweaked value, or weight, to assign to each of the units in the network, bringing accuracy back in line.
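As a rough illustration of the idea (a toy example, not the exact procedure used in any particular network), here is gradient descent finding the low point of a one-dimensional cost function; real networks do the same thing across millions of weight dimensions at once:

```python
def cost(w):
    return (w - 3.0) ** 2          # toy cost function with its minimum at w = 3

def cost_gradient(w):
    return 2.0 * (w - 3.0)         # slope of the cost at w

w = -5.0                           # start with a bad guess
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * cost_gradient(w)  # roll downhill a small step at a time

print(round(w, 4))                 # ends up very close to 3.0, the minimum
```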

Backpropagation Through Time

The RNN trains its units by adjusting their weights following a slight modification of a feedback process known as backpropagation. Okay, this is a weird concept. But if you’re into AI, you’ll learn to love it. The process of backpropagation works its way back, layer by layer, from the network’s final output, tweaking the weights of each unit, or artificial neuron, according to the unit’s calculated portion of the total output error. Got it? If so, get ready for one more layer of complexity. Recurrent neural networks use a heavier version of this process known as backpropagation through time (BPTT). This version extends the tweaking process to include the weights applied to the T-1 inputs responsible for each unit’s memory of the prior moment.
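Here is a deliberately tiny sketch of backpropagation through time, using a single scalar memory weight so the unrolling is easy to follow. The linear update rule and the squared-error loss are simplifying assumptions chosen only for illustration:

```python
# Toy RNN: h_t = w_h * h_{t-1} + x_t, with the loss measured on the final state only.
def forward(w_h, xs):
    h = 0.0
    states = [h]
    for x in xs:
        h = w_h * h + x
        states.append(h)
    return h, states

def bptt_gradient(w_h, xs, target):
    """Gradient of 0.5 * (h_T - target)^2 with respect to w_h, accumulated back through time."""
    h_T, states = forward(w_h, xs)
    grad_h = h_T - target              # error at the final output
    grad_w = 0.0
    for t in reversed(range(len(xs))):
        grad_w += grad_h * states[t]   # contribution of the step that produced h_{t+1}
        grad_h *= w_h                  # push the error one step further into the past
    return grad_w

xs, target, w_h = [1.0, 0.5, -0.3, 0.8], 2.0, 0.9
print(bptt_gradient(w_h, xs, target))
```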

Yikes: The Vanishing Gradient Problem

Despite enjoying some initial success with the help of gradient descent and BPTT, many artificial neural networks, including the first generation of RNNs, eventually ran out of gas. Technically, they suffered a serious setback known as the vanishing gradient problem. Although the details fall way outside the scope of this sweeping overview, the basic idea is pretty straightforward. First, let’s look at the notion of a gradient. Like its simpler relative, the derivative, you can think of a gradient as a slope. In the context of training a deep neural network, the larger the gradient, the steeper the slope, and the more quickly the system can roll downhill to the finish line and complete its training. But this is where developers ran into trouble: their slopes were too flat for fast training. This was particularly problematic in the first layers of their deep networks, which are the most critical when it comes to proper tweaking of memory units. Here the gradient values got so small, and their corresponding slopes so flat, that one could describe them as “vanishing,” thus the vanishing gradient problem. As the gradients got smaller and smaller, and thus flatter and flatter, the training times grew unbearably long. It was an error-correction nightmare without end.
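The numbers below show why those slopes flatten out. In a simple recurrent unit, the error signal gets multiplied by roughly the same factor for every step it travels back in time; when that factor sits below 1 (as it typically does with a tanh unit and modest weights, an illustrative assumption here), the gradient shrinks geometrically:

```python
# Each step back in time scales the error by (recurrent weight * activation slope).
# With a tanh activation the slope is at most 1, so this factor is often well below 1.
per_step_factor = 0.9 * 0.65   # e.g. recurrent weight 0.9, a typical tanh derivative 0.65

gradient = 1.0
for steps_back in range(1, 31):
    gradient *= per_step_factor
    if steps_back in (1, 5, 10, 20, 30):
        print(f"{steps_back:2d} steps back: gradient scaled by {gradient:.2e}")

# Thirty steps back, the signal is scaled by roughly 1e-7: effectively "vanished",
# so the earliest inputs barely influence the weight updates at all.
```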

The Big Breakthrough: Long Short-Term Memory

Finally, in the late 90s, a major breakthrough solved the vanishing gradient problem and gave a second wind to recurrent network development. At the center of this new approach were units of long short-term memory (LSTM).

As weird as that sounds, the long and short of it is that LSTM made a world of difference in the field of AI. These new units, or artificial neurons, like the standard short-term memory units of RNNs, remember their inputs from a moment ago. However, unlike standard RNN units, LSTMs can hang on to their memories, which have read/write properties akin to memory registers in a conventional computer. Yet LSTMs have analog, rather than digital, memory, making their functions differentiable. In other words, their curves are continuous and you can find the steepness of their slopes. So they are a good fit for the partial differential calculus involved in backpropagation and gradient descent.
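For the curious, here is a compact sketch of what happens inside one LSTM unit at a single time step. The gate layout follows the standard formulation; the sizes and random weights are placeholders for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden, inputs = 8, 4
rng = np.random.default_rng(0)
# One weight matrix and bias per gate, each acting on [previous output, current input].
W = {g: rng.standard_normal((hidden, hidden + inputs)) * 0.1 for g in "fico"}
b = {g: np.zeros(hidden) for g in "fico"}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate: what to erase from memory
    i = sigmoid(W["i"] @ z + b["i"])        # input gate: what new information to write
    c_hat = np.tanh(W["c"] @ z + b["c"])    # candidate values to write
    o = sigmoid(W["o"] @ z + b["o"])        # output gate: what to reveal
    c_t = f * c_prev + i * c_hat            # the analog "memory register"
    h_t = o * np.tanh(c_t)                  # the unit's output at this step
    return h_t, c_t
```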

Altogether, LSTMs can not only tweak their weights, but retain, delete, transform and otherwise control the inflow and outflow of their stored data according to the quirks of their training. Most importantly, LSTMs can cling to important error information for long enough to keep gradients relatively steep and thus training periods relatively short. This wipes out the vanishing gradient problem and greatly improves the accuracy of today’s LSTM-based recurrent networks. Thanks to this remarkable improvement in the RNN architecture, Google, Apple and many other leading companies, not to mention startups, are now using RNNs to power applications at the center of their businesses. In short, RNNs are suddenly a big deal.

What to Remember about RNNs

Let’s recap the highlights of these amazing memory machines. Recurrent neural networks, or RNNs, can remember their former inputs, which gives them a big edge over other artificial neural networks when it comes to sequential, context-sensitive tasks such as speech recognition. However, the first generation of RNNs hit the wall when it came to their capacity to correct for errors through the all-important twin processes of backpropagation and gradient descent. Known as the dreaded vanishing gradient problem, this stumbling block virtually halted progress in the field until 1997, when a major breakthrough introduced a vastly improved LSTM-based architecture to the field. The new approach, which effectively turned each unit in a recurrent network into an analog computer, greatly increased accuracy and helped lead to the renaissance in AI we’re seeing all around us today.

If you have enjoyed this post, the biggest compliment you could give would be to share this with someone that you think would enjoy it!

If you would like to see more articles like this, click the subscribe button and never miss a post. Have a great day and never stop learning!

How do Computers See? — June 19, 2017

How do Computers See?

(This is part 3 in a series of posts on artificial intelligence and deep learning/neural networks. You can check out part 1 and part 2 if you haven’t yet read them and are new to AI)

There was a time when artificial intelligence was only home to our most creative imaginations. Yet, isn’t that where technology is born? In our imaginative minds? Though it is tempting to simply jump right into the technological advances that are the driving forces behind AI, we must first take a trip back in time and look at how far we have come since Samuel Butler wrote in 1872,

“There is no security against the ultimate development of mechanical consciousness, in the fact of machines possessing little consciousness now. A jellyfish has not much consciousness. Reflect upon the extraordinary advance which machines have made during the last few hundred years, and note how slowly the animal and vegetable kingdoms are advancing. The more highly organized machines are creatures not so much of yesterday, as of the last five minutes, so to speak, in comparison with past time.”

From Karel Capek’s 1920 play R.U.R., which depicted a race of self-replicating robot slaves who rose up and revolted against their human masters, to Star Trek’s android Data, humans have always imagined the day machines would become intelligent.

Today, not only is AI a reality, but it is changing the very way we live and work. From the AI in autonomous vehicles, which allows them to perceive and navigate the world around them, to Google’s AI Voice Assistant, we are unwittingly surrounded by artificial intelligence. The question most ask is, “How does it all work?”

I could not answer that in one article. I will, however, try to cover a small subset of AI today that has given computers an ability most humans take for granted, but would greatly miss if it were taken away…the power of sight!

The Problem

Why has recognizing an image been so hard for computers and so easy for humans? The answer boils down to the algorithms used for both. Algorithms? Wait, our brains don’t have algorithms, do they??

I, and many others, do believe our brains have algorithms…a set of laws (physics) that are followed, which allow our brain to take data from our senses and transform it into something our consciousness can classify and understand.

Computer algorithms for vision have been nowhere near as sophisticated as our biological algorithms. That is until now.

Artificial Neural Networks Applied to Vision

(If you haven’t been introduced to neural networks yet, please check out this post first to get a quick introduction to the amazing world of ANNs)

Artificial neural networks (ANNs) have been around for a while now, but recently a particular type of ANN has broken records in computer vision competitions and changed what we thought was possible in this problem space. We call this type of ANN a convolutional neural network.

Convolutional Neural Networks

Convolutional neural networks, also known as ConvNets or CNNs, are among the most effective computational models for performing certain tasks, such as pattern recognition. Yet, despite their importance, many aspiring developers struggle to understand just what CNNs are and how they work. To penetrate the mystery, we will work with the common application of CNNs to computer vision, which begins with a matrix of pixels. Then we’ll go layer by layer, and operation by operation, through the CNN’s deep structure, finally arriving at its output: the identification of a cloud, cat, tree, or whatever the CNN’s best guess is about what it’s witnessing.

High-Level Architecture of a CNN

CNNArchitecture

source: ResearchGate.com

Here you can see the conceptual architecture of a typical (simple) CNN. To come up with a reasonable interpretation of what it’s witnessing, a CNN performs four essential operations, each corresponding to a type of layer found in its network.

These four essential operations (illustrated above) in a CNN are:

  1. The Convolution Layer
  2. The ReLU Activation Function
  3. The Pooling/Subsampling Layer
  4. The Fully Connected ANN (Classification Layer)

The input is passed through each of these layers and will be classified in the output. Now let’s dig a little bit deeper into how each of these layers works.

The Input: A Matrix of Pixels

To keep things simple, we’ll only concern ourselves with the most common task CNNs perform: pattern or image recognition. Technically, a computer doesn’t see an image, but a matrix of pixels, each of which has three components: red, green and blue. Therefore, a 1,000-pixel image for us becomes 3,000 values for a computer, one intensity for each color component of each pixel. The result is a matrix of 3,000 precise intensities, which the computer must somehow interpret as one or more objects.
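In code, that pixel matrix is just a three-dimensional array of numbers. A minimal sketch, assuming NumPy and a made-up image:

```python
import numpy as np

# A tiny 20 x 50 "image": 1,000 pixels, each with red, green and blue intensities.
image = np.random.randint(0, 256, size=(20, 50, 3), dtype=np.uint8)

print(image.shape)   # (20, 50, 3)
print(image.size)    # 3000 individual intensity values for the computer to interpret
print(image[0, 0])   # the three channel intensities of the top-left pixel
```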

The Convolution Layer

The first key point to remember about the convolutional layer is that all of its units, or artificial neurons, are looking at distinct, but slightly overlapping, areas of the pixel matrix. Teachers and introductory texts often use the metaphor of parallel flashlight beams to help explain this idea. Suppose you have a parallel arrangement of flashlights with each of the narrow beams fixated on a different area of a large image, such as a billboard. The disk of light created by each beam on the billboard will overlap slightly with the disks immediately adjacent to it. The overall result is a grid of slightly overlapping disks of light.

featureMap

source: i.stack.imgur.com/GvsBA.jpg

The second point to remember about the convolution layer is that those units, or flashlights if you prefer, are all looking for the same pattern in their respective areas of the image. Collectively, the set of pattern-searching units in a convolutional layer is called a filter. The method the filter uses to search for a pattern is convolution.

The complete process of convolution involves some rather heavy mathematics. However, we can still understand it from a conceptual point of view, while only touching on the math in passing. To begin, every unit in a convolutional layer shares the same set of weights that it uses to recognize a specific pattern. This set of weights is generally pictured as a small, square matrix of values. The small matrix interacts with the larger pixel matrix that makes up the original image. For example, if the small matrix, technically called a convolution kernel, is a 3 x 3 matrix of weights, then it will cover a 3 x 3 array of pixels in the image. Naturally, there is a one-to-one relationship, in terms of size, between the 3 x 3 convolution kernel and the 3 x 3 section of the image it covers. With this in mind, you can easily multiply the weights in the kernel with the counterpart pixel-values in the section of the image at hand. The sum of those products, technically called the dot product, generates a single pixel value that the system assigns to that section of the new, filtered version of the image. This filtered image, known as the feature map, then serves as the input for the next layer in the ConvNet described below.
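To see the dot-product mechanics without the heavy math, here is a bare-bones convolution of a small grayscale matrix with a 3 x 3 kernel. Real CNNs also handle color channels, strides and padding, which this illustrative sketch leaves out:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small kernel over the image; each stop produces one value of the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            patch = image[row:row + kh, col:col + kw]       # the 3 x 3 section under the "flashlight"
            feature_map[row, col] = np.sum(patch * kernel)  # dot product of kernel and patch
    return feature_map

# A vertical-edge-detecting kernel applied to a toy 6 x 6 image.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])
image = np.tile([0, 0, 0, 9, 9, 9], (6, 1))   # dark left half, bright right half
print(convolve2d(image, kernel))               # the strongest responses sit on the edge
```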

It’s important to note at this point that units in a convolutional layer of a ConvNet, unlike units in a layer of a fully-connected network, are not connected to units in their adjacent layers. Rather, a unit in a convolutional layer is only connected to the set of input units it is focused on. Here, the flashlight analogy is again useful. You can think of a unit in a convolutional layer as a flashlight that bears no relation to the flashlights ahead of it, or behind it. The flashlight is only connected to the section of the original image that it lights up.

The ReLU Activation Function

The rectified linear unit, or ReLU, performs the rectification operation on the feature map, which is the output of the convolution layer. The rectification operation introduces real-world non-linearity into the CNN so that the network can be properly trained and tuned, using a feedback process known as back-propagation. Introducing non-linearity is important and powerful because it lets the network model problems (input parameters) that are inherently nonlinear by nature.

relufamily

source: datasciencecentral.com

Above you can see three different implementations of a ReLU activation function (the most basic being the plain ReLU). Different ReLU variants are used in different problems, depending on which one best captures the non-linearity of the input parameters.
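Here is a rough sketch of the plain ReLU alongside one common variant, the leaky ReLU; the exact variants and their parameters differ between papers and libraries:

```python
import numpy as np

def relu(x):
    """Plain ReLU: negative values are clipped to zero, positives pass through unchanged."""
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: negatives are scaled down instead of zeroed, keeping a small gradient alive."""
    return np.where(x > 0, x, slope * x)

feature_map = np.array([[-2.0, 0.5], [3.0, -0.1]])
print(relu(feature_map))        # [[0.  0.5]  [3.  0. ]]
print(leaky_relu(feature_map))  # [[-0.02   0.5 ]  [ 3.    -0.001]]
```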

The Pooling Layer

The more intricate the patterns the CNN searches for, the more convolution and ReLU layers are necessary. However, as layer after layer is progressively added, the computational complexity quickly becomes unwieldy.

pooling

source: wiki.tum.de

Another layer, called the pooling or subsampling layer, is now needed to keep the computational complexity from getting out of control. The pooling layer’s essential operation is downsampling: it shrinks each feature map, keeping only the most relevant information (typically the strongest response in each small window) and discarding the rest.
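A minimal sketch of 2 x 2 max pooling, the most common form of subsampling; it assumes the feature map’s sides divide evenly by two:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep only the strongest response in each 2 x 2 window, quartering the data volume."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 3]])
print(max_pool_2x2(fm))
# [[4 2]
#  [2 7]]
```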

The Classification Layer

Finally, the CNN requires one or more layers to classify the output of all previous layers into categories, such as cloud, cat, or tree.

The most obvious characteristic that distinguishes a classification layer from other layers in a CNN is that a classification layer is fully-connected. This means that it resembles a classic neural network (which we discussed in part 2), with the units in each layer connected to all of the units in their adjacent layers. Accordingly, classification layers often go by the name fully-connected layers, or FCs.

Depth and Complexity

Most CNNs are deep neural networks, meaning their architecture is quite complex, with dozens of layers. You might have, for example, a series of four alternating convolution and ReLU layers, followed by a pooling layer. Then this entire series of layers might, in turn, repeat several times before introducing a final series of fully-connected layers to classify the output.
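If you want to see such a stack spelled out, here is one way it might look in Keras (assuming TensorFlow is installed). The layer counts and sizes are arbitrary choices for illustration, not a recommended architecture:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),                # a 64 x 64 color image
    layers.Conv2D(32, (3, 3), activation="relu"),   # convolution + ReLU
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),                    # pooling keeps complexity in check
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),           # fully-connected classification layers
    layers.Dense(10, activation="softmax"),         # e.g. 10 categories: cloud, cat, tree, ...
])

model.summary()
```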

Unraveling the Mystery of CNNs

Convolutional neural networks are deep, complex computational models that are ideal for performing certain tasks, such as image recognition.

carExample

source: computervisionblog.com

To understand how a CNN recognizes a pattern in an image, it’s valuable to go step by step through its operations and layers, beginning with its input: a matrix of pixel values. The first layer is the convolution layer, which uses the convolution operation to multiply a specific set of weights, the convolution kernel, by various sections of the image in order to filter for a particular pattern. The next layer is the ReLU layer, which introduces nonlinearity into the system to properly train the CNN. There may be a series of several alternations between convolution and ReLU layers before we reach the next layer, the pooling layer, which restricts the output to the most relevant patterns. The entire series of convolution, ReLU and pooling layers may, in turn, repeat several times before we reach the final classification layers. These are fully-connected layers that classify the CNN’s output into likely categories, such as cloud, cat, tree, etc.

architectureEmergent

source: mdpi.com

This is just a high-level look at how a typical CNN is architected. There may be many variations that experts will use in practice to tune their network for their particular use cases. This is where the expertise comes into play. You may need to “tune” your network if the initial training does not produce results as accurate as you had hoped. This process is called “Hyperparameter Tuning” and I will have to write another whole article just covering that. For now, familiarize yourself with the basics of ANNs and CNNs and come back soon to read about hyperparameter tuning in the near future!

As always, thanks so much for reading! Please tell me what you think or would like me to write about next in the comments. I’m open to criticism as well!

If you take the time to “like” or “share” the article, that would mean a lot to me. I write for free on my own time because I enjoy talking about technology and the more people that read my articles, the more individuals I get to geek out with!

Thanks and have a great day!

From Fiction to Reality: A Beginner’s Guide to Artificial Neural Networks — June 12, 2017

From Fiction to Reality: A Beginner’s Guide to Artificial Neural Networks

Interest in artificial intelligence is reaching new heights. 2016 was a record year for AI startups and funding, and 2017 will certainly surpass it, if it hasn’t already. According to IDC, spending on cognitive systems and AI will rise more than 750% by 2020. Both interest and investment in AI span the full spectrum of the business and technology landscape, from the smallest startups to the largest corporations, from smartphone apps to public health safety systems. The biggest names in technology are all investing heavily in AI, while baking it into their business models and using it increasingly in their offerings: virtual assistants (e.g. Siri), computer vision, speech recognition, language translation and dozens of other applications. But what is the actual IT behind AI? In short: artificial neural networks (ANN). Here we take a look at what they are, how they work, and how they relate to the biological neural networks that inspired them.

Defining an Artificial Neural Network

The term artificial neural network is used either to refer to a mathematical model or to an actual program that mimics the essential computational features found in the neural networks of the brain.

The Neuron

neuron_anatomy
source: http://www.robots.ox.ac.uk

Although biological neurons are extremely complicated cells, their essential computational nature in terms of inputs and outputs is relatively straightforward. Each neuron has multiple dendrites and a single axon. The neuron receives its inputs from its dendrites and transmits its output through its axon. Both inputs and outputs take the form of electrical impulses. The neuron sums up its inputs, and if the total electrical impulse strength exceeds the neuron’s firing threshold, the neuron fires off a new impulse along its single axon. The axon, in turn, distributes the signal along its branching synapses which collectively reach thousands of neighboring neurons.

Biological vs Artificial Neurons

biologicalVsArtificialNeuron
source: DataCamp

There are a few basic similarities between neurons and transistors. They both serve as the basic unit of information processing in their respective domains; they both have inputs and outputs; and they both connect with their neighbors. However, there are drastic differences between neurons and transistors as well. Transistors are simple switches that implement basic logic, generally connected to no more than a handful of other transistors. Neurons, by contrast, are highly complex organic structures connected to roughly 10,000 other neurons. Naturally, this rich network of connections gives neurons an enormous advantage over transistors when it comes to performing cognitive feats that require thousands of parallel connections. For decades, engineers and developers have envisioned ways to capitalize on this advantage by making computers and applications operate more like brains. Finally, their ideas have made their way into the mainstream. Although transistors themselves will not look like neurons anytime soon, some of the AI software they run can now mimic basic neural processing, and it’s only getting more sophisticated.

Modeling the Neuron

The perceptron

neuron_model
source: http://cs231n.github.io/neural-networks-1/

The perceptron, or single-layer neural network, is the simplest model of neural computation, and is the ideal starting point to build upon. You can think of a perceptron as a single neuron. However, rather than having dendrites, the perceptron simply has inputs: x1, x2, x3,…,xN. Moreover, rather than having an axon, the perceptron simply has a single output: y = f(x).

Weights

Each of the perceptron’s inputs (x1, x2, x3,…,xN) has a weight (w1,w2,w3,…,wN). If a particular weight is less than 1, it will weaken its input. If it’s greater than 1, it will amplify it. In a slightly more complex, but widely-adopted, model of the perceptron, there is also a fixed input of 1 with its own weight b, called the bias, which shifts the perceptron’s firing threshold and is learned during training along with the other weights.
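Putting the inputs, weights, and bias together, a perceptron’s output takes only a couple of lines to compute. A small sketch with made-up numbers:

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum of the inputs plus the bias, passed through a step activation."""
    total = np.dot(w, x) + b
    return 1 if total > 0 else 0

x = np.array([0.5, -1.0, 2.0])   # inputs x1, x2, x3
w = np.array([0.8, 0.2, 1.5])    # weights w1, w2, w3
b = -2.0                         # bias shifts the firing threshold

print(perceptron(x, w, b))       # 1: the weighted sum (3.2) plus the bias (-2.0) clears zero
```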

The activation function

Also called a transfer function, the activation function determines the value of the perceptron’s output. The simplest form of activation function is a certain type of step function. It mimics the biological neuron firing upon reaching its firing threshold by outputting a 1 if the total input exceeds a given threshold quantity, and outputting a 0 otherwise. However, for a more realistic result, one needs to use a non-linear activation function. One of the most commonly used is the sigmoid function:

f(x)= 1 / (1+e^-x)

There are many variations on this basic formula that are in common use. However, all sigmoid functions will adopt some form of S-curve when plotted on a graph. When the input is a large negative number, the output is close to zero. As the input increases, the output rises along the S-curve, but eventually maxes out at a fixed value represented by a horizontal asymptote. This maximum output value reflects the maximum electrical impulse strength that a biological neuron can generate.
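Here are the step and sigmoid activations side by side, as a small sketch of the formulas above:

```python
import math

def step(total, threshold=0.0):
    """Fire (1) only when the summed input clears the threshold, like the simple model above."""
    return 1 if total > threshold else 0

def sigmoid(total):
    """Smooth S-curve between 0 and 1: large negative inputs give ~0, large positive inputs give ~1."""
    return 1.0 / (1.0 + math.exp(-total))

for z in (-6, -2, 0, 2, 6):
    print(z, step(z), round(sigmoid(z), 3))
# e.g. -6 -> 0 and 0.002, 0 -> 0 and 0.5, 6 -> 1 and 0.998
```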

Adding Hidden Layers

In more complex, realistic neural models, there are at least three layers of units: an input layer, an output layer, and one or more hidden layers. The input layer receives the raw data from the external world that the system is trying to interpret, understand, perceive, learn, remember, recognize, translate and so on. The output layer, by contrast, transmits the network’s final, processed response. The hidden layers that reside between the input and the output layers, however, serve as the key to the machine learning that drives the most advanced artificial neural networks.

Most modeling assumes that the respective layers are fully connected. In a fully connected neural network, all the units in one layer are connected to all the units in their neighboring layers.

 

Backpropagation

You can think of backpropagation as the process in neural networks that allows the network to learn. During backpropagation the network is in a continual process of training, learning, adjusting and fine-tuning itself until it gets closer to the intended output. Backpropagation optimizes by comparing the intended output to the actual output using a loss function. The result is an error value, or cost, which backpropagation uses to re-calibrate the network’s weights between neurons (to find the most relevant features and inputs that result in the desired output), usually with the help of the well-known gradient descent optimization algorithm. If there are hidden layers, then the algorithm re-calibrates the weights of all the hidden connections as well. After each round of re-calibration, the system runs again. As error rates get smaller, each round of re-calibration becomes more refined. This process may need to repeat thousands of times before the output of the backpropagation network closely matches the intended output. At this point, one can say that the network is fully trained.
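Here is a stripped-down version of that loop for a single sigmoid unit (effectively a one-layer network), with the gradient worked out by hand; deeper networks repeat the same re-calibration for every hidden connection. The tiny dataset and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny training set: learn the OR function of two inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 1, 1, 1], dtype=float)

weights = np.zeros(2)
bias = 0.0
learning_rate = 0.5

for epoch in range(5000):
    outputs = sigmoid(X @ weights + bias)       # forward pass: the actual output
    errors = outputs - targets                  # compare with the intended output
    # Gradient of the squared-error loss, propagated back through the sigmoid.
    grad_z = errors * outputs * (1.0 - outputs)
    weights -= learning_rate * (X.T @ grad_z)   # re-calibrate the weights
    bias -= learning_rate * grad_z.sum()

print(np.round(sigmoid(X @ weights + bias), 2))  # close to [0, 1, 1, 1] once trained
```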

Thinking It Through

With AI investment and development reaching new heights, this is an exciting time for AI enthusiasts and aspiring developers. However, it’s important to first take a good look at the IT behind AI: artificial neural networks (ANN). These computational models mimic the essential computational features found in biological neural networks. Neurons become perceptrons or simply units; dendrites become inputs; axons become outputs; electrical impulse strengths become connection weights; the neuron’s firing strength function becomes the unit’s activation function; and layers of neurons become layers of fully-connected units. Putting it all together, you can run your fully-trained feed-forward network as-is, or you can train and optimize your backpropagation network to reach a desired value. Soon you’ll be well on your way to your first image recognizer, natural language processor, or whatever new AI app you dream up.

Thanks for reading, and if you liked this please share this post or subscribe to my blog at JasonRoell.com or follow me on LinkedIn where I post about technology topics that I think are interesting for the general programmer or even technology enthusiast to know.

Have a great day and keep on learning!