The Curious Programmer

Software, Gadgets, Books, and All Things Geek

The Top 100 “AI” Terms Every Developer Needs to Know — June 1, 2023

The Top 100 “AI” Terms Every Developer Needs to Know

If you’re like me, you probably have a hard time keeping up with all the new buzzwords and acronyms that are popping up in the world of technology. Machine learning, artificial intelligence, deep learning, neural networks, natural language processing… the list goes on and on. But don’t worry, you’re not alone. In fact, according to a recent survey, only 17% of Americans can correctly define what artificial intelligence is. And that’s a problem.

Why? Because AI is not just some futuristic concept that only nerds and sci-fi fans care about. It’s a reality that is transforming every industry and every aspect of our lives. Whether you realize it or not, you’re already using AI every day. When you ask Siri or Alexa a question, when you scroll through your Facebook or Instagram feed, when you shop online or watch Netflix, when you use Google Maps or Uber, you’re interacting with AI. And that’s just the tip of the iceberg.

AI is also behind some of the most important innovations and breakthroughs of our time. It’s helping doctors diagnose diseases, farmers grow crops, teachers educate students, lawyers review contracts, artists create music, and scientists discover new planets. It’s also helping us tackle some of the biggest challenges facing humanity, such as climate change, poverty, hunger, and pandemics.

So what does this mean for you? It means that if you want to succeed in the new economy of AI, you need to familiarize yourself with the basic terminology and concepts of machine learning and artificial intelligence. You don’t need to become an expert programmer or machine learning engineer, but you do need to understand what AI can and cannot do, how it works, and how it affects you and your career.

That’s why I’ve created this blog post: to give you a quick and easy introduction to the most essential terms and concepts of machine learning and artificial intelligence. By the end of this post, you’ll be able to talk confidently about AI developments and techniques. You’ll also be able to spot the opportunities and challenges that AI presents for your industry and profession. And most importantly, you’ll be able to make informed decisions about how to leverage AI for your own benefit and growth.

So let’s get started!

The List

I’ve handpicked these as the most important and relevant terms at this point in time, favoring ones that are general rather than specific to particular areas of machine learning. I may choose to update this list as it (undoubtedly) changes. If I’ve missed any you believe should be included, please leave the term and a short definition in the comments and we’ll all be smarter for it!

I’ve tried to keep the definitions very “short and sweet” (there are entire books written on each of them), but I encourage you to dive deeper yourself if any of these catch your interest.

  1. Algorithm: A set of rules or instructions followed by the machine learning model to learn patterns in data.
  2. Artificial Intelligence (AI): The broad discipline of creating intelligent machines.
  3. Backpropagation: A method used in artificial neural networks to calculate the gradient that is needed in the calculation of the weights to be used in the network.
  4. Bias: The simplifying assumptions made by the model to make the target function easier to approximate.
  5. Big Data: Large amounts of data that traditional data processing software can’t manage.
  6. Binary Classification: A type of classification task where each input sample is classified into one of two possible categories.
  7. Boosting: A machine learning ensemble meta-algorithm for primarily reducing bias, and also variance in supervised learning.
  8. Categorical Data: Data that can be divided into multiple categories but has no inherent order or priority.
  9. Classification: A type of machine learning model that outputs one of a finite set of labels.
  10. Clustering: The task of dividing data points into groups such that points in the same group are more similar to one another than to points in other groups.
  11. Convolutional Neural Network (CNN): A type of artificial neural network that uses convolutional layers to filter inputs for useful information.
  12. Cross-Validation: A resampling procedure used to evaluate machine learning models on a limited data sample.
  13. Data Mining: The process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
  14. Data Preprocessing: The process of converting raw data into a well-readable format to be used by a machine learning model.
  15. Dataset: A collection of related sets of information composed of separate elements but can be manipulated as a unit by a computer.
  16. Deep Learning: A subfield of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks.
  17. Decision Trees: A decision support tool that uses a tree-like model of decisions and their possible consequences.
  18. Dimensionality Reduction: The process of reducing the number of random variables under consideration by obtaining a set of principal variables.
  19. Ensemble Learning: A machine learning paradigm where multiple models are trained to solve the same problem and combined to get better results.
  20. Epoch: One complete pass through the entire training dataset while training a machine learning model.
  21. Feature: An individual measurable property of a phenomenon being observed.
  22. Feature Engineering: The process of using domain knowledge to extract features from raw data via data mining techniques.
  23. Feature Extraction: The process of reducing the number of resources required to describe a large set of data.
  24. Feature Selection: The process of selecting a subset of relevant features for use in model construction.
  25. Gradient Descent: An optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient.
  26. Hyperparameter: A parameter whose value is set before the learning process begins.
  27. Imbalanced Data: A situation where the number of observations is not the same for the categories in a classification problem.
  28. K-Nearest Neighbors (K-NN): A simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems.
  29. Kernel: A function used in machine learning to transform data into a certain form.
  30. Label: The correct output or category assigned to a training example; the value the model learns to predict.
  31. Latent Variable: Variables in a statistical model that are not directly observed but are inferred or estimated from other variables that are observed.
  32. Linear Regression: A statistical method for predicting a real-valued output based on one or more input features.
  33. Logistic Regression: A classification algorithm used to predict a binary outcome based on a set of independent variables.
  34. Loss Function: A method of evaluating how well a specific algorithm models the given data.
  35. Machine Learning (ML): The scientific study of algorithms and statistical models that computer systems use to perform tasks without explicit instructions.
  36. Multi-Class Classification: A classification task with more than two classes.
  37. Naive Bayes: A classification technique based on the Bayes’ Theorem with an assumption of independence among predictors.
  38. Natural Language Processing (NLP): A field of AI that gives the machines the ability to read, understand, and derive meaning from human languages.
  39. Neural Network: A series of algorithms that endeavors to recognize underlying relationships in a set of data.
  40. Normalization: Adjusting values measured on different scales to a common scale.
  41. Outlier: A data point that differs significantly from other similar points.
  42. Overfitting: A modeling error which occurs when a function is too closely fit to a limited set of data points.
  43. Parameter: An internal characteristic or property of a model that the learning algorithm uses to make predictions.
  44. Perceptron: The simplest form of a neural network, used for binary classification.
  45. Precision: The number of True Positives divided by the number of True Positives and False Positives. It is a measure of a classifier’s exactness.
  46. Principal Component Analysis (PCA): A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables.
  47. Random Forest: An ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time.
  48. Recall: The number of True Positives divided by the number of True Positives and the number of False Negatives. It is a measure of a classifier’s completeness.
  49. Regression: A set of statistical processes for estimating the relationships among variables.
  50. Reinforcement Learning (RL): An area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
  51. Regularization: A technique used to prevent overfitting by adding an additional penalty to the loss function.
  52. ReLu (Rectified Linear Unit): A commonly used activation function in neural networks and deep learning models.
  53. RNN (Recurrent Neural Network): A type of artificial neural network designed to recognize patterns in sequences of data, such as text, genomes, handwriting, or the spoken word.
  54. Semi-Supervised Learning: Machine learning techniques that involve training using a small amount of labeled data and a large amount of unlabeled data.
  55. SGD (Stochastic Gradient Descent): A simple and very efficient approach to fitting linear classifiers and regressors under convex loss functions such as (linear) Support Vector Machines and Logistic Regression.
  56. Supervised Learning: A type of machine learning model that makes predictions based on a set of labeled examples.
  57. Support Vector Machine (SVM): A type of machine learning model used for classification and regression analysis.
  58. TensorFlow: An open-source software library for machine learning and artificial intelligence.
  59. Time Series Analysis: Techniques used to analyze time series data in order to extract meaningful statistics and other characteristics of the data.
  60. Transfer Learning: A machine learning method where a pre-trained model is used as the starting point for a different but related problem.
  61. Underfitting: A modeling error which occurs when a function is too loosely fit to the data.
  62. Unsupervised Learning: A type of machine learning model that makes predictions based on a set of unlabeled examples.
  63. Validation Set: A subset of the data set aside to adjust a model’s hyperparameters or to guide model selection.
  64. Variable: Any characteristic, number, or quantity that can be measured or counted.
  65. Weights: The parameters in a model that the machine learning algorithm learned.
  66. XGBoost: An open-source software library which provides a gradient boosting framework for C++, Java, Python, R, and Julia.
  67. Zero-Shot Learning: A machine learning concept where a model is able to predict classes that were not seen during training.
  68. Autoencoder: A type of artificial neural network used for learning efficient codings of input data.
  69. Batch Normalization: A technique for improving the performance and stability of artificial neural networks.
  70. Bias-Variance Tradeoff: The property of a model that the variance of the parameter estimates across samples can be reduced by increasing the bias in the estimated parameters.
  71. GAN (Generative Adversarial Network): An algorithmic architecture used in unsupervised learning, particularly to generate synthetic instances of data that can pass for real data.
  72. Genetic Algorithm: A method for solving both constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution.
  73. Grid Search: An approach to parameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.
  74. Imputation: The process of replacing missing data with substituted values.
  75. LSTM (Long Short-Term Memory): A type of recurrent neural network capable of learning order dependence in sequence prediction problems.
  76. Multilayer Perceptron (MLP): A class of feedforward artificial neural network.
  77. One-Hot Encoding: A process of converting categorical data variables so they can be provided to machine learning algorithms to improve predictions.
  78. Overfitting: A modeling error which occurs when a function is too closely fit to a limited set of data points.
  79. Polynomial Regression: A type of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial.
  80. Quantum Machine Learning: The interdisciplinary area combining quantum physics and machine learning.
  81. Q-Learning: A reinforcement learning technique used to find the optimal action-selection policy using a q function.
  82. Regular Expression (RegEx): A sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern.
  83. Reinforcement Learning: An area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
  84. Sequential Model: A type of model used in machine learning which consists of a linear stack of layers.
  85. Softmax Function: A function that takes an N-dimensional vector of real numbers and transforms it into a vector of real numbers in the range (0, 1) that add up to 1.
  86. State-Action-Reward-State-Action (SARSA): An algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning.
  87. T-distributed Stochastic Neighbor Embedding (t-SNE): A machine learning algorithm for visualization, based on Stochastic Neighbor Embedding, developed by Laurens van der Maaten and Geoffrey Hinton.
  88. Univariate Analysis: The simplest form of analyzing data. “Uni” means “one”, so in other words, your data has only one variable.
  89. Variance: A statistical measurement of the spread between numbers in a data set.
  90. Word2Vec: A group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.
  91. Yann LeCun: A computer scientist with contributions to convolutional neural networks and other areas of machine learning and computational neuroscience.
  92. Z-score: The number of standard deviations by which the value of a raw score is above or below the mean value of what is being observed or measured.
  93. One-shot Learning: The object categorization problem when only one single training example is given.
  94. Manifold Learning: A class of unsupervised estimators for non-linear dimensionality reduction.
  95. Denoising Autoencoder: A type of autoencoder, which is designed to remove noise from data.
  96. Curse of Dimensionality: A term that is used to describe the difficulty of training models on data with high dimensionality (large number of features).
  97. Collaborative Filtering: A technique used by some recommendation systems. In collaborative filtering, algorithms are used to make automatic predictions about the interests of a user by collecting preferences from many users.
  98. Multi-task Learning: A type of machine learning where multiple learning tasks are solved at the same time while exploiting commonalities and differences across tasks.
  99. Perceptual Hashing (pHash): A technique to convert multimedia content (images, text, video) into a manageable hash value.
  100. Generative Model: A type of machine learning model that generates new data that is similar to the training data.
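To make a few of these terms concrete, here’s a minimal, illustrative sketch (in Python with NumPy, my choice rather than anything the list prescribes) that ties together datasets, features, labels, weights, hyperparameters, epochs, a loss function, and gradient descent in a simple linear regression:

# A tiny, illustrative sketch tying together several terms from the list:
# dataset, features, labels, weights, hyperparameters, epochs, loss, gradient descent.
import numpy as np

rng = np.random.default_rng(0)

# Dataset: 100 samples with one feature, labels generated by y = 3x + 2 + noise
X = rng.uniform(-1, 1, size=(100, 1))          # features
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, 100)  # labels

w, b = 0.0, 0.0       # weights (parameters the algorithm will learn)
learning_rate = 0.1   # a hyperparameter, set before training begins

for epoch in range(200):              # one epoch = one pass over the training data
    predictions = w * X[:, 0] + b
    error = predictions - y
    loss = np.mean(error ** 2)        # loss function: mean squared error
    # Gradient descent: nudge the weights opposite the gradient of the loss
    w -= learning_rate * np.mean(2 * error * X[:, 0])
    b -= learning_rate * np.mean(2 * error)

print(f"learned w={w:.2f}, b={b:.2f}, final loss={loss:.4f}")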

That should get you started! If you liked this, subscribe to get the latest content on AI and Engineering! Cheers!

🔮 Unveiling the Hidden Magic: How a URL Transforms into a Webpage 🧙 — March 17, 2023

🔮 Unveiling the Hidden Magic: How a URL Transforms into a Webpage 🧙

Hello, dear readers! Today I’m going to explain one of the most basic and fascinating concepts of the web world: what happens when you navigate to a URL in a browser and hit “Enter”. You probably do this every day without thinking much about it, but behind the scenes there is a lot of magic going on. Let’s dive into it!

What is a URL?

First of all, let’s clarify what a URL is. URL stands for Uniform Resource Locator and it is basically an address that tells your browser where to find the information you want on the Internet. A URL has different parts that have different meanings. For example:

https://example.com/page1

In this URL, the first part https tells your browser which protocol to use for communication. A protocol is a set of rules that define how data is exchanged over the network. There are different protocols for different purposes, such as http, https, ftp, etc. In this case, https means that the communication will be secure and encrypted.

The second part example.com is called the domain name and it identifies the server that hosts the website you want to visit. A server is a powerful computer that stores web files and responds to requests from browsers. Each server has a unique address called an IP address that consists of four numbers separated by dots, such as 203.0.113.0. However, these numbers are hard to remember and type, so we use domain names instead.

The third part /page1 is called the path and it specifies which page or resource on the website you want to access. A website can have multiple pages or resources such as images, videos, scripts, etc., each with its own path.

What happens when you hit “Enter”?

Now that we know what a URL is made of, let’s see what happens when you hit “Enter” after typing it in your browser.

Step 1: DNS lookup

The first thing your browser does is to look up the IP address of the domain name using a service called DNS (Domain Name System). DNS is like a phone book for the Internet that maps domain names to IP addresses. Your browser contacts a DNS server (usually provided by your Internet Service Provider) and asks for the IP address of example.com. The DNS server responds with something like 203.0.113.0.
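If you want to see this step for yourself, here’s a minimal Python sketch (my choice of language for illustration; it needs network access) that asks your configured resolver for a domain’s IP address:

import socket

# Ask the system's DNS resolver for the IP address behind a domain name,
# roughly what the browser does in this step.
ip_address = socket.gethostbyname("example.com")
print(ip_address)   # prints whatever address the resolver returns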

Step 2: TCP connection

The next thing your browser does is to establish a TCP (Transmission Control Protocol) connection with the server at 203.0.113.0. TCP is another protocol that ensures reliable and ordered delivery of data over the network. Your browser initiates a three-way handshake with the server:

  • Your browser sends a SYN (synchronize) packet to the server asking for permission to start communication.
  • The server replies with a SYN-ACK (synchronize-acknowledge) packet granting permission.
  • Your browser sends an ACK (acknowledge) packet back confirming receipt.

This way, both your browser and the server agree on some parameters such as port numbers and sequence numbers for data transmission.
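As a rough illustration (again in Python, with the handshake itself handled by the operating system), opening a TCP connection to a web server looks something like this:

import socket

# Open a TCP connection to the server on port 443 (the HTTPS port).
# The SYN / SYN-ACK / ACK handshake happens under the hood in the OS.
sock = socket.create_connection(("example.com", 443), timeout=5)
print("connected to", sock.getpeername())
sock.close()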

Step 3: HTTPS handshake

If your URL starts with https, then your browser also performs an HTTPS (Hypertext Transfer Protocol Secure) handshake with the server before sending any data. HTTPS adds another layer of security on top of TCP by encrypting all data using SSL/TLS (Secure Sockets Layer/Transport Layer Security) protocols.

Your browser initiates an HTTPS handshake with these steps:

  • Your browser sends a ClientHello message to the server indicating its supported SSL/TLS versions and cipher suites (encryption algorithms).
  • The server replies with a ServerHello message choosing one SSL/TLS version and cipher suite from those offered by your browser.
  • The server also sends its digital certificate signed by a trusted Certificate Authority (CA) proving its identity.
  • Your browser verifies the certificate against its list of trusted CAs and checks if it matches with example.com.
  • If everything checks out, your browser generates a random symmetric key for encryption and sends it to the server encrypted with its public key.
  • The server decrypts this key using its private key and sends back an encrypted Finished message indicating readiness.
  • Your browser decrypts this message using its symmetric key and sends back another encrypted Finished message confirming completion.

This way, both your browser and the server agree on an encryption key for secure communication.
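Here’s a small Python sketch of the same idea (just an illustration, using the standard ssl module): it negotiates the protocol version and cipher suite and verifies the server’s certificate against the system’s trusted CAs, so you don’t have to implement the handshake yourself:

import socket
import ssl

# Wrap a TCP socket in TLS: version/cipher negotiation and certificate
# verification are handled by the ssl module and the system's trusted CAs.
context = ssl.create_default_context()
with socket.create_connection(("example.com", 443)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname="example.com") as tls_sock:
        print(tls_sock.version())   # e.g. TLSv1.3
        print(tls_sock.cipher())    # the negotiated cipher suite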

Step 4: HTTP request

Now that your browser has established both TCP and HTTPS connections, it’s time to send the actual HTTP (Hypertext Transfer Protocol) request to the server. The request contains the following information:

  • The HTTP method (usually GET for retrieving data or POST for submitting data)
  • The path of the resource you want to access (/page1)
  • The HTTP version (usually HTTP/1.1 or HTTP/2)
  • Additional headers that provide more information about your browser, the type of content it accepts, cookies, etc.

Here’s an example of an HTTP GET request:

GET /page1 HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8

Step 5: Server processing

Upon receiving the HTTP request, the server processes it and generates an appropriate response. This may involve querying databases, executing server-side scripts, or fetching static files, depending on the requested resource. Once the server has prepared the response, it sends it back to your browser over the established TCP and HTTPS connections.

Step 6: HTTP response

The server’s response is also an HTTP message with the following information:

  • The HTTP version (e.g., HTTP/1.1 or HTTP/2)
  • The status code indicating the result of the request (e.g., 200 OK for success, 404 Not Found for a missing resource)
  • Additional headers providing more information about the server, the content type, the content length, etc.
  • The actual content (HTML, images, videos, etc.) of the requested resource

Here’s an example of an HTTP 200 OK response:

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 12345

<!DOCTYPE html>
<html>
<head>
<title>Page 1</title>
...
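If you’d rather not format these messages by hand, Python’s standard http.client module can send the request and parse the response for you (shown here purely as an illustration; a real browser does far more):

import http.client

# Send the GET request and read the response; note that /page1 on the real
# example.com will most likely return 404 rather than the 200 shown above.
conn = http.client.HTTPSConnection("example.com")
conn.request("GET", "/page1", headers={"Accept": "text/html"})
response = conn.getresponse()

print(response.status, response.reason)     # status code and reason phrase
print(response.getheader("Content-Type"))   # one of the response headers
body = response.read()                      # the body (HTML, images, etc.)
conn.close()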

Step 7: Rendering the page

Now that your browser has received the response, it starts parsing the HTML content and rendering the page on your screen. This involves several sub-steps:

  1. The browser builds the DOM (Document Object Model), a tree-like structure representing the HTML elements and their hierarchy.
  2. The browser retrieves and applies CSS (Cascading Style Sheets) rules to style the DOM elements.
  3. The browser executes JavaScript code (if any) that may manipulate the DOM, fetch additional resources, or provide interactivity.
  4. The browser calculates the layout and position of each DOM element based on the CSS rules and the available screen space.
  5. The browser paints the final representation of the page on your screen, including images, videos, and other media elements.
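To make sub-step 1 a little more concrete, here’s a toy Python sketch (using the standard html.parser module; nothing like a real browser engine) that walks an HTML snippet and prints the element nesting it implies:

from html.parser import HTMLParser

# Print the nesting of HTML tags, a very rough stand-in for building the DOM.
class TreePrinter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        print("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

TreePrinter().feed("<html><head><title>Page 1</title></head><body><p>Hi</p></body></html>")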

Step 8: Closing the connection

Once the page is fully rendered, your browser may close the TCP and HTTPS connections to the server, unless you have enabled HTTP Keep-Alive or are using HTTP/2 multiplexing features that allow multiple requests and responses to share the same connection.

And that’s it! You’ve now seen the intricate dance of all the underlying events that occur every time you request more cat memes from a site…

Well, sort of 😬

In actuality, this is still a relatively high level view of all of the technologies, events, and protocols that make the internet possible. We didn’t cover a lot of interesting details that allow for each of the topics explained above to even be possible. But this is a blog post… not a book!

If you are interested in learning all of those fascinating details in a fun and engaging “comic book” style type of writing, then check out this book on How the Internet Really Works, which I highly recommend:

How the Internet Really Works


It’s amazing how much is going on behind the scenes, and understanding these details can help you appreciate the marvel of modern web technologies. So the next time you visit a website, remember the intricate ballet of protocols, connections, and data transfers that make it all possible.

Quantum Computing and AI Tie the Knot — April 13, 2018

Quantum Computing and AI Tie the Knot

In 2018, quantum technicians and daring developers are using quantum algorithms to transform the field of artificial neural network optimization: the bee’s knees of machine learning and AI. So we can say with some confidence that, thanks to quantum algorithms, the futures of quantum computing and artificial intelligence are hopelessly entangled. So let’s take a deep dive into the quantum algorithms that are making waves in the digital age. I’ll be paying special attention to quantum annealing (rhymes with feeling), a unique animal that seems to thrive in an AI-rich area where classical algorithms often struggle or altogether fail: training artificial neural networks.

Trouble training your neural net? Join the club…

Rather amazingly, you can train artificial neural nets such as RNNs and CNNs to get wise and not make the same mistake twice. It’s this power to follow Esther Dyson’s advice that makes neural nets the intelligence engine that drives machine learning and AI. That said, training neural networks is a notoriously tricky task. But this hasn’t stopped researchers and coders from working furiously over the last few years to find new ways to reduce training errors with bleeding-edge optimization algorithms. The first stab at the error-reduction problem is best known as hill climbing. Let’s run through it.

Hill climbing

Optimization algorithms that belong to the hill climbing club always check for the gradient (more or less the steepness of a graphed function’s slope) before making their next move. But this runs a real risk of missing out on the real action going on in the graph’s landscape. Two enemies hill climbers often find themselves facing are the plateau problem and the local minima problem. In a word, these problems are the hiker’s equivalent to getting lost in a mirage-riddled desert, or getting stuck in a small muddy valley. But let’s dig deeper…


The plateau problem

When an optimization process enters a plateau, it means it’s getting roughly the same output (y) for every input (x). Because the slope of the function is at or near zero for long flat stretches, an optimization algorithm can run out of time before it finds the edge. And like a shimmering desert mirage, the long stretches of flat function can create the illusion that you’ve reached an optimal state (in this case the global minimum) when you’re nowhere near it.


The local minima problem

A local minimum is a relatively small valley in the graph of a function whose deepest and most important valley lies elsewhere. You can think of the optimization process (when it’s searching for the lowest value in a function) as a beach ball: it will roll downhill and eventually stop at the lowest point in the immediate landscape, even if there’s a much deeper valley on the other side of a nearby hill. That’s the problem.


The most exciting solution

There are a number of alternatives to conventional hill climbing that can help you get out of the dreaded valleys posed by the local minima problem and the desert mirages posed by the plateau problem. But for the purposes at hand, let’s just focus on the most exciting solution: simulated annealing. This is a wild breed of optimization animal that is tackling valleys and plateaus in a computationally clever way that’s well worth at least a couple paragraphs of pondering…

The hottest and coolest classical optimization algorithm around

To cut to the chase, simulated annealing steals from physics to tie time and temperature together in a single elegant algorithm. Yes, you read that right: an algorithm with a temperature parameter. When you run a simulated annealing algorithm, it begins with a completely random, frenetic series of selections from the entire landscape of the function at hand. This is the hot phase of the process. But as the temperature parameter drops with time, the random selections cover an ever-narrower range of the landscape. Finally, we enter the cool phase of the process as the algorithm begins to home in on (with a little luck) the deepest valley or the highest peak, where the holy grail of optimization lies: the global minimum or maximum.

Although the code found in simulated annealing algorithms generally contains some heavy math, the underlying connection between time and temperature is quite easy to grasp. Just picture something hot moving wildly throughout an unknown landscape, hitting everything in sight and reporting back a very rough picture of the lay of the land. Then you can think of progressively colder things moving ever-more slowly and cautiously through an ever-narrower region of the landscape, documenting the details as they creep further down into the deepest valley or up onto the highest peak… Okay, if you’re still not sure what on earth I’m talking about, here’s an excellent animation that should do the trick.
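If the animation isn’t enough, here’s a bare-bones Python sketch of the idea (the function, starting point, and cooling schedule are all made up for illustration): random moves are accepted freely while the temperature is high, and uphill moves are accepted less and less often as things cool down:

import math
import random

def f(x):
    # A bumpy, made-up landscape with several local minima
    return x ** 2 + 10 * math.sin(x)

x = random.uniform(-10, 10)   # start somewhere random in the landscape
temperature = 10.0

while temperature > 1e-3:
    candidate = x + random.uniform(-1, 1)
    delta = f(candidate) - f(x)
    # Always accept downhill moves; accept uphill moves with a probability
    # that shrinks as the temperature drops.
    if delta < 0 or random.random() < math.exp(-delta / temperature):
        x = candidate
    temperature *= 0.99       # the cooling schedule

print(f"settled near x = {x:.2f}, f(x) = {f(x):.2f}")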

Quantum annealing (oh, what a feeling)

Simulated annealing can often get you out of a pinch when the other alternatives to conventional hill climbing come up short. But it’s an extremely specialized approach, and it suffers from at least one chilling drawback: you have to run the algorithm for an infinite amount of time to smoothly reach absolute zero and thus guarantee that you reach the true global minimum or maximum in the energy landscape. Since you probably don’t have an eternity to spare, you will never really know if your optimization solution is caught in yet another trap.

Enter quantum annealing. First, it’s important to keep in mind that quantum annealing algorithms in their basic form are remarkably similar to simulated annealing algorithms. Why? Because quantum tunneling strength plays the same role in quantum annealing as temperature does in simulated annealing. As time passes, the quantum tunneling strength in the quantum annealer drops dramatically, just as the temperature in the simulated annealer drops dramatically. It’s also easy to visualize the similarity between tunneling-strength and temperature. As time passes and quantum tunneling strength decreases, the system gets cozier and cozier with each progressively deeper valley in the energy landscape, and less and less inclined to tunnel its way out. Eventually, it gives up tunneling altogether when it finds itself (ideally) at the bottom of the deepest and coziest valley in the energy landscape (AKA the global minimum).


Not your grandmother’s quantum computer

The first difference you’re bound to notice between relatively conventional quantum computers and quantum annealing computers is the number of qubits they use. While the state-of-the-art in conventional quantum computers is pushing a few dozen qubits in 2018, the leading quantum annealer has more than 2000 qubits. Of course, the trade-off is that quantum annealers are not universal but specialized quantum computers that technically tackle only optimization problems and sampling problems. Because solving optimization problems is considered one of the key paths to the AI promised land, I’m going to focus on it from here on out.

A dizzying state of disarray

Before we apply the quantum annealing algorithm to the pool of qubits in our quantum annealer, they’re a mess: a maximally cloudy and unconnected configuration. This means we start out knowing nothing about the quantum system, which may be in any of 2^n different states (where n is the number of qubits). For a quantum annealer with 2000 qubits, that’s a crazy number of possible states. If you have any doubts about that, try plugging 2²⁰⁰⁰ into your favorite calculator for a second opinion.

The quantum wishing well

Individual qubits always start out in an initial state of cloudy superposition that places them at the minimum possible energy. Physicists like to visualize this lowest-energy state as the bottom of a quantum potential well that looks sort of like a big letter U.

U

0/1

Then quantum annealing comes along and forces the state of superposition into two halves, two states, two bottoms of the well: 0 and 1. The result looks more like a big letter W:

W

0 1

The next step for the quantum annealer is to start loading the dice to favor the house in the quantum probability game.

Biases

With the help of an applied magnetic field, the quantum annealer nudges each qubit into being heavily biased toward 0 or 1: favoring either the first or the second dip in the W above.

Couplings

While quantum annealers are loading the dice (that is, individual qubits) with biases via magnetic fields, they are also busy tying together pairs of dice with theoretical threads via couplers. Specifically, a coupler can do one of two things. It can guarantee that a pair of qubits are always in the same state: either both 0 or both 1. Or it can ensure that two neighboring qubits are always in the opposite state: 0 and 1, or 1 and 0. The quantum coupler uses (surprise, surprise) quantum entanglement to tie qubits together and create the couplings.

Sculpting an energy landscape

As an aspiring developer working with a quantum annealer, it’s your job to essentially load all the quantum dice by coding a collection of biases and couplings that define the optimization problem you want your trusty annealer to solve. Another way to look at it is that you are sculpting, or at least generating, a sophisticated energy landscape of peaks and valleys that represent all possible outcomes in your optimization problem. Then you are setting the quantum annealer loose to search and ferret out the very bottom of that energy landscape’s deepest valley, which corresponds to the optimal solution. If you’re consistently successful, then your quantum annealing prowess may help power a new generation of machine learning and AI for posterity.
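To give a flavor of what “coding a collection of biases and couplings” might look like, here’s a plain-Python toy (no quantum hardware or vendor SDK involved; the qubit names and numbers are invented) that defines a tiny energy landscape and brute-forces its lowest-energy state, the valley a real annealer is supposed to find physically:

from itertools import product

# A toy energy landscape defined by per-qubit biases and pairwise couplings.
biases = {"q0": 0.5, "q1": -0.3, "q2": 0.2}
couplings = {("q0", "q1"): -1.0, ("q1", "q2"): 0.8}

def energy(state):
    # Ising-style energy: bias terms plus coupling terms over spin values of -1/+1
    e = sum(biases[q] * state[q] for q in biases)
    e += sum(j * state[a] * state[b] for (a, b), j in couplings.items())
    return e

# Brute-force every assignment of -1/+1 spins and keep the lowest-energy one.
best = min(
    (dict(zip(biases, spins)) for spins in product([-1, 1], repeat=len(biases))),
    key=energy,
)
print(best, energy(best))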

Quantum computing and AI news

On August 31st, 2017, the Universities Space Research Association (USRA) announced that in partnership with NASA and Google it had upgraded the quantum annealing computer at the Quantum Artificial Intelligence Lab (Quantum AI Lab) to a D-Wave 2000Q. With nearly twice as many qubits as its predecessor and a new knack for “adiabatic quantum computing,” the latest D-Wave is going after bigger fish in the optimization-problem pond. The USRA team even has its eye on using quantum algorithms and the D-Wave to tackle “challenging computational problems involved in NASA missions.” Partner Google, on the other hand, has its eye on AI:

“We are particularly interested in applying quantum computing to artificial intelligence and machine learning.”

But it’s not just Google and NASA that have access to the Quantum AI Lab. Believe it or not, you may too. If you’re a qualified candidate, you might just get some quality time with the latest D-Wave to try out your genius idea. In the Lab’s own words, “the call is open.”

If you liked this article I would be super excited if you could share it with your curious friends. Anyway, thanks again for reading and have a great day!

Demystifying Quantum Gates — One Qubit At A Time — February 27, 2018

Demystifying Quantum Gates — One Qubit At A Time

(I’ve written an introduction to quantum computing found here. If you are brand new to the field, it will be a better place to start.)

If you want to get into quantum computing, there’s no way around it: you will have to master the cloudy concept of the quantum gate. Like everything in quantum computing, not to mention quantum mechanics, quantum gates are shrouded in an unfamiliar fog of jargon and matrix mathematics that reflects the quantum mystery. My goal in this post is to peel off a few layers of that mystery. But I’ll save you the suspense: no one can get rid of it completely. At least, not in 2018. All we can do today is reveal the striking similarities and alarming differences between classical gates and quantum gates, and explore the implications for the near and far future of computing.

Classical vs quantum gates: comparing the incomparable?

Striking similarities

If nothing else, classical logic gates and quantum logic gates are both logic gates. So let’s start there. A logic gate, whether classical or quantum, is any physical structure or system that takes a set of binary inputs (whether 0s and 1s, apples and oranges, spin-up electrons and spin-down electrons, you name it) and spits out a single binary output: a 1, an orange, a spin-up electron, or even one of two states of superposition. What governs the output is a Boolean function. That sounds fancy and foreboding, but trust me, it’s not. You can think of a Boolean function as nothing more than a rule for how to respond to Yes/No questions. It’s as simple as that. The gates are then combined into circuits, and the circuits into CPUs or other computational components. This is true whether we’re talking about Babbage’s Difference Engine, ENIAC, retired chess champion Deep Blue, or the latest room-filling, bone-chilling, headline-making quantum computer.

Alarming differences

Classical gates operate on classical bits, while quantum gates operate on quantum bits (qubits). This means that quantum gates can leverage two key aspects of quantum mechanics that are entirely out of reach for classical gates: superposition and entanglement. These are the two concepts that you’ll hear about most often in the context of quantum computing, and here’s why. But there’s a lesser known concept that’s perhaps equally important: reversibility. Simply put, quantum gates are reversible. You’ll learn a lot about reversibility as you go further into quantum computing, so it’s worth really digging into it. For now, you can think of it this way — all quantum gates come with an undo button, while many classical gates don’t, at least not yet. This means that, at least in principle, quantum gates never lose information. Qubits that are entangled on their way into the quantum gate remain entangled on the way out, keeping their information safely sealed throughout the transition. Many of the classical gates found in conventional computers, on the other hand, do lose information, and therefore can’t retrace their steps. Interestingly enough, that information is not ultimately lost to the universe, but rather seeps out into your room or your lap as the heat in your classical computer.

V is for vector

We can’t talk about quantum gates without talking about matrices, and we can’t talk about matrices without talking about vectors. So let’s get on with it. In the language of quantum mechanics and computing, vectors are depicted in an admittedly pretty weird package called a ket, which comes from the second half of the word bra-ket. And they look the part. Here’s a ket vector: |u>, where u represents the values in the vector. For starters, we’ll use two kets, |0> and |1>, which will stand in for qubits in the form of electrons in the spin-up (|0>) and spin-down (|1>) states. These vectors can hold any number of components, so to speak. But in the case of a binary state such as a spin up/down electron qubit, they have only two. So instead of looking like towering column vectors, they just look like numbers stacked two-high. Here’s what |0> looks like:

/ 1 \

\ 0 /

Now, what gates/matrices do is transform these states, these vectors, these kets, these columns of numbers, into brand new ones. For example, a gate can transform an up-state (|0>) into a down state (|1>), like magic:

/ 1 \ → / 0 \

\ 0 / \ 1 /

M is for matrix

This transformation of one vector into another takes place through the barely understood magic of matrix multiplication, which is completely different than the kind of multiplication we all learned in pre-quantum school. However, once you get the hang of this kind of math, it’s extremely rewarding, because you can apply it again and again to countless otherwise incomprehensible equations that leave the uninitiated stupefied. If you need some more motivation, just remember that it was through the language of matrix mathematics that Heisenberg unlocked the secrets of the all-encompassing uncertainty principle.

All the same, if you’re not familiar with this jet-fuel of a mathematical tool, your eyes will glaze over if I start filling this post with big square arrays of numbers at this point. And we can’t let that happen. So let’s wait a few more paragraphs for the matrix math and notation. Suffice it to say, for now, that we generally use a matrix to stand-in for a quantum gate. The size and outright fear-factor of the matrix will depend on the number of qubits it’s operating on. If there’s just one qubit to transform, the matrix will be nice and simple, just a 2 x 2 array with four elements. But the size of the matrix balloons with two, three or more qubits. This is because a decidedly exponential equation that’s well worth memorizing drives the size of the matrix (and thus the sophistication of the quantum gate):

2^n x 2^n = the total number of matrix elements

Here, n is the number of qubits the quantum gate is operating on. As you can see, this number goes through the roof as the number of qubits (n) increases. With one qubit, it’s 4. With two, it’s 16. With three, it’s 64. With four, it’s… hopeless. So for now, I’m sticking to one qubit, and it’s got Pauli written all over it.

The Pauli gates

The Pauli gates are named after Wolfgang Pauli, who not only has a cool name, but has managed to immortalize himself in two of the best-known principles of modern physics: the celebrated Pauli exclusion principle and the dreaded Pauli effect.

The Pauli gates are based on the better-known Pauli matrices (aka Pauli spin matrices) which are incredibly useful for calculating changes to the spin of a single electron. Since electron spin is the favored property to use for a qubit in today’s quantum gates, Pauli matrices and gates are right up our alley. In any event, there’s essentially one Pauli gate/matrix for each axis in space (X, Y and Z).

So you can picture each one of them wielding the power to change the direction of an electron’s spin along their corresponding axis in 3D space. Of course, like everything else in the quantum world, there’s a catch: this is not our ordinary 3D space, because it includes an imaginary dimension. But let’s let that slide for now, shall we?

Mercifully, the Pauli gates are just about the simplest quantum gates you’re ever going to meet. (At least the X and Z-gates are. The Y is a little weird.) So even if you’ve never seen a matrix in your life, Pauli makes them manageable. His gates act on one, and only one, qubit at a time. This translates to simple, 2 x 2 matrices with only four elements apiece.

The Pauli X-gate

The Pauli X-gate is a dream come true for those that fear matrix math. No imaginary numbers. No minus signs. And a simple operation: negation. This is only natural, because the Pauli X-gate corresponds to a classical NOT gate. For this reason, the X-gate is often called the quantum NOT gate as well.

In an actual real-world setting, the X-gate generally turns the spin-up state |0> of an electron into a spin-down state |1> and vice-versa.

|0>   -->   |1>   OR   |1> --> |0>

A capital “X” often stands in for the Pauli X-gate or matrix itself. Here’s what X looks like:

/ 0 1 \

\ 1 0 /

In terms of proper notation, applying a quantum gate to a qubit is a matter of multiplying a ket vector by a matrix. In this case, we are multiplying the spin-up ket vector |0> by the Pauli X-gate or matrix X. Here’s what X|0> looks like:

/ 0 1 \ /1\

\ 1 0 / \0/

Note that you always place the matrix to the left of the ket. As you may have heard, matrix multiplication, unlike ordinary multiplication, does not commute, which goes against everything we were taught in school. It’s as if 2 x 4 were not always equal to 4 x 2. But that’s how matrix multiplication works, and once you get the hang of it, you’ll see why. Meanwhile, keeping the all-important ordering of elements in mind, the complete notation for applying the quantum NOT-gate to our qubit (in this case the spin-up state of an electron) looks like this:

X|0> = / 0 1 \ /1\ = /0\ = |1>

\ 1 0 / \0/ \1/

Applied to a spin-down vector, the complete notation looks like this:

X|1> = / 0 1 \ /0\ = /1\ = |0>

\ 1 0 / \1/ \0/

Despite all the foreign notation, in both of these cases what’s actually happening here is that a qubit in the form of a single electron is passing through a quantum gate and coming out the other side with its spin flipped completely over.
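If you’d like to check this yourself, here’s a short NumPy sketch (Python and NumPy are simply my tools of choice here) showing the X matrix flipping |0> into |1> and back:

import numpy as np

ket0 = np.array([[1], [0]])   # |0>, spin-up
ket1 = np.array([[0], [1]])   # |1>, spin-down
X = np.array([[0, 1],
              [1, 0]])        # the Pauli X (quantum NOT) gate

print(X @ ket0)   # [[0], [1]]  -> |1>
print(X @ ket1)   # [[1], [0]]  -> |0>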

The Pauli Y and Z-gates

I’ll spare you the math with these two. But you should at least know about them in passing.

Of the three Pauli gates, the Pauli Y-gate is the fancy one. It looks a lot like the X-gate, but with an i (yep, the insane square root of -1) in place of the regular 1, and a negative sign in the upper right. Here’s what Y looks like:

/ 0 -i \

\ i  0 /

The Pauli Z-gate is far easier to follow. It looks kind of like a mirror image of the X-gate above, but with a negative sign thrown into the mix. Here’s what Z looks like:

/ 1 0 \

\ 0 -1 /

The Y-gate and the Z-gate also change the spin of our qubit electron. But I’d probably need to delve into the esoteric mysteries of the Bloch sphere to really explain how, and I’ve got another gate to go through at the moment…

The Hadamard gate

While the Pauli gates are a lot like classic logic gates in some respects, the Hadamard gate, or H-gate, is a bona fide quantum beast. It shows up everywhere in quantum computing, and for good reason. The Hadamard gate has the characteristically quantum capacity to transform a definite quantum state, such as spin-up, into a murky one, such as a superposition of both spin-up and spin-down at the same time.

Once you send a spin-up or spin-down electron through an H-gate, it will become like a penny standing on its end, with precisely 50/50 odds that it will end up heads (spin-up) or tails (spin-down) when toppled and measured. This H-gate is extremely useful for performing the first computation in any quantum program because it transforms pre-set, or initialized, qubits back into their natural fluid state in order to leverage their full quantum powers.
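Here’s the same story as a short NumPy sketch (again, purely an illustration): applying H to |0> produces equal amplitudes for |0> and |1>, and squaring those amplitudes gives the 50/50 measurement odds:

import numpy as np

H = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)   # the Hadamard gate
ket0 = np.array([[1], [0]])            # a qubit initialized to |0>

state = H @ ket0                       # amplitudes: [0.707..., 0.707...]
probabilities = np.abs(state) ** 2     # 50/50 odds of measuring 0 or 1
print(state.ravel(), probabilities.ravel())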

Other quantum gates

There are a number of other quantum gates you’re bound to run into. Many of them operate on several qubits at a time, leading to 4×4 or even 8×8 matrices with complex-numbered elements. These are pretty hairy if you don’t already have some serious matrix skills under your belt. So I’ll spare you the details.

The main gates you’ll want to be familiar with are the ones we’ve covered above.

You should know that other gates exist, so here’s a quick list of some of the most widely used ones, just so you can get a feel for the jargon:

  • Toffoli gate
  • Fredkin gate
  • Deutsch gate
  • Swap gate (and swap-gate square root)
  • NOT-gate square root
  • Controlled-NOT gate (C-NOT) and other controlled gates

There are many more. But don’t let the numbers fool you. Just as you can perform any classical computation with a combination of NOR gates (NOT + OR) or NAND gates (NOT + AND), you can reduce the list of quantum gates to a simple set of universal quantum gates. But we’ll save that deed for another day.

Future gazing through the quantum gateway

As a recent Quanta Magazine article points out, the quantum computers of 2018 aren’t quite ready for prime time. Before they can step into the ring with classical computers with billions of times as many logic gates, they will need to face a few of their own demons. The most deadly is probably the demon of decoherence. Right now, quantum decoherence will destroy your quantum computation in just “a few microseconds.” However, the faster your quantum gates perform their operations, the more likely your quantum algorithm will beat the demon of decoherence to the finish line, and the longer the race will last. Alongside speed, another important factor is the sheer number of operations performed by quantum gates to complete a calculation. This is known as a computation’s depth. So another current quest is to deepen the quantum playing field. By this logic, as the rapidly evolving quantum computer gets faster, its calculations deeper, and the countdown-to-decoherence longer, the classical computer will eventually find itself facing a formidable challenger, if not successor, in the (quite possibly) not too far future.

If you liked this article I would be super excited if you hit the like button 🙂 or share it with your curious friends. You can subscribe to this profile and get all my articles sent to you as soon as I write them by clicking the subscribe button! (How awesome?!)

Anyway, thanks again for reading and have a great day!

The Need, Promise, and Reality of Quantum Computing — February 1, 2018

The Need, Promise, and Reality of Quantum Computing

Despite giving us the most spectacular wave of technological innovation in human history, there are certain computational problems that the digital revolution still can’t seem to solve. Some of these problems could be holding back key scientific breakthroughs, and even the global economy. Although conventional computers have been doubling in power and processing speed nearly every two years for decades, they still don’t seem to be getting any closer to solving these persistent problems. Want to know why? Ask any computer scientist, and they’ll probably give you the same answer: today’s digital, conventional computers are built on a classical, and very limited, model of computing. In the long run, to efficiently solve the world’s most persistent computing problems, we’re going to have to turn to an entirely new and more capable animal: the quantum computer.

Ultimately, the difference between a classical computer and a quantum computer is not like the difference between an old car and a new one. Rather, it’s like the difference between a horse and a hawk: while one can run, the other can fly. Classical computers and quantum computers are indeed that different. Here we take a good look at where the key difference lies, and take a deep dive into what makes quantum computers unique. However, what you won’t find here is a final explanation for how quantum computers ultimately work their magic. Because no one really knows.

The hard limits of classical computing

Moore’s law, Shmore’s Law

For several decades now, the sheer speed and computational power of conventional computers have been doubling every two years (and by some accounts just eighteen months). This is known as Moore’s law. Although the breakneck pace of progress may have finally begun to slow slightly, it’s still more or less true that the room-filling supercomputer of today is the budget laptop of tomorrow. So at this rate, it seems reasonable to assume that there is no computational task that a conventional computer couldn’t eventually tackle in the foreseeable future. Nonetheless, unless we’re talking trillions of years (and then some), that’s simply not a safe assumption when it comes to certain stubborn tasks.

The conventional computer’s Achilles heel

The fact is that a computational task such as quickly finding the prime factors of very large integers is probably out of reach for even the fastest conventional computers of the future. The reason is that the work required to find a number’s prime factors grows roughly exponentially with the size of the number. What’s exponential growth? Let’s dive into it, because it’s a very important piece for understanding why quantum computers have so much potential and why classical computers fall short.

Quick introduction to exponential growth

Some things grow at a constant rate, and some things grow faster as the total you already have grows. When the growth becomes more rapid (not constant) in relation to the growing total, it is exponential.

Exponential growth is extremely powerful. One of the most important features of exponential growth is that, while it starts off slowly, it can result in enormous quantities fairly quickly — often in a way that is shocking.

This definition can be a bit hard to get your head around without an example, so let’s dive into a quick story.

There is a legend in which a wise man, who was promised an award by a king, asks the ruler to reward him by placing one grain of rice on the first square of a chessboard, two grains on the second square, four grains on the third and so forth. Every square was to have double the number of grains as the previous square. The king granted his request but soon realized that the rice required to fill the chessboard was more than existed in the entire kingdom and would cost him all of his assets.

Exponential Growth of Rice

The number of grains on any square reflects the following rule, or formula:

N = 2^(k - 1)

In this formula, k is the number of the square and N is the number of grains of rice on that square.

  • If k = 1 (the first square), then N = 2⁰, which equals 1.
  • If k = 5 (the fifth square), then N = 2⁴, which equals 16.

This is exponential growth because the exponent, or power, increases as we go from square to square.
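A few lines of Python (just for illustration) show how quickly the rule N = 2^(k - 1) gets out of hand across all 64 squares:

# Grains of rice on each square, following N = 2**(k - 1)
grains_on_square = [2 ** (k - 1) for k in range(1, 65)]

print(grains_on_square[0])    # square 1: 1 grain
print(grains_on_square[4])    # square 5: 16 grains
print(sum(grains_on_square))  # whole board: 18,446,744,073,709,551,615 grains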

To conceptualize this further, I’ve included a graph of what exponential growth looks like in relation to the input quantity of an exponential function.

As you can see, the function starts relatively slow, but soon shoots up to numbers that no classical computer would be able to compute with large enough input sizes.

Real exponential functions have real consequences

Okay, enough storytelling. Let’s move on to a real-world exponential problem, the one we were talking about earlier: prime factorization.

Take the number 51. See how long it takes you to find the two unique prime numbers that you can multiply together to generate it. If you’re familiar with these kinds of problems, it probably only took you a few seconds to find that 3 and 17, both primes, generate 51. As it turns out, this seemingly simple process lies at the heart of the digital economy and is the basis for our most secure types of encryption. The reason we use this technique in encryption is that as the numbers used in prime factorization get larger and larger, it becomes increasingly difficult for conventional computers to factor them. Once you reach a certain number of digits, you find that it would take even the fastest conventional computer months, years, centuries, millennia, or even countless eons to factor it.
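As a rough illustration (the code and the second number are mine, not anything from a real cryptographic setting), here’s the most naive possible factoring routine in Python. It handles 51 instantly, but the work balloons as the number of digits grows, which is exactly why encryption leans on this problem:

def prime_factors(n):
    # Trial division: try every candidate divisor up to the square root of n.
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

print(prime_factors(51))          # [3, 17] -- instant
print(prime_factors(2**31 - 1))   # [2147483647] -- still fine; now picture 600 digits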

With this idea in mind, even if computers continue to double in processing power every two years for the foreseeable future (and don’t bet on it), they will always struggle with prime factorization. Other equally stubborn problems at the heart of modern science and mathematics include certain molecular modeling and mathematical optimization problems which promise to crash any supercomputer that dares to come anywhere near them.

Below is a great illustration from IBM Research that shows the most complex molecule (an F cluster) that we can simulate on the world’s most powerful supercomputer. As you can see (in the bottom left of the image), the molecule is not very complex at all, and if we want to model more complex molecules to discover better drug treatments and understand our biology, then we will need a different approach!

Molecular Simulation Problem. Source: IBM Research

Enter the quantum computer

Conventional computers are strictly digital and rely purely on classical computing principles and properties. Quantum computers, on the other hand, are strictly quantum. Accordingly, they rely on quantum principles and properties — most importantly superposition and entanglement — that make all the difference in their almost miraculous capacity to solve seemingly insurmountable problems.

Superposition

To make sense out of the notion of superposition, let’s consider the simplest possible system: a two-state system. An ordinary, classical two-state system is like an On/Off switch that is always in one state (On) or another (Off). Yet a two-state quantum system is something else entirely. Of course, whenever you measure its state, you will find that it is indeed either on or off, just like a classical system. But between measurements, a quantum system can be in a superposition of both on and off states at the same time, no matter how counter-intuitive, and even supernatural, this may seem to us.

Superposition. Source: IBM Research

Generally speaking, physicists maintain that it’s meaningless to talk about a quantum system’s state, such as its spin, prior to measurement. Some even argue that the very act of measuring a quantum system causes it to collapse from a murky state of uncertainty to the value (On or Off, Up or Down) that you measure. Although probably impossible to visualize, there’s no escaping the fact that this mysterious phenomenon is not only real but gives rise to a new dimension of problem-solving power that paves the way for the quantum computer. Keep the idea of superposition in mind. We will come back to how this is used in quantum computing in a bit.

How superposition is even possible is beyond the scope of this article, but trust that it has been confirmed experimentally again and again. If you want to understand what gives rise to superposition, you'll first need to understand the idea of wave/particle duality.

Entanglement

Okay, on to the next property of quantum mechanics which we need to leverage to create a quantum computer.

It is known that once two quantum systems interact with one another, they become hopelessly entangled partners. From then on, the state of one system will give you precise information about the state of the other system, no matter how far the two are from one another. Seriously, the two systems can be light years apart and still give you precise and instantaneous information about each other. Let's illustrate this with a concrete example, since this puzzled even Einstein, who famously dismissed the phenomenon as "spooky action at a distance."

Quantum Entanglement. Source: IBM Research

Suppose you have two electrons, A and B. Once you have them interact in just the right way, their spins will automatically get entangled. From then on, if A's spin is Up, B's spin will be Down, like two kids on a seesaw, except that this holds true even if you take A and B to opposite ends of the Earth (or the galaxy, for that matter). Despite the thousands of miles (or light years) between them, it's been proven that if you measure A to have spin Up, you will know instantly that B's spin is Down. But wait: we've already learned that these systems don't have precise values for states such as spin, but rather exist in a murky superposition, prior to measurement. So does our measuring A actually cause B to instantaneously collapse to the opposite value, even when the two are light years apart? If so, then we have yet another problem on our hands, because Einstein taught us that no causal influence, such as a light signal, between two systems can travel faster than the speed of light. So what gives? All told, we honestly don't know. All we know is that quantum entanglement is real and that you can leverage it to work wonders.

The qubit

The qubit plays the same role in quantum computing as the bit does in classical computing: it's the fundamental unit of information. However, compared to a qubit, a bit is downright boring. Although both bits and qubits generate one of two states (a 0 or a 1) as the outcome of a computation, a qubit can simultaneously be in both 0 and 1 states prior to that outcome. If this sounds like quantum superposition, it is. Qubits are quantum systems par excellence.
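
We can't peek at a real qubit's state, but we can sketch the bookkeeping classically. The tiny NumPy example below is purely illustrative (it is not how a real quantum device is programmed): it stores the two amplitudes of an equal superposition and samples measurement outcomes from them.

import numpy as np

# Classical *simulation* of one qubit: two complex amplitudes, one for |0>
# and one for |1>. A real qubit never lets you read these numbers directly;
# you only ever see the 0/1 measurement outcomes.
state = np.array([1, 1], dtype=complex) / np.sqrt(2)   # equal superposition

probabilities = np.abs(state) ** 2        # Born rule: probability = |amplitude|^2
samples = np.random.choice([0, 1], size=10, p=probabilities)
print(probabilities)                      # [0.5 0.5]
print(samples)                            # each measurement collapses to a definite 0 or 1

Simulating one qubit takes two numbers; simulating n entangled qubits takes 2ⁿ of them, which is exactly why classical machines fall behind so quickly.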

Just as conventional computers are built bit by bit with transistors that are either On or Off, quantum computers are built qubit by qubit with electrons in spin-states that are either Up or Down (once measured, of course). And just as transistors in On/Off states are strung together to form the logic gates that perform classical computations in digital computers, electrons in Up/Down spin-states are strung together to form the quantum gates that perform quantum calculations in quantum computers. Yet stringing together individual electrons (while preserving their spin states) is far, far easier said than done.

Quantum Algorithms. Source: IBM Research

Where are we today?

While Intel is busy pumping out conventional chips with billions of transistors apiece, the world's leading experimental computer scientists are still struggling to build a quantum computer "chip" with more than a handful of qubits. Just to give you a sense of how early we are in the history of quantum computing, it was a big deal when IBM recently unveiled the largest quantum computer in the world with an astonishing… wait for it… 50 qubits. Nonetheless, it's a start, and if anything like Moore's law applies to quantum computers, we should get into the hundreds in a few years, and the thousands in a few more. A billion? I wouldn't hold your breath, but then again, you don't need a billion qubits to beat the daylights out of a conventional computer in some key categories, such as prime factorization, molecular modeling and a slew of optimization problems that no conventional computer can touch today.

The quantum computers of 2018

All the same, as of right now, nearly every quantum computer is a multi-million dollar borderline mad-scientist project that looks the part. You generally find them in R&D departments at large IT companies like IBM, or in the experimental physics wing of large research universities, like MIT. They have to be super-cooled to a hair above absolute zero (that’s colder than intergalactic space), and experimenters need to use microwaves of a precise frequency to communicate with each qubit in the computer individually. Needless to say, that doesn’t scale. But neither did the vacuum tubes of the earliest conventional computers, so let’s not judge this first generation too harshly.

Roadblocks awaiting breakthroughs

The primary reason that quantum computers haven't gone mainstream yet is that the best minds and inventors in the world are still struggling with high error rates and low qubit counts. As we address these two problems together, we will rapidly increase what IBM calls each computer's "quantum volume," a way of quantifying how much useful computation a quantum computer can actually perform.

Quantum Volume. Source: IBM Research

In short, for quantum computing to take off and quantum-powered Macbooks to start flying off the shelves, we need far more qubits and far fewer mistakes. That’s going to take time, but at least we know what we’re aiming for, and what we’re up against.

Myths vs explanations

Although we know that quantum computers can easily do things that no conventional computer can dream of doing, we don't really know how they do it. If this sounds surprising, given that the first generation of quantum computers already exists, keep in mind the word quantum. We've been using quantum mechanics to solve problems for a century now, and we still don't really know how it works. Quantum computing, as a member of the quantum family, is in the same boat. Michael Nielsen (who basically wrote the book on the subject) has argued convincingly that any explanation of quantum computing is destined to miss the mark. After all, according to Nielsen, if there were a straightforward explanation for how a quantum computer works (that is, something you could visualize), then it could be simulated on a conventional computer. But if it could be simulated on a conventional computer, then it couldn't be an accurate model of a quantum computer, because a quantum computer by definition does what no conventional computer can do.

According to Nielsen, the most popular myth that pretends to explain quantum computation is called quantum parallelism. Because you’re going to hear the quantum parallelism story a lot, let’s look at it for a moment. The basic idea behind quantum parallelism is that quantum computers, unlike their conventional counterparts, explore the full spectrum of possible computational outcomes/solutions simultaneously (i.e. in a single operation), while digital computers must plod along, looking at each solution in sequence. According to Nielsen, this part of the quantum-parallelism story is roughly right. However, he sharply criticizes the rest of the story, which goes on to say that after surveying the full spectrum of solutions, quantum computers pick out the best one. Now that, according to Nielsen, is a myth. The truth, he insists, is that what quantum computers, like all quantum systems, are really doing behind the scenes is entirely out of our reach. We see the input, and the output, and what happens in between is sealed in mystery.

If you liked this article, I would be super excited if you shared it with your curious friends. I've got much more like it coming, and if you want to be notified whenever I post a new article, you can just subscribe to this blog and have the articles sent to you as soon as I write them! (how awesome?!)

Anyway, thanks again for reading, and have a great day!

Why AlphaGo is a bigger game changer for Artificial Intelligence than many realize — October 9, 2017

Why AlphaGo is a bigger game changer for Artificial Intelligence than many realize

What’s all this fuss about the AI AlphaGo’s recent victory against the masters?

While it’s seemed like AI had hit a dead-end as much as a decade ago, if you’re like many of us sci-fi enthusiasts and have always wanted an AI best friend, the recent victory of AlphaGo has brought us much closer than you may have thought was possible.

AI is Finally Moving Forward

We’re not surprised if you haven’t been following the recent developments in AI all that closely because, for the most part, it’s seemed like nothing exciting has happened for quite a long time. Sci-fi dreams about computer powered best friends aside, AI for the general public has come to mean reasonably responsive and well-programmed computer assistance rather than independent thinking machines. Concepts like ‘smart’ chatbots somehow seem to pull us further from the Star Trek or Heinlinian dream of fully sentient and intuitive computers while many products and services that claim to integrate AI seem to be nothing more than a fast way to analyze large amounts of data. In fact, the last time most of us heard something hopeful about AI was when Deep Blue beat the world Chess champion, but what ever came of that AI? Surely it hasn’t used that incredible logical power to take over the world or begin making friends, so what do we even care?

Not All AIs are Equal

The answer lies in the fact that there are many forms of artificial intelligence, and most of them are limited to the tasks they were made to perform. That's what makes AlphaGo so special: while it was designed, named, and trained to play Go against the masters, its potential goes well beyond the realm of board games, unlike most of its AI contemporaries.

While practical applications for purpose-built AI are growing, the tradition of honing your AI programming skills on classic strategy games has existed since the 1950s, when a computer was first programmed to play, and win, a game of tic-tac-toe. Since then, a large variety of games and custom-built AIs have been tested against each other, to the great entertainment of experts in the field and curious nerds like us who care about that sort of thing. The real difference is not what they're programmed for but how they are programmed in the first place, and this is also what most profoundly distinguishes AlphaGo from its older-generation relative, the chess champion Deep Blue.

Chess is a Closed Game

You may not know this, but there is a standard way to program an AI to play a board game, known as a search tree, in which the computer analyzes all the pieces and spaces in a game and determines which move on its turn is most likely to result in victory. For games with a limited number of moves and responses, you don't even have to spend much time programming good judgment; all you need is a sufficiently complete understanding of the game. With that in mind, consider how long people have been playing, analyzing, and writing down their analysis of chess.

Enormous swaths of the game have been mapped out and studied in depth. Do you know what they found? Chess is, in principle, a finite game: there is a fixed (if astronomically large) number of possible piece arrangements, each arrangement allows a finite number of moves, and each of those moves can be judged as a good or bad idea. Openings have been catalogued move by move, and for endgames with only a few pieces left, tablebases literally store the best move for every possible position. In other words, the quick and dirty way to make a chess "AI" doesn't require much in the way of original thought, just stored chess knowledge and brute-force search. Computers were always destined to master chess, because mastering chess largely means storing and searching what is already known about a closed game.

So how did Deep Blue win back in the 90s? You can breathe easy knowing that the famous AI did not rely on a simple lookup table; instead it used a parallel system designed to run a complex tree search. At each point in the game it would analyze the board, assess the possible moves it detected, and evaluate which could move it closer to a win. Defeating the world chess champion was a huge victory for Deep Blue, and for more than just capturing a king. It showed that the machine's board-assessment program could be faster and sharper than a human strategy expert. But it was not what most of us sci-fi enthusiasts would think of as the beginning of independent computer thought. The only thing Deep Blue can do is play chess, and because chess is a finite game, Deep Blue never needs to get smarter.
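
To make "tree search" less abstract, here's a minimal minimax search written in Python for tic-tac-toe, a game small enough to search exhaustively. This is my own toy illustration; Deep Blue's real search added hand-tuned evaluation heuristics, pruning, and custom hardware, but the recursive "try every move and assume the opponent replies perfectly" skeleton is the same idea.

# Minimax game-tree search on tic-tac-toe. A board is a tuple of 9 cells,
# each 'X', 'O', or ' '. Scores are from X's point of view:
# +1 = X wins, -1 = O wins, 0 = draw.
WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
        (0, 3, 6), (1, 4, 7), (2, 5, 8),
        (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WINS:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    w = winner(board)
    if w is not None:
        return (1 if w == 'X' else -1), None
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if not moves:
        return 0, None                       # board full: a draw
    best_score, best_move = None, None
    for m in moves:
        child = board[:m] + (player,) + board[m + 1:]
        score, _ = minimax(child, 'O' if player == 'X' else 'X')
        if (best_score is None
                or (player == 'X' and score > best_score)
                or (player == 'O' and score < best_score)):
            best_score, best_move = score, m
    return best_score, best_move

print(minimax((' ',) * 9, 'X'))   # (0, 0): perfect play by both sides is a draw

Chess works on the same principle, but its tree is so much deeper and wider that Deep Blue needed specialized hardware and aggressive pruning to search it in real time.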

Go is Not a Closed Game

People have been studying Go for thousands of years, and even with modern computer analysis in hand, nobody has been able to tame it the way chess has been tamed. Go is technically finite too, but with a 19×19 board and more legal positions than there are atoms in the observable universe, there are far too many possibilities, board arrangements, and good or bad placement choices for any reasonable purpose-built program to handle. You can certainly write a program that plays Go, but until AlphaGo, computer opponents only ever reached an intermediate level of play, and trying to fill a database with every possible board arrangement and move might well catch your servers on fire.

AlphaGo Learns to Play

It is for this reason that many people, Go masters included, were certain that a computer could never learn to beat the human champions of the game, and it's exactly why DeepMind decided to try. Why has AlphaGo succeeded where other AIs were judged to not even have a chance? The difference is that DeepMind tried something new in the world of games versus AIs: machine learning and neural networks, rather than hand-crafted search rules alone. AlphaGo doesn't just judge the board; it learns from its mistakes. Like a Go expert who has been playing since early childhood, AlphaGo was run through thousands upon thousands of games against itself, and it learned from every one of them how to be a better player and sharpen its strategy. And it never gets bored, frustrated, or tired during practice.

AlphaGo Teaches the Masters

Two years ago, DeepMind felt that AlphaGo was ready to start playing against expert human opponents and invited the European Go champion, Mr. Fan Hui, to a closed-door five-game test. To their surprise and delight, it won every single game, becoming the first computer program to defeat a professional Go player. They then set it against Mr. Lee Sedol, the legendary winner of 18 world titles, in Seoul, where it won 4–1 and earned a 9-dan professional ranking, the highest certification available. If this wasn't awesome enough, during these games AlphaGo dazzled the audience and its opponent with creative winning moves, one of which effectively overturned hundreds of years of accumulated Go wisdom.

Deep Blue Was Columbus Discovering America And AlphaGo Is The Moon Landing

Any computer scientist or programmer will admit that Deep Blue achieved something incredible when it beat Kasparov. But the amazing feat was in Deep Blue's raw computational power. It did not learn to play chess. It was programmed to search through enormous numbers of possible positions and evaluate the best move available to it. Once Deep Blue had won the match and proven its strength, it was packed away and has not been seen since. Everyone knew that its only purpose was to play chess and that its programming could not be applied to much of anything else. AlphaGo, on the other hand, took the idea of computational power and added something like human reasoning or intuition. That combination makes it applicable to countless purposes.

Computer Scientists Versus Chess Masters

Another unique aspect of how AlphaGo was created, versus how Deep Blue was created, is which experts the builders relied on. With Deep Blue, the computer scientists leaned heavily on chess experts, professionals, and masters to pack as much chess knowledge into the program as possible. And the thing is, even after Deep Blue had strutted its stuff, it did not change much for the world of chess. Chess players did not learn anything from it. With AlphaGo, however, the computer scientists simply used lots and lots of games from a myriad of players at different levels of Go knowledge and experience. And unlike when Deep Blue was unveiled, when AlphaGo was first shown to the world, Go players paid attention. They saw that AlphaGo was playing in innovative ways. It has taught them to think and play more creatively.

AlphaGo’s Intuitive Factor

It is easy to say that AlphaGo has intuition, which Deep Blue was missing. It is much more difficult to explain where that intuition comes from. To put it simply, AlphaGo built on Deep Blue's search-and-optimize idea. The DeepMind team trained AlphaGo on roughly 150,000 games played by strong human players, and from those games it learned to estimate which move was most probably the best in any given position. To take AlphaGo to the next level, though, DeepMind used neural networks and machine learning so that, through self-play and play against humans, it could slowly make millions of tiny adjustments, allowing it to obtain something as close to intuition as possible.
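
If it helps to see the shape of that idea in code, here's a deliberately toy sketch of a "policy": a function that maps an encoded board position to a probability for each candidate move. Everything here (the feature size, the random weights, the nine candidate moves) is made up for illustration; the real AlphaGo used deep convolutional networks combined with Monte Carlo tree search.

import numpy as np

rng = np.random.default_rng(0)

def toy_policy(position_features, weights):
    logits = position_features @ weights       # one raw score per candidate move
    exps = np.exp(logits - logits.max())       # softmax (stabilized) turns scores
    return exps / exps.sum()                   # into probabilities that sum to 1

position_features = rng.normal(size=32)        # stand-in for an encoded board state
weights = rng.normal(size=(32, 9))             # stand-in for millions of learned weights
move_probs = toy_policy(position_features, weights)

print(move_probs.round(3))                     # probability assigned to each move
print(move_probs.argmax())                     # the move this "intuition" favours

Training is the part that matters: each of those weights gets nudged, millions of times over, so that moves which led to wins become more probable and moves which led to losses become less so.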

And it is this intuition factor, recognizing good patterns and learning from them, that will have a much deeper impact on artificial intelligence. In the world of art, this type of AI can expose a neural network to a specific artistic style, then show the network an image, and the network will replicate that image in the style it was shown. In the world of language, the same kinds of neural networks are being used to recognize natural language. In the world of games, these networks are employed to improve video game experiences. And the list of future possibilities for using neural networks, machine learning, and artificial intelligence to give computers something like intuition is growing by the day: think healthcare, smartphone assistants, and robotics. In fact, the UK's National Health Service has already signed a deal with DeepMind.

It Was Not Supposed To Be This Easy

Go is a game that has been around for roughly 3,000 years. It is widely regarded as the most challenging strategy game in existence. In countries like South Korea and China, some children are sent to private schools specifically to learn how to play the game at an expert level. It takes years of playing for several hours every day to master. In other words, even though it has simple rules, it is not a simple game to excel at. And because of its complexity, and because of how long it had taken computer scientists to create a machine that could win at chess, experts estimated that a machine able to play Go at this level was still about ten years away.

Surprise! DeepMind managed to create a machine that could master the game without being programmed with explicit strategy rules and without being taught by a professional Go player. Beyond its initial diet of human games, AlphaGo mainly played against itself and learned from that self-play. At its core, it learned the way a human learns: by looking at the board, evaluating the options, making moves, and learning from mistakes. It just did all of this far faster than any human can.

This is extremely exciting because, at its core, it means that computer scientists have had the essential tools for years. Neural networks have been known about and discussed since the middle of the last century. All it really took was getting creative with them and applying them in new ways. AlphaGo beating the world's best Go players suggests that AI of this kind can learn and master an enormous range of tasks, and accomplish in days what takes humans years of practice.

You’re probably wondering what this all means. The good news is that we’re much closer to the dream of an AI best friend than most of us would have dared to imagine a few years ago. Let it sink in for a moment: AlphaGo can learn the most complex, intuition and creativity based logic game known to man and it didn’t do so through a finite database or search trees alone. It learned from practice and experience, just like we do, and the ability to create amazing new solutions to ancient puzzles suggests a realm of digital creativity never before fathomed.

AlphaGo is not like other game playing AIs that have come before it. It is the future of intelligent and intuitive machines, one that we plan to turn toward more than just board games. From practical applications to that friend you’ve been hoping for, AlphaGo is sure to be the first of a new generation of self-learning intuitive AIs that go above and beyond the limited calculating capacities of its older siblings and contemporaries. If you love AI like I do, keep your eyes open for new practical applications for very real artificial intelligence popping up in places you may not have even imagined. The AI winter is over.

Scala vs Kotlin: Practical Considerations for the Pragmatic Programmer — September 14, 2017

Scala vs Kotlin: Practical Considerations for the Pragmatic Programmer

Java isn’t just a language; it’s an ecosystem. You can write code for the JVM without writing any Java. This gives you the option of using a more modern language. Some of the shortcomings of Java are obvious. It makes you write a lot of boilerplate code. It supports functional programming only as an afterthought; the lambda feature is a kludge. The NullPointerException is every Java programmer’s bane.

In 2004, a group led by Martin Odersky released a new JVM language called Scala ("scalable language"). It added features such as objects for everything, functions as assignable values, type inference, and pattern matching. It compiles to Java bytecode and can be mixed with Java code.

Another language aimed at the same goals is Kotlin, released by JetBrains in 2012. It built on people’s experience with Scala. A common complaint with Scala is slow compilation time, and Kotlin offers compile speeds comparable to Java. It’s recently gotten a big boost from Google, which has declared it a first-class language for Android development.

If some features of Java constantly annoy you, you’ll find things to like in both languages. If you’re annoyed enough to make the jump, which way should you go? Should you choose the maturity of Scala or the freshness of Kotlin? There are benefits to each.

Solving problems different ways

Kotlin and Scala, like Java, are statically typed. Whatever type a variable starts out as, it will keep it for its whole life. But both of them save you some of the effort of declaring every variable. You can implicitly declare a type with an initializer. In either language you can write

var count = 1

That makes count an integer. Notice that no semicolon is required. The difference between the languages is that Scala goes much further in allowing implicit conversions. If you use x.transmogrify(), and x belongs to a class which doesn’t have a transmogrify function, that isn’t necessarily an error. You can create an implicit class which has a transmogrify method, and the compiler will figure out, without making you do any casting, whether it can step in to do the job.

Kotlin’s creators found this a little too free-wheeling. It lets you define extension methods on a class, adding custom functionality. You can do this even on standard data types. (Remember, everything is an object, so every data type is a class. Boxing of simple data types is no longer needed.)

Null values are a huge headache in Java. Scala helps to relieve this in a couple of ways. First, variables must be initialized. You can initialize them to the default value (var a:Int = _), which is often null, but at least it makes you aware you’re doing it. Second, the Option class helps in guaranteeing null-safety in parameters and returned values. It’s one of the more complicated features of the language to understand, but it gives you a lot of control.

Kotlin gets right to the point. By default, it doesn’t allow variables to have the value null. You can declare a variable to be nullable if you really need to, by putting a question mark after the type. If you’ve worked with Swift, this approach will sound familiar. If you use nullable values, the compiler does extensive checking to make sure you aren’t putting them at risk of a NullPointerException and will give you a compile-time error if you are.

Java makes you use a regular class or an enum if you just want to package some data together in an object. Scala and Kotlin offer better options. Scala gives you the case class, which is a specialized class for data objects. It automatically defines accessor functions (why doesn't Java just do that?). Instances are compared by structure rather than by reference. A copy function is automatically provided to do a shallow copy.

Kotlin’s data class does pretty much the same thing. The main difference is that Scala has a powerful pattern matching feature which Kotlin didn’t pick up. A match statement is like a bionically enhanced case statement. Patterns can check not only literal values but types, lists, and ranges. Scala can do matching on all kinds of objects, but the feature is especially powerful with case classes.

Scala provides strong support for XML. You can put XML directly into Scala code and assign it to an XML object. This creates complications, since a < operator that isn't followed by a space may be read as the start of an XML expression. Kotlin uses the more traditional approach of classes to handle XML objects.

Type classes are a feature of Scala that doesn’t have an equivalent in Kotlin. A type class defines a set of operations which member classes must support. This isn’t like subclassing in Java; a type class can be added to types that already exist. It lets the developer create new kinds of polymorphism with existing types. Extension functions in Kotlin aren’t the same thing, but they let you add common ground to different types, so they address some of the same needs.

The feel of the language

OK, there are differences between the languages, but they aim at more or less the same thing. You can learn either one. Are there bigger, more philosophical reasons for choosing one or the other?

To some people, the difference is that Scala is more aimed at exploring new ideas, and Kotlin is more focused on getting results. Kotlin’s emphasis on fast compilation and its removal of some of Scala’s more esoteric features reflect this. Scala just lets you do lots of things. The operator name ?:+ appears to be legal, and maybe there’s a reason you’d want to use it. Kotlin is more restrictive. Some would say saner.

If you love functional programming, Scala has more of its features than Kotlin. Type classes are a functional programming feature. As another example, Scala supports currying and partial application, which are ways to break down functions that take multiple arguments. This provides additional flexibility in using argument lists. Kotlin provides ways to do the same things, but they might not be as mathematically elegant.

People who have learned Scala thoroughly love it. It takes more effort, but it lets developers do things they can’t do in Kotlin. Kotlin adherents often find that much flexibility more confusing than useful.

Practical considerations

Sometimes the realities of what you're trying to do are the main factor. You need to pick the language that will let you do the job, even if you don't like it quite as much. If you're going to do Android development, Kotlin is the only choice of the two. Android doesn't use Oracle's JVM, so you can't assume any old JVM-compatible language will work. Kotlin has the tools for compiling, debugging, and running software on Android, and it's built into Android Studio starting with version 3.0.

Outside Android, Kotlin’s options are more limited. Are you committed to Eclipse for your IDE? You can use it to work with both languages, up to a point. The Scala IDE for Eclipse is more mature than the Kotlin plugin, which is a bit painful to set up. Some users have reported trouble getting the latter to work. The situation for NetBeans is similar. With Kotlin’s growing popularity, the situation may be more equal in a year or two. If you like working from the command line, the IDE situation isn’t an issue, and Kotlin has all the necessary tools.

Kotlin is still maturing, but many Java people find adopting it is an easier transition than Scala is. The one that works best for your needs will depend on your personal style and your practical aims. Look at both carefully before making a decision.

If you enjoyed this article please share it! This is the biggest compliment you can give a writer! Thank you!

The Simply Deep, Yet Convoluted World of Supervised vs Unsupervised Learning — September 6, 2017

The Simply Deep, Yet Convoluted World of Supervised vs Unsupervised Learning

Artificial intelligence (AI) is a lot like life’s relationships. Sometimes what you put into it is pretty straightforward, leading to the output or outcome that you wanted. Other times, let’s just say, the process gets a bit more convoluted and sometimes the outcome isn’t exactly what you envisioned. In other words, you may input the same into both relationships, but different paths lead you to different results. Nevertheless, both are learning processes. In the AI world, this is called supervised and unsupervised deep learning–and like most relationships, the shortest distance between what you input to what you get as output isn’t always the proverbial straight line.

What is Deep Learning?

Before we delve into what supervised and unsupervised deep learning is, you should know that deep learning evolved from a process called machine learning. Machine learning employs an algorithm, or set of rules, that creates output without specific programming. Think about how social networking mines data from your posts. For instance, you go out to eat with your friends at your favorite sushi place and share facts online about your experience: what you loved, what you found distasteful, photos, whether you would return. Once you input these into your social network, an algorithm picks up tidbits from your input to extract patterns about what you like, what you don't like, even what you look like based upon your pictures. The algorithm may discover that you are around 23 years old, eat out at this particular type of restaurant twice a month with your friends, and prefer California rolls over eel sushi. It then sends you ads based upon that data. Machine learning iteratively gleans information about its input despite not being told how to do so or where to look for that information.

Deep learning kicks it up a notch. It takes your input, finds that it can either categorize it without issue (supervised) or clusters unlabeled information, attempting to categorize it so that it makes sense (unsupervised), before taking that input and creating some sort of viable output. It’s a layered architecture making sense of data that can be quite abstract from one layer to another. That’s how deep learning emulates the multi-faceted complexity of the human brain–its neural pathways processing copious amounts of information that doesn’t make sense until it does (or not).

Supervised Deep Learning: The KISS Pathway That Leads To The Expected

What happens when your supervisor’s hanging over your shoulder at work? Like most, it drives you batty, so you tend to take the path of least resistance to find the most non-challenging way to get your job done quickly and still meet the expectations of your supervisor, right? Let’s say that particular supervisor trained you to process credit applications. Said supervisor knows what’s in those applications and that the expected outcome of any application is approval or not approval. You learned from your training set how to function in the best way to get to the desired outcome, i.e. the results that your supervisor needs. Supervised deep learning is like that. We humans tend to process in a specific hierarchy: we take in life’s input and based on our experiences (training), we organize that input so that our prior knowledge can make sense of it, process it to some expected outcome. Supervised deep learning belongs to that Keep-It-Simple-Stupid (KISS) pathway, that literal path of least resistance leading to some fulfilled expectation.

Supervised deep learning is well suited for decision-making: take our credit card example for instance. The bank takes your application and runs it against its categories of risk before taking action for or against approving you. Here’s the procedural gist:

  • The customer submits an application.
  • The bank feeds the data from the application into the algorithm.
  • The algorithm has learned from past applications that the data follows certain pathways (modeling). For example: marital status (single, married, divorced, widowed) reduces to a yes or no answer on each criterion.
  • The algorithm takes that application data, the yes or no answers as determined by the bank, and follows its flowchart (pathway rules).
  • Data flows through that pathway as the algorithm decides which of the primary categories, approved or not approved, the data belongs in.
  • The expected decision of approved or not approved is rendered.
  • The customer is approved and is a happy camper, or is not approved and wonders how to fix his credit score (had to throw that in).

Supervised deep learning is more than your typical lights-on, lights-off binary function. The algorithm classifies criteria into the bank's categories of risk, processing that risk into one of two decisions. When there are two possible outputs, this is known as binary (or binomial) classification; with more than two, it's multi-class classification.
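
Here's roughly what that looks like in code. This is a minimal sketch with scikit-learn; the feature columns, numbers, and labels are invented purely for illustration and are nothing like a real credit model.

from sklearn.linear_model import LogisticRegression

# Each row is an applicant: [income (k$), years employed, existing debt (k$)].
# The labels are the "supervisor's" answers: 1 = approved, 0 = declined.
X = [[45, 1, 20], [90, 6, 5], [30, 0, 35], [120, 10, 2], [60, 3, 15], [25, 1, 40]]
y = [0, 1, 0, 1, 1, 0]

model = LogisticRegression().fit(X, y)           # learn from the labeled examples

new_applicant = [[70, 4, 10]]
print(model.predict(new_applicant))              # the approve/decline decision
print(model.predict_proba(new_applicant))        # and how confident the model is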

Unsupervised Deep Learning: An Exploratory Journey To Figuring Out the Unknown

If supervised deep learning is a path to expected output, unsupervised deep learning takes that same input and attempts to make sense of it before producing some output. Let's take a trip to the art museum with your best friend as an example. You both become captivated by a painting of a rose. One of you sees it rather literally, the other sees it figuratively. To you, a rose is just a rose and you want to move on to the Van Gogh exhibit. To your friend, that rose is yellow when it should be red, and your friend cannot figure out why the painting denotes friendship and not love. There's no Van Gogh until there's ready to go, and that's not happening until your friend muses about that rose and why her current relationship is hanging on that museum wall.

Unsupervised deep learning has no target, no expectation from the input. It relies on exploring layers of possibilities to get to some conclusion. While you can move on to the Van Gogh exhibit, your friend struggles to figure out how to classify all the many pathways friendship and love can take someone from convolution to happy life and how one can learn from their mistakes.
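
The unsupervised version of the earlier credit example drops the labels entirely. A clustering sketch (again with made-up numbers, using scikit-learn's KMeans) just groups applicants by similarity and leaves it to us to decide what the groups mean.

from sklearn.cluster import KMeans

# The same made-up applicants, but with no approved/declined labels at all.
X = [[45, 1, 20], [90, 6, 5], [30, 0, 35], [120, 10, 2], [60, 3, 15], [25, 1, 40]]

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)   # e.g. [1 0 1 0 0 1]: groupings discovered purely from structure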

Decision Time: If You Knew Then What You Know Now

Humans are subjectively sentient creatures with decision-making processes that cater more to the unexpected (unsupervised deep learning) than to the expected (supervised deep learning). Computers don’t have the human factor. They don’t have experiences. They just have data sets, functions, and “thinking” based on layers of pooling information together in either ordered or non-ordered ways.

As neural nets and AI become more complex, so do the deep learning algorithms. You can choose among supervised, unsupervised, or a combo pack of deep learning to tackle anything from credit approvals to the complexities of mind-boggling robotic data sets. Remember the social networking example? When you uploaded images, something called a Convolutional Neural Network (CNN) picked out traits before it came to the conclusion that you are around 23, pooling together relevant data: restaurant, friends laughing, friends frowning, facial recognition, background recognition. Combine and categorize those subgroups and your image speaks volumes about who you are and how you live. Imagine what they'd unpack from what you say in an uploaded video (a recurrent neural network's job). Yet sometimes life has to unfold unsupervised by knowns, reconstructing (autoencoding) the data-driven universe while self-organizing maps translate often nebulous data patterns into two dimensions (think topographic maps) that allow you to further muse as to why that rose by any other name is just backpropagation (wink).

Understanding Recurrent Neural Networks: The Preferred Neural Network for Time-Series Data — June 26, 2017

Understanding Recurrent Neural Networks: The Preferred Neural Network for Time-Series Data

Artificial intelligence has been in the background for decades, kicking up dust in the distance, but never quite arriving. Well that era is over. In 2017, AI has broken through the dust cloud and arrived in a big way. But why? What’s the big deal all of a sudden? And what do recurrent neural networks have to do with it? Well, a lot, actually. Thanks to an ingenious form of short-term memory that is unheard of in conventional neural networks, today’s recurrent neural networks (RNNs) have been proving themselves as powerful predictive engines. When it comes to certain sequential machine learning tasks, such as speech recognition, RNNs are reaching levels of predictive accuracy, time and time again, that no other algorithm can match. However, the first generation of RNNs, back in the day, were not so hot. They suffered from a serious setback in their error-tweaking process that held up their progress for decades. Finally, a major breakthrough came in the late 90s that led to a new generation of far more accurate RNNs. Building on that breakthrough for nearly twenty years, developers refined and perfected their new RNNs until all-star apps such as Google Voice Search and Apple’s Siri started snatching them up to power key processes. Now recurrent networks are showing up everywhere, and are helping to ignite the AI renaissance that’s unfolding right now.

Neural Networks That Cling to the Past

Most artificial neural networks, such as feedforward neural networks, have no memory of the input they received just one moment ago. For example, if you provide a feedforward neural network with the sequence of letters “WISDOM,” when it gets to “D,” it has already forgotten that it just read “S.” That’s a big problem. No matter how hard you train it, it will always struggle to guess the most likely next character: “O.” This makes it a rather crappy candidate for certain tasks, such as speech recognition, that greatly benefit from the capacity to predict what’s coming next. Recurrent networks, on the other hand, do remember what they’ve just encountered, and at a remarkably sophisticated level.

Let’s take the example of the input “WISDOM” again and apply it to a recurrent network. The unit, or artificial neuron, of the RNN, upon receiving the “D” also takes as its input the character it received one moment ago, the “S.” In other words, it adds the immediate past to the present. This gives it the advantage of a limited short-term memory that, along with its training, provides enough context for guessing what the next character is most likely to be: “O.”

Tweaking and Re-tweaking

If you like to get into the weeds, this is where you get excited. Otherwise, get ready for a rough patch. But hang in there, it's worth it. Like all artificial neural networks, the units of an RNN assign weights to their multiple inputs, then apply a function to the weighted result to determine a single output. However, recurrent networks apply weights not only to their present inputs, but also to their inputs from a moment ago. Then they adjust the weights assigned to their present and past inputs through a process that involves two key concepts you'll definitely want to know if you really want to get into AI: gradient descent and backpropagation through time (BPTT).

Gradient Descent

One of the most famous algorithms in machine learning is known as gradient descent. Its primary virtue is its remarkable capacity to sidestep the dreaded “curse of dimensionality.” This issue plagues systems, such as neural networks, with far too many variables to make a brute-force calculation of their optimal values possible. Gradient descent, however, breaks the curse of dimensionality by zooming in on the local low-point, or local minimum, of the multi-dimensional error or cost function. This helps the system determine the tweaked value, or weight, to assign to each of the units in the network, bringing accuracy back in line.
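
Stripped of the neural network, gradient descent is just "measure the slope, step downhill, repeat." A one-parameter toy example (entirely illustrative) makes the loop obvious:

# Gradient descent on a toy cost function with a single weight w.
# The minimum of (w - 3)^2 is at w = 3; the gradient says which way is downhill.
def cost(w):
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)             # derivative of the cost with respect to w

w = 10.0                               # start far from the optimum
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * gradient(w)   # step against the slope

print(w, cost(w))                      # w is now very close to 3, cost close to 0

Real networks do exactly this, just with millions of weights at once and with the gradient computed by backpropagation.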

Backpropagation Through Time

The RNN trains its units by adjusting their weights following a slight modification of a feedback process known as backpropagation. Okay, this is a weird concept. But if you're into AI, you'll learn to love it. The process of backpropagation works its way back, layer by layer, from the network's final output, tweaking the weights of each unit, or artificial neuron, according to the unit's calculated share of the total output error. Got it? If so, get ready for one more layer of complexity. Recurrent neural networks use a heavier version of this process known as backpropagation through time (BPTT). This version extends the tweaking process back through time, so that the weights applied to the inputs from the previous step (time T−1), which are responsible for each unit's memory of the prior moment, get corrected as well.
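
Nobody writes BPTT by hand anymore; automatic differentiation does the bookkeeping. Here's a hedged sketch with PyTorch: unroll the same recurrent weights over a few time steps, compute a loss at the end, and a single call to backward() sends the error signal back through every step. (The sizes and the sum-as-loss are arbitrary stand-ins.)

import torch

T, n_in, n_hid = 5, 3, 8
W_xh = torch.randn(n_in, n_hid, requires_grad=True)    # input-to-hidden weights
W_hh = torch.randn(n_hid, n_hid, requires_grad=True)   # the recurrent (memory) weights

h = torch.zeros(n_hid)
for t in range(T):                     # unroll the recurrence over the sequence
    x_t = torch.randn(n_in)            # stand-in input at time t
    h = torch.tanh(x_t @ W_xh + h @ W_hh)

loss = h.sum()                         # stand-in for a real prediction loss
loss.backward()                        # backpropagation through all T time steps

print(W_hh.grad.shape)                 # torch.Size([8, 8]): one gradient per weight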

Yikes: The Vanishing Gradient Problem

Despite enjoying some initial success with the help of gradient descent and BPTT, many artificial neural networks, including the first generation of RNNs, eventually ran out of gas. Technically, they suffered a serious setback known as the vanishing gradient problem. Although the details fall way outside the scope of this sweeping overview, the basic idea is pretty straightforward. First, let's look at the notion of a gradient. Like its simpler relative, the derivative, you can think of a gradient as a slope. In the context of training a deep neural network, the larger the gradient, the steeper the slope, the more quickly the system can roll downhill to the finish line and complete its training. But this is where developers ran into trouble — their slopes were too flat for fast training. This was particularly problematic in the first layers of their deep networks, which are the most critical when it comes to proper tweaking of memory units. Here the gradient values got so small, and their corresponding slopes so flat, that one could describe them as "vanishing," thus the vanishing gradient problem. As the gradients got smaller and smaller, and thus flatter and flatter, the training times grew unbearably long. It was an error-correction nightmare without end.

The Big Breakthrough: Long Short-Term Memory

Finally, in the late 90s, a major breakthrough solved the vanishing gradient problem and gave a second wind to recurrent network development. At the center of this new approach were units of long short-term memory (LSTM).

As weird as that sounds, the long and short of it is that LSTM made a world of difference in the field of AI. These new units, or artificial neurons, like the standard short-term memory units of RNNs, remember their inputs from a moment ago. However, unlike standard RNN units, LSTMs can hang on to their memories, which have read/write properties akin to memory registers in a conventional computer. Yet LSTMs have analog, rather than digital, memory, making their functions differentiable. In other words, their curves are continuous and you can find the steepness of their slopes. So they are a good fit for the partial derivatives involved in backpropagation and gradient descent.
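
The gating machinery sounds exotic, but a single LSTM step is only a few lines. Below is an illustrative NumPy sketch (random weights, biases omitted, names invented), just to show how the forget, input, and output gates control the cell's stored memory:

import numpy as np

# One LSTM step, sketched for illustration only.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
W = {g: rng.normal(scale=0.1, size=(n_in + n_hid, n_hid)) for g in "fioc"}

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(z @ W["f"])              # forget gate: how much old memory to keep
    i = sigmoid(z @ W["i"])              # input gate: how much new info to write
    o = sigmoid(z @ W["o"])              # output gate: how much memory to expose
    c_tilde = np.tanh(z @ W["c"])        # candidate memory contents
    c = f * c_prev + i * c_tilde         # updated cell state (the long-term memory)
    h = o * np.tanh(c)                   # new hidden state (the short-term output)
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c)
print(h.shape, c.shape)                  # (8,) (8,)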

Altogether, LSTMs can not only tweak their weights, but retain, delete, transform and otherwise control the inflow and outflow of their stored data according to the quirks of their training. Most importantly, LSTMs can cling to important error information for long enough to keep gradients relatively steep and thus training periods relatively short. This wipes out the vanishing gradient problem and greatly improves the accuracy of today’s LSTM-based recurrent networks. Thanks to this remarkable improvement in the RNN architecture, Google, Apple and many other leading companies, not to mention startups, are now using RNNs to power applications at the center of their businesses. In short, RNNs are suddenly a big deal.

What to Remember about RNNs

Let’s recap the highlights of these amazing memory machines. Recurrent neural networks, or RNNs, can remember their former inputs, which gives them a big edge over other artificial neural networks when it comes to sequential, context-sensitive tasks such as speech recognition. However, the first generation of RNNs hit the wall when it came to their capacity to correct for errors through the all-important twin processes of backpropogation and gradient descent. Known as the dreaded vanishing gradient problem, this stumbling block virtually halted progress in the field until 1997, when a major breakthrough introduced a vastly improved LSTM-based architecture to the field. The new approach, which effectively turned each unit in a recurrent network into an analogue computer, greatly increased accuracy and helped lead to the renaissance in AI we’re seeing all around us today.

If you have enjoyed this post, the biggest compliment you could give would be to share this with someone that you think would enjoy it!

If you would like to see more articles like this, click the subscribe button and never miss a post. Have a great day and never stop learning!

12 Most Influential Books Every Software Engineer Needs to Read — June 21, 2017