# How can I teach my machine to learn?

In this latest column for our ongoing series on Deep Learning, we will consider the question, “How can I teach my machine to learn?” Like humans, machines learn from experience. They make observations from inputs of images, text, or other data, and then look for patterns. After the machine runs through the mathematical layers, it learns to make better decisions based on the examples it was given. The decision outputs can be continuous (for example, fluctuating prices), binary (“yes” or “no”), or categorical (such as recognizing an aircraft, tank, helicopter, or submarine). In the case of a categorical output, the resulting answer will be several variables (for example, attributes that describe the aircraft) instead of a single variable. The training environment for machine-learning applications will roughly fit into one of three major categories: supervised learning, unsupervised learning, or reinforcement learning. In this column, we will look at both supervised and unsupervised learning, plus the hybrid approach of semi-supervised learning. In our next column for this series, we will consider reinforcement-learning environments.

**Supervised learning: teach me**

Because it is both the easiest to understand and the simplest to implement, supervised learning has emerged as the most popular method for machine learning. Supervised learning uses labeled data on both the input and output, which enables the machine to learn the relationship between the inputs and the outputs. A label is a tag (or description) given to the objects in the data. One can think of the machine-learning algorithm as a child and the labeled data as flash cards the child uses to learn. Given a specific labeled input, the algorithm is able to predict the answer. The output label provides the algorithm with feedback that lets it know if its answer is correct. Over time, the algorithm is able to learn through observation and discover patterns in the data. After discovering the patterns, the algorithm is ready to tackle a new, previously unknown input, and based on that input, correctly provide the output. This approach to machine learning can be described as the machine version of “concept learning,” as defined by psychology.

There are two main types of supervised learning problems. The first problem type uses classification algorithms to predict a categorical result. The second type uses regression algorithms to predict a numerical label based on the strength of correlation between attributes. Both object recognition and gesture interpretation use classification algorithms.

Assume the task is to classify (or identify) a type of aircraft based on a photographic image. The first step is to consider the various attributes needed to identify an aircraft. Since the number of possible attributes is rather large, we’ll just focus on wings. Wing position would be one attribute, with options for top-mounted, mid-mounted, or bottom-mounted. Wing shapes could be straight, swept-back, delta, or semi-delta. Other wing attributes would include slant, taper, and wing-tip shape. These are of course only some of the wing attributes. Now imagine identifying additional attributes to describe the engines, fuselage, and tail, etc. To complicate matters, most combat aircraft can also carry different ordnance, fuel tanks, or pods, depending on their mission. It gets even more complex; keep in mind that the aircraft classification has to work effectively for all possible viewing aspects, including front, side, back, different rotation angles, and in all types of weather and backgrounds.

After the aircraft classification algorithm is designed, the model trained, and the desired accuracy verified, the classifier is ready to use. The accuracy of the algorithm is measured by the percentage of correctly classified items out of all predictions made. When fed an image, the algorithm will process that image by classifying each attribute and return the aircraft type that most closely matches the image’s attributes. For example, the model might report 85% certainty that the image is a fighter, with the remaining 15% matching a skywriting airplane. Common classification algorithms used for object identification include Naïve Bayes, logistic regression, Support Vector Machines (SVMs), and k-Nearest Neighbors (k-NN).
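To make the classification idea concrete, here is a minimal k-Nearest Neighbors sketch in pure Python. The two wing features and the toy training examples are hypothetical stand-ins for the much richer attribute set described above, not a real aircraft-recognition pipeline:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled examples."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Invented training data: (wing sweep in degrees, wingspan in meters) -> class
train = [
    ((45.0, 10.0), "fighter"),
    ((40.0, 11.0), "fighter"),
    ((5.0, 35.0), "transport"),
    ((8.0, 40.0), "transport"),
]

print(knn_predict(train, (42.0, 12.0)))  # → fighter (swept wing, short span)
```

In a real system the features would come from an image-processing front end, but the voting logic is the same: the labels on the training examples are the “flash cards” the algorithm learns from.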

Another subfield of supervised machine learning is regression analysis, which aims to model the relationship between selected input features and a continuous output variable. If a variable can take any value between a minimum and maximum value, it is a continuous variable. A discrete variable has a countable number of possible values. For example, a person’s exact age is a continuous variable, while their age measured in whole years is a discrete variable. The output of regression prediction problems is normally a quantity or size, while the input can be any mixture of continuous and discrete variables. A classic example of supervised regression is predicting used-car prices based on a set of attributes, such as mileage, age, brand, and location. A time-series forecasting problem is another regression problem, with input variables ordered by time.
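As a toy illustration of supervised regression, the sketch below fits an ordinary least-squares line to a single feature; the mileage and price numbers are invented for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Invented training data: mileage (thousands of miles) vs. sale price ($)
mileage = [10, 30, 50, 70, 90]
price = [27000, 23000, 19000, 15000, 11000]

slope, intercept = fit_line(mileage, price)
print(round(slope * 60 + intercept))  # → 17000 (predicted price at 60k miles)
```

A production model would of course use many features at once (mileage, age, brand, location), but the principle is identical: learn parameters from labeled examples, then predict a continuous value for a new input.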

A common performance measure for regression problems is Root Mean Square Error (RMSE), which measures the typical error made by the system in its predictions. Some algorithms, such as decision trees and random forests, can be deployed for both classification and regression with minor modifications. Sometimes, the user can choose whether to use classification or regression by manipulating the data. One method of data manipulation, known as discretization, divides the quantity into multiple discrete bins that have an ordered relationship. For example, used-car prices could be divided into three classes: less than $10,000, $10,000-$20,000, and over $20,000.
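Both the RMSE measure and the discretization trick fit in a few lines of Python; the prices and bin edges below simply reuse the used-car example:

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: the typical size of a prediction error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def price_bin(price):
    """Discretize a used-car price into three ordered classes."""
    if price < 10_000:
        return "under $10,000"
    if price <= 20_000:
        return "$10,000-$20,000"
    return "over $20,000"

actual = [12000, 18500, 25000]
predicted = [11000, 19000, 24000]
print(round(rmse(actual, predicted)))  # → 866 (typical error, in dollars)
print(price_bin(25000))                # → over $20,000
```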

Supervised machine learning tightly controls the inputs, outputs, and training, which makes it easy to define the goal of the training and the accuracy threshold by which success is measured. For example, the goal might be to determine if a ground-moving target is a tank, an Armored Personnel Carrier (APC), or a mobile missile launcher, with an accuracy of 97%.

Supervised machine learning also involves certain tradeoffs: one is that constructing a model with good accuracy requires a large amount of data. This effort can become labor-intensive, as a human must attach labels to the data and correct or remove any bad data in the process. Also, with supervised machine learning, the possible outputs are predefined, which means the machine’s ability to explore other possibilities and gain new insights into connections in the data is limited or eliminated completely.

**Unsupervised learning: I am a self-learner**

Unsupervised learning, sometimes described as a form of Hebbian learning, is essentially the opposite of supervised learning. Without labels and with no predefined outcomes, a machine doing unsupervised learning will search for underlying relationships in the data, such as hidden patterns and structures. In some cases, discovering the hidden patterns will itself be the end goal. Alternately, the user might select algorithms to compress data, organize the data by similarity, or detect abnormalities. Since the vast majority of data in the world is unlabeled, algorithms that can process terabytes of data into something usable are a major productivity enhancement. Based on the data and its properties, unsupervised learning is considered data-driven (as opposed to supervised learning, which is task-driven).

Unsupervised learning can be broken down into three categories of algorithms: clustering, dimensionality reduction, and association rules analysis. Clustering, along with density estimation — which determines the distribution of data in the sample set — is the most common class of algorithm in unsupervised models. It’s used, for example, in customer segmentation, fantasy-league statistical analysis, and the cyber profiling of criminals. In unsupervised clustering, the objects are sorted into groups whose members share a particular similarity. Unlike in supervised algorithms, groups are unlabeled, so the user must determine what the clusters represent. For example, the algorithm might sort military ground vehicles into groups based on whether the vehicle had tracks or wheels. The importance of clustering for unsupervised machine learning has resulted in a plethora of clustering algorithms for users to choose from.

These algorithms include:

- K-means: Data points are clustered into K exclusive groups
- Hierarchical Cluster Analysis (HCA): Data points are separated into parent and child clusters
- Expectation Maximization (EM): An iterative approach that cycles between two steps:
  - E-Step: Estimates the expected value for each unobserved variable
  - M-Step: Optimizes the parameters of the distribution using maximum likelihood
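A minimal, pure-Python sketch of K-means (Lloyd’s algorithm) shows the assign-then-update cycle; the points and starting centroids below are hypothetical, chosen so that two groups are obvious:

```python
def kmeans(points, centroids, iters=20):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    k = len(centroids)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Index of the centroid with the smallest squared distance to p
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty)
        centroids = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Invented 2-D feature measurements forming two obvious groups
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (8.0, 8.2), (7.9, 8.0), (8.2, 7.8)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print([len(c) for c in clusters])  # → [3, 3]
```

Real implementations typically initialize centroids randomly (or with a scheme like k-means++) and stop when assignments no longer change; fixed starting centroids are used here only to keep the example deterministic.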

A task related to clustering is dimensionality reduction, which is the process of distilling relevant information from the chaos of the unlabeled data set, and then removing any unnecessary information. As data sets grow larger, both in the number of samples and variables, it becomes harder to guide the model in the right direction. Dimensionality reduction decreases the complexity of the data, enabling the model to run faster, take up less memory space, and in some cases, exhibit better performance.

The problem is how to reduce the dimensions of the data without losing important information in the data set. One of the more common dimensionality reduction algorithms, Principal Component Analysis (PCA), finds underlying variables that best differentiate the data points. Principal Components (PC) are the dimensions along which the data points have the greatest spread on a graph. A PC can consist of one or more variables. If a feature is the same across a majority of the objects, it is not useful in characterizing the data. For example, say you want to compare single-malt Scotch whiskies. The characteristics could include region, cost, color, body, level of peat, fruit, brine, etc. Looking at single-malt Scotch whiskies in the PCA space, the whiskies will naturally cluster by the six geographical regions as represented on the x-axis. For comparison, Scotch whisky from the Campbeltown and Islay regions will be salty and smoky with heavy peat, while Speyside whisky will have a lighter, smoky taste with a mixture of peat levels. Lowland and Highland whiskies will be smoother with more fruit flavors. The salt and smoky attributes relate to location and are therefore redundant. Combining the peat level with the fruitiness for the second PC will spread the whisky types within each geographic grouping. PCA finds the characteristics, including combinations of variables, that best summarize the data and allow it to predict or “reconstruct” the objects.
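For two-dimensional data, the first principal component can be computed directly as the leading eigenvector of the covariance matrix. The sketch below uses invented peat/fruitiness scores that vary together, so the leading direction captures their shared spread:

```python
import math

def first_principal_component(points):
    """Return the unit vector along which 2-D data points spread the most
    (the leading eigenvector of the 2x2 covariance matrix)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Larger eigenvalue of [[sxx, sxy], [sxy, syy]]
    lam = (sxx + syy + math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2)) / 2
    if abs(sxy) > 1e-12:
        vx, vy = sxy, lam - sxx
    else:  # data already axis-aligned
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    if vx < 0:  # fix the sign ambiguity for a deterministic result
        vx, vy = -vx, -vy
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Invented whisky scores: (peat level, fruitiness), strongly anti-correlated
points = [(1, 9), (3, 7), (5, 5), (7, 3), (9, 1)]
print(first_principal_component(points))  # ≈ (0.707, -0.707)
```

The result says the data varies most along the “more peat, less fruit” diagonal, so that single combined axis summarizes both original variables; full PCA libraries generalize this to any number of dimensions.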

Another cornerstone of unsupervised machine learning is the use of association rule learning to enable predictive analytics. Association rule learning involves a series of techniques applied to discover interesting relations between attributes. These relations can provide a basis for making predictions and calculating the probabilities of certain events based on the occurrence of other events. For example, an association rules algorithm run by a grocery store on previous sales might reveal that people who buy potato chips and barbecue sauce also tend to buy beer. Given this information, the store manager could design an endcap promotion featuring all three of these products. Two of the most popular association rule-learning algorithms are the Apriori algorithm, an iterative approach where previous item sets are used to find the next group of item sets, and the Equivalence Class Clustering and bottom-up Lattice Traversal (ECLAT) algorithm. While the Apriori algorithm works like a breadth-first search, ECLAT emulates a depth-first search, which often makes it more efficient.
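The support and confidence calculations that underpin algorithms like Apriori can be sketched directly; the market-basket transactions below are hypothetical:

```python
transactions = [
    {"chips", "bbq sauce", "beer"},
    {"chips", "bbq sauce", "beer", "bread"},
    {"chips", "milk"},
    {"bbq sauce", "beer"},
    {"chips", "bbq sauce", "beer", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent): how often the rule holds when it applies."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: {chips, bbq sauce} -> {beer}
print(confidence({"chips", "bbq sauce"}, {"beer"}))  # → 1.0
```

Apriori’s key optimization is pruning: an itemset can only be frequent if all of its subsets are frequent, so support counting like the above is only performed on candidates built from the previous round’s frequent itemsets.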

Compared to supervised learning, a machine using unsupervised learning faces more challenges, since it doesn’t have predefined outcomes and feedback for correct answers. Determining the success of an unsupervised algorithm is also more difficult without “ground truth.” Even with these limitations, unsupervised algorithms can produce high-quality results when given free rein to discover the patterns by themselves. Also, unsupervised learning works on unlabeled datasets, which is another plus given that most of the data in the world is unlabeled, and the process of annotating large amounts of data can be very time consuming and prohibitively expensive.

**Semi-supervised learning: I’ll get by with a little help from my friends**

Semi-supervised learning, which straddles the line between supervised and unsupervised learning, takes advantage of both labeled and unlabeled data for training. In the real world, a small amount of labeled data often sits within a vast majority of unlabeled data. A young child who learns from parents and teachers while also discovering knowledge on their own provides a good example of semi-supervised learning. Assume that child has a Labrador dog as a pet, and the parents teach the child that the animal is a dog (the label). Later, when visiting a relative who has a Pug for a pet, the child might successfully identify the very different looking animal as a dog based on observing the existing similarities (the unlabeled data).

Some practical applications for semi-supervised learning are speech analysis and Internet content classification, as well as DNA and RNA sequence classification. The labeled data helps identify that there are specific groups and provides hints of what they might be. Not surprisingly, most semi-supervised algorithms combine both supervised and unsupervised algorithms. For example, unsupervised Restricted Boltzmann Machines can be stacked to form unsupervised Deep Belief Networks (DBN), with supervised learning networks then used to tune the final system.
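One simple semi-supervised strategy, often called self-training, can be sketched as repeatedly copying a label onto the unlabeled point closest to any already-labeled one; the points, labels, and nearest-neighbor rule below are a hypothetical toy, not a production method:

```python
import math

def self_train(labeled, unlabeled):
    """Toy self-training: repeatedly give the closest unlabeled point the
    label of its nearest labeled neighbor, then treat it as labeled."""
    labeled = list(labeled)      # list of (point, label) pairs
    unlabeled = list(unlabeled)  # list of points
    while unlabeled:
        # Find the (unlabeled, labeled) pair with the smallest distance
        u, (l, lab) = min(
            ((u, l) for u in unlabeled for l in labeled),
            key=lambda pair: math.dist(pair[0], pair[1][0]),
        )
        labeled.append((u, lab))
        unlabeled.remove(u)
    return dict(labeled)  # point -> label

labeled = [((1.0, 1.0), "dog"), ((9.0, 9.0), "cat")]
unlabeled = [(1.5, 1.2), (2.0, 1.8), (8.5, 8.8)]
labels = self_train(labeled, unlabeled)
print(labels[(2.0, 1.8)])  # → dog
```

Like the child and the Pug, the model propagates its few confident labels outward through the structure of the unlabeled data; practical systems use a real classifier and a confidence threshold in place of raw nearest-neighbor distance.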

This column aimed to provide a better understanding of supervised and unsupervised machine learning and the hybrid approach of semi-supervised machine learning. In addition to providing examples, we referenced the most common algorithms used in each category, as well as the pros and cons of each type of machine learning. In the next column, we will discuss reinforcement learning. In the meantime, you might consider buying some potato chips, some nice hoppy beer, and barbecue sauce for your weekend grilling, or alternately, consider performing some unsupervised tasting of single-malt Scotch whiskies.