Unsupervised Modeling (1/2)
In this lesson, you’re expected to:
– learn about unsupervised machine learning
– understand the basics of clustering
Machine Learning: Beyond Supervised Learning
In the previous lessons, we learned about supervised learning. As we mentioned, in supervised learning we have historical data where each observation is labeled. The model learns from that historical data and can then estimate the label of unseen or future observations.
Examples of this kind of problem are churn detection or accident rate prediction.
There are other machine learning problems. In fact, we have three major machine learning classes:
1) Supervised learning
Algorithms that learn from a training set of labeled examples and generalize their predictions to unseen data. Examples of supervised learning problems include regression and classification.
2) Unsupervised Learning
Algorithms which learn from a dataset of unlabeled observations. These algorithms use the features of the inputs to group observations together according to some statistical criteria. They group observations that are similar.
Examples of unsupervised learning include k-means clustering and hierarchical clustering. A typical unsupervised problem is customer segmentation. Companies across several sectors are interested in grouping similar customers together.
3) Reinforcement Learning
This kind of learning is inspired by behavioral psychology. These algorithms learn via reinforcement from a critic that provides information on the quality of a solution, but not on how to improve it.
Unsupervised learning is a family of algorithms which learn from unlabeled data, using the features of each observation to group them together according to some statistical/geometric criteria.
The main kinds of unsupervised learning problems are:
– Clustering: partition examples into groups when no pre-defined categories/classes are available.
– Novelty detection: find changes in data, i.e. new observations that differ from the data seen so far.
– Outlier detection: find unusual events (e.g. malfunction).
In business settings, the most widely used unsupervised task is clustering.
While in supervised learning we had predefined classes, in clustering problems we don’t have predefined classes; we limit ourselves to grouping similar observations together.
Hence, an important difference is that in unsupervised learning, after grouping the observations into classes, we need to make sense of these newly obtained groups, as we have no previous definition of them. Thus, we need to understand and define the resulting clusters, and make sure these clusters make business sense.
In a clustering problem, we group unlabeled observations into disjoint subsets, called clusters, such that:
– Examples within a cluster are similar to each other (high intra-cluster similarity).
– Examples in different clusters are different from each other (high inter-cluster difference).
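The two criteria above can be checked numerically. The following is a minimal sketch, assuming Euclidean distance as the measure of (dis)similarity and using made-up 2-D points that have already been split into two clusters; the data and the function names are hypothetical and serve only as an illustration.

```python
import math
from itertools import combinations, product

# Hypothetical 2-D observations, already split into two clusters for illustration.
cluster_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)]
cluster_b = [(8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]

def mean_intra_distance(cluster):
    """Average distance between all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_distance(first, second):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(first, second))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

intra_a = mean_intra_distance(cluster_a)
intra_b = mean_intra_distance(cluster_b)
inter_ab = mean_inter_distance(cluster_a, cluster_b)

print(f"intra A: {intra_a:.2f}, intra B: {intra_b:.2f}, inter A-B: {inter_ab:.2f}")
```

A good clustering shows small intra-cluster distances (examples inside a cluster are similar) and a large inter-cluster distance (examples in different clusters are different).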
Applications of Clustering
Clustering helps to discover new categories in an unsupervised manner. It has many business and research applications. Some of them are the following:
– Group individuals with the same political view
– Categorize documents of similar topics
– Group genes that perform similar functions
– Customer segmentation, grouping similar customers together
– Social network analysis. For example, grouping similar tweets together or detecting groups of people with the same interests.
Most clustering algorithms rely on a notion of distance. Hence, we need to define a distance between observations, and the algorithm will then group together observations that are close to each other.
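As an illustration of what "defining a distance" means, here is a minimal sketch that assumes the Euclidean distance, which is only one possible choice. The customers, their features, and the helper names are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two observations with the same features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical customers described by (age, monthly spend in hundreds of euros).
customers = {
    "ana": (25, 3.0),
    "ben": (27, 3.5),
    "cleo": (60, 12.0),
}

def closest_to(name):
    """Return the other customer at the smallest distance from `name`."""
    others = (other for other in customers if other != name)
    return min(
        others,
        key=lambda other: euclidean_distance(customers[name], customers[other]),
    )

print(closest_to("ana"))
```

Note that the result depends on the units of each feature: a feature measured on a larger scale weighs more in the distance, so features are often rescaled before clustering.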
In the next lesson, we will review two widely used clustering techniques:
– k-means clustering
– hierarchical clustering
To understand these techniques, we will use a classic business case: customer segmentation.
In customer segmentation, we aim to group similar customers together. We can thus obtain groups of users that are easier to characterize and understand than individual customers.
For example, by grouping customers based on their credit card activity, we will find groups of customers that buy similar products and therefore have similar lifestyles.
Some clusters we may expect to find could be:
– Runners / Gym addicts: People who buy lots of sports products.
– Football lovers: They go to every match of their favorite team.
– Beer lovers: They go to bars a lot.
– Fashion addicts: They spend lots of money on clothes and expensive brands.
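Before customers can be grouped by their credit card activity, that activity has to be summarized as one feature vector per customer. The following is a minimal sketch of one possible way to do it, assuming a made-up list of transactions and three hypothetical merchant categories; each customer is described by the share of their spend in each category.

```python
from collections import defaultdict

# Hypothetical transactions: (customer_id, merchant_category, amount).
transactions = [
    ("c1", "sports", 120.0),
    ("c1", "sports", 80.0),
    ("c1", "bars", 15.0),
    ("c2", "fashion", 300.0),
    ("c2", "fashion", 150.0),
    ("c3", "bars", 40.0),
    ("c3", "bars", 35.0),
    ("c3", "sports", 25.0),
]

# Fixed feature order, so every customer vector has the same layout.
categories = ["sports", "bars", "fashion"]

def spending_profiles(rows):
    """Aggregate transactions into each customer's share of spend per category."""
    totals = defaultdict(lambda: defaultdict(float))
    for customer, category, amount in rows:
        totals[customer][category] += amount
    profiles = {}
    for customer, by_category in totals.items():
        total = sum(by_category.values())
        profiles[customer] = [by_category[c] / total for c in categories]
    return profiles

profiles = spending_profiles(transactions)
for customer, profile in profiles.items():
    print(customer, [round(share, 2) for share in profile])
```

With vectors like these, a clustering algorithm could place customers whose spend is concentrated in sports in one group and those concentrated in fashion in another, which is the kind of segment listed above.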