Unsupervised Modeling (2/2)

In this lesson, you’re expected to:
– understand K-Means Clustering
– learn about Hierarchical Clustering

K-Means is a hard partitioning algorithm: it assigns each element of the feature space to exactly one cluster. There are other clustering algorithms in which an observation can belong to multiple clusters. In K-Means, the number of clusters is an input parameter, so we need to know how many groups we expect to find in our data.

The algorithm will not detect the optimum number by itself. We may choose this parameter based on our business experience/knowledge and/or some statistical methods that we will discuss later.

Process of K-Means Clustering

The K-Means algorithm works as follows:

Step 1: Select the number of clusters we want to divide our data into. This means we need to determine K.

Step 2: Initialize the K cluster centers. A simple strategy could be to select them randomly.

Step 3: Assign a class to the N data samples. We do so by assigning them to the nearest cluster centroid.

Step 4: Re-estimate the K cluster centers. We compute the new cluster center as the center of all the observations assigned to that cluster.

Step 5: If none of the N observations changed membership in the last iteration, the algorithm ends. Otherwise, you need to go back to Step 3.
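
To make these steps concrete, here is a minimal sketch in Python using scikit-learn. The feature matrix X and the choice of K = 3 are illustrative assumptions, not data from the lesson.

```python
# Minimal K-Means sketch (illustrative; X and K=3 are assumed for this example).
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature space: N = 200 observations with 2 features each.
X = np.random.RandomState(42).rand(200, 2)

# Step 1: choose K up front; K-Means will not pick it for us.
# init="random" mirrors Step 2 (random initialization of the K centers).
kmeans = KMeans(n_clusters=3, init="random", n_init=10, random_state=42)

# Steps 3-5 (assign, re-estimate centers, repeat until no membership changes)
# all happen inside fit_predict().
labels = kmeans.fit_predict(X)        # hard assignment: one cluster per observation
centroids = kmeans.cluster_centers_   # the K final cluster centers

print(labels[:10])
print(centroids)
```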

The following figure represents the final output of a K-Means clustering, where the observations have been grouped into three clusters. The 2-D regions represent the frontiers between clusters.

Advantages of K-Means Clustering

One of the advantages of this method is that we can not only classify each customer according to their segment, but also compute the distance from each customer to other segments.

For example, we may classify a bank customer as a fashion addict. But is this fashion addict similar to, or very different from, customers who like football? If we compute the distance from this individual to the centroid of the football cluster, we can find the answer. We might find that Customer A is classified as a fashion addict and probably does not like football, as his distance from that centroid is 100. However, Customer B, who is also classified as a fashion addict, may like football, as his distance to that cluster is only 10.
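
A minimal sketch of how such distances could be obtained, on illustrative data: in scikit-learn, KMeans.transform() returns the distance from each observation to every centroid, so a customer's own segment and their distances to the other segments can be read from the same matrix. The customer row used below is hypothetical.

```python
# Distance from each customer to every segment's centroid (illustrative data).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(42).rand(200, 2)                      # hypothetical customer features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

labels = kmeans.labels_           # the segment each customer belongs to
distances = kmeans.transform(X)   # shape (N, K): column j = distance to centroid j

# For a hypothetical customer (row 0): own segment vs. distance to the other segments.
print("segment:", labels[0])
print("distances to all centroids:", distances[0])
```
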
Another advantage of clustering customers is that we can summarize the information of each cluster and use it to extract patterns, and characterize individuals by their group behavior.
An example is shown in the following figure, which shows the differences in purchase behavior across 4 detected clusters. If we were to build that figure/table for every individual customer of the bank, it would be so large that the human brain could not extract any relevant information from it.
Image Source: optimove.com
Enlarged version: http://bit.ly/2oGv7l6

However, by performing a K-Means clustering, we are able to detect four groups and, by extracting the patterns of each cluster, understand their behavior and make decisions.

We have reduced our problem: the purchase behavior is summarized into 4 differentiated patterns, and each customer is assigned one of these patterns.
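
A short sketch of how such per-cluster patterns might be summarized, assuming a hypothetical table of purchase features (the column names and data are made up for illustration):

```python
# Summarize purchase behavior per detected cluster (hypothetical columns and data).
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
purchases = pd.DataFrame(rng.rand(200, 3),
                         columns=["groceries_spend", "fashion_spend", "sports_spend"])

# Assign each customer to one of 4 clusters, then attach the label to the table.
purchases["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(purchases)

# Average behavior of each cluster: a table small enough for a human to read and act on.
print(purchases.groupby("cluster").mean())
```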

[Optional] Customer Segmentation via Cluster Analysis

How do I select the number of clusters (K)?

To answer this question, our previous business experience and our domain knowledge will be vital.

Why? An important reason is that after clustering the observations, we will have to make a business interpretation of them, and give them some meaning.

However, there are also techniques that help us find the statistically optimum value of K. The elbow technique is one of them.

What is the Elbow Technique?

By the definition of clustering, we want clusters to be as compact as possible. Compactness can be measured as the distance of the members of a cluster to its centroid. Obviously, with more clusters the total distance of the observations to their centroids will always decrease, tending to 0 when K tends to N.

Why? If we have N clusters, we have as many clusters as observations. Hence, each cluster contains a single observation and its centroid sits exactly on top of that observation, so the total distance is 0. The elbow technique can help us find the optimum value of K.

In this technique, we track compactness against the number of clusters and distinguish two phases in the process.

Initially, the average distance of the observations to their centers decreases dramatically. At some point there is a threshold after which it stabilizes slowly. The elbow technique consists of selecting the value of K where this transition (the "elbow") occurs. This point cannot always be found easily, though.
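
A minimal sketch of the elbow technique on illustrative data; scikit-learn exposes the total within-cluster squared distance as inertia_, which serves as the compactness measure described above.

```python
# Elbow technique sketch: compactness (inertia) vs. number of clusters K.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.RandomState(42).rand(200, 2)   # hypothetical feature space

ks = range(1, 11)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)          # total within-cluster squared distance

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("total within-cluster distance (inertia)")
plt.title("Elbow technique")
plt.show()
```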

Hierarchical Clustering

This method seeks to build a hierarchy of clusters. It is based on a very simple principle: if we can measure the distance between observations, then by grouping the closest ones we can obtain the clusters.

The idea is that we repeatedly find the two most similar clusters and merge them together. Thus, we not only end up with clusters but also with a hierarchy of clusters.

To understand this, let’s look at an example. 

Suppose we have 6 observations represented in a 2-D space. Let’s group them using hierarchical clustering. You can see the distances between them in the figure below.

The first step is to determine which elements to merge into a cluster. Usually, we want to take the two closest elements. In this case, we will merge the following elements:
– pair node0-node1
– pair node2-node3

So now we have the cluster [node0-node1], which we will label c1; the cluster [node2-node3], which we will label c2; node4; and node5.

Now, we will merge the next closest elements. In this case, we will merge clusters c1 and c2, and label the result c12.

Then we will merge c12 and node4 and label it c12-4. Finally, we will merge node5 and cluster c12-4, which we will call cT.

We can visualize the resulting hierarchical clustering in the form of a dendrogram (a tree diagram), as depicted below.

This enables us to cut the dendrogram at different levels and view the cluster structure of our data at different resolutions, as sketched below.
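
A short sketch of this process with SciPy, using six made-up 2-D points in place of node0..node5: linkage() repeatedly merges the two closest clusters, dendrogram() draws the tree, and fcluster() cuts it at a chosen level.

```python
# Agglomerative (hierarchical) clustering sketch with six made-up 2-D points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

points = np.array([[0.0, 0.0], [0.5, 0.2],    # node0, node1 (close together)
                   [3.0, 3.0], [3.2, 3.4],    # node2, node3 (close together)
                   [6.0, 1.0], [9.0, 9.0]])   # node4, node5

# Repeatedly merge the two closest clusters until a single cluster remains.
Z = linkage(points, method="single")

# Dendrogram: the full hierarchy of merges.
dendrogram(Z, labels=["node0", "node1", "node2", "node3", "node4", "node5"])
plt.show()

# Cut the tree at a chosen level to obtain, e.g., 3 flat clusters.
labels3 = fcluster(Z, t=3, criterion="maxclust")
print(labels3)
```
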
[Optional] Hierarchical Clustering: A Simple Explanation 