Evaluating a Business Predictive Model
In this lesson, you’re expected to:
– learn how to measure the performance of predictive models
– understand the problems of unbalanced datasets and overfitting
– discover how to evaluate a model using a confusion matrix
The perfect model does not exist – every model has an error component. Thus, if we have a predictive model that we want to run in production to aid decision making, we need to assess its performance. Before we can use its results to make business decisions or to select the customers to target with a campaign, we have to evaluate its uncertainty and limitations.
In summary, we need to know how good the model is. There are different criteria for measuring the performance of a supervised classifier, and the most appropriate metric depends on the problem we are trying to solve. There is no direct recipe.
The simplest such metric is accuracy: the number of correct predictions divided by the total number of observations, N. In the case of the churn problem, the number of correct predictions would be the sum of the customers who left the company (and the model predicted they would leave), plus those who stayed (and the model predicted that they would not leave).
N would correspond to the total number of phones in the dataset whose probability of churn you’re trying to predict. In our example, the target is a binary variable, as our question has a yes/no answer.
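A minimal Python sketch of this calculation: it divides the number of correct predictions by N. The labels are made up for illustration (1 = churn, 0 = no churn), and the function name `accuracy` is just a convenient choice, not part of any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the real labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Correct predictions: churners predicted as churners
    # plus non-churners predicted as non-churners.
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)


# Made-up labels for N = 10 customers: 1 = churn, 0 = no churn
y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]

print(accuracy(y_true, y_pred))  # 8 correct out of 10 -> 0.8
```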
Accuracy alone does not show how the errors are distributed. For example, if we have an accuracy of 95%, the 5% error could be evenly distributed across all the classes, or it could arise mainly because the model confuses two similar classes.
In both cases, the accuracy is the same but the scenario is completely different. Suppose we have a model that classifies news according to topic: sports, politics, and health. When the model fails in its prediction, what is the reason? Is every class equally likely to be misclassified? Or do the health and sports categories resemble each other, so that the model tends to confuse them but always classifies politics correctly?
To further evaluate multiclass problems, we can use a confusion matrix. Each element of the confusion matrix, M(i,j), represents the number of samples from class i predicted as class j. Although it is especially useful in multiclass problems, it is also widely used in binary problems, as it helps us to see in which cell of the matrix most of the error occurs.
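As an illustration, the following Python sketch builds M(i,j) for the news example, with rows for the real class and columns for the predicted class. The labels are invented, and `confusion_matrix` here is a small hand-written helper, not a library function.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Return M where M[i][j] counts samples of class i predicted as class j."""
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


labels = ["sports", "politics", "health"]

# Invented example: the model mixes up sports and health, never politics.
y_true = ["sports", "sports", "sports", "politics", "politics", "health", "health", "health"]
y_pred = ["sports", "health", "sports", "politics", "politics", "sports", "health", "health"]

M = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, M):
    print(f"{label:>8}", row)
```

In this made-up case all the errors sit in the sports/health cells, while the politics row has no off-diagonal counts, which is exactly the kind of pattern that accuracy alone would hide.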
To better understand the confusion matrix, let’s use it to evaluate our churn problem.
We have a churn model that predicts whether a customer will leave the carrier. To evaluate the model, we make predictions for last month’s data and compare them to the real data (whether the customer actually left the company or not). We have plotted the resulting confusion matrix in the figure.
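Let’s interpret this matrix. As the cell descriptions that follow indicate, each row corresponds to the model’s prediction and each column to the real value, with class 0 (no churn) listed first. This layout is the transpose of the M(i,j) definition given above, so always check which axis holds the predictions.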
The top left cell displays the number of observations where the model predicted that the customer would stay and they actually did. These are called True Negatives: the model predicted a 0 (no churn) and that was the real value.
In the top right cell, the matrix displays the number of observations where the model predicted that the customer was not going to leave, but they actually did. These are observations where the model failed to predict the correct value. They are usually called False Negatives.
The bottom left cell displays the number of observations for which the model predicted that the customer was going to leave, but they actually stayed. This means the model predicted a 1 (churn) and the real value was 0 (no churn). These are usually called False Positives.
Finally, the bottom right cell displays the number of cases where the model predicted that the customer would leave and they actually did. These are called True Positives, as the model correctly predicted a 1 (churn), which was the real value.
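The four cells can be counted with a few lines of Python. This is only a sketch with invented labels (1 = churn, 0 = no churn). It prints the counts in the same layout as the description above: rows for the prediction, columns for the real value, class 0 first. The function name is hypothetical.

```python
def churn_confusion_counts(y_true, y_pred):
    """Count TN, FN, FP and TP for binary labels (0 = no churn, 1 = churn)."""
    counts = {"TN": 0, "FN": 0, "FP": 0, "TP": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 0 and actual == 0:
            counts["TN"] += 1  # predicted stay, stayed
        elif predicted == 0 and actual == 1:
            counts["FN"] += 1  # predicted stay, left
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1  # predicted leave, stayed
        else:
            counts["TP"] += 1  # predicted leave, left (assumes labels are only 0 or 1)
    return counts


# Invented labels for 10 customers
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 1, 0, 1, 0]

counts = churn_confusion_counts(y_true, y_pred)
print("              actual 0  actual 1")
print(f"predicted 0   {counts['TN']:>8}  {counts['FN']:>8}")
print(f"predicted 1   {counts['FP']:>8}  {counts['TP']:>8}")
```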
An unbalanced dataset is one in which one of the classes is far larger than the other, meaning that the ratio between the sizes of the positive and negative classes tends to be very small. This is a common issue in business.
The churn case is a good example. In almost any telecom company or bank, the churn rate is very low, and therefore most of the customers are labeled with a 0 as they do not leave. Just a very small fraction of the customers leave a telco or bank every month. In these scenarios, a model that always predicts the majority class achieves good accuracy, even though it is uninformative.
So we have two main problems when modeling this unbalanced data:
– Gathering data from the majority class events is very easy but collecting data from minority class events is difficult and results in a comparatively small sample size.
– Accuracy fails to correctly measure the performance of a model on such datasets, so we need to use other performance metrics, such as sensitivity (recall) or positive predictive value (precision) on the minority class. This is because, if we have a dataset where just 2% of the customers leave, a simple model that predicts that everyone will stay will reach 98% accuracy. However, the performance of this model is actually very poor, as it fails to identify any customer who leaves.
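The 2% example can be checked numerically. The sketch below assumes a synthetic dataset of 1,000 customers, 20 of whom leave, and a trivial "model" that always predicts 0 (stay). The dataset size is an arbitrary choice for illustration.

```python
# Synthetic, unbalanced dataset: 2% churners (1) and 98% non-churners (0)
n_customers = 1000
n_churners = 20
y_true = [1] * n_churners + [0] * (n_customers - n_churners)

# Trivial baseline: predict that every customer stays
y_pred = [0] * n_customers

correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
accuracy = correct / n_customers

# Share of the real churners that the model manages to detect
detected = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == 1 and predicted == 1)
churners_detected = detected / n_churners

print(f"accuracy = {accuracy:.2f}")                    # accuracy = 0.98
print(f"churners detected = {churners_detected:.2f}")  # churners detected = 0.00
```

The accuracy looks excellent, yet not a single churner is identified, which is why a metric focused on the minority class is needed here.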
As we have already seen, accuracy can be uninformative in some problems. For this reason, we may use other performance measures, many of which can be derived from the confusion matrix.
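The cell positions given here assume a matrix with the positive class listed first, unlike the churn matrix described earlier, where class 0 comes first:
– True Positives (TP): Positive samples predicted as such. Top-left cell.
– True Negatives (TN): Negative samples predicted as such. Bottom-right cell.
– False Positives (FP): Negative samples predicted as positive. Top-right cell.
– False Negatives (FN): Positive samples predicted as negative. Bottom-left cell.
Two measures commonly derived from these counts are precision, TP / (TP + FP), and recall, TP / (TP + FN). Which one to favor depends on the business context.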
For example, if there are no budget restrictions in the campaign to retain customers, the manager may want to detect all customers that may churn even though it may increase the number of false positives.
In this case, we will look for a model that maximizes the recall. However, if the resources available to call customers in order to retain them are very limited, we may prefer to detect fewer positives, but make sure that all the predicted positives will actually be true positives. Thus, we will aim to maximize the precision.
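The trade-off between the two measures can be sketched in Python. The example assumes the model outputs a churn score between 0 and 1 and that we call a customer "positive" when the score reaches a chosen threshold. The scores, labels and thresholds are invented, and returning 0.0 when a denominator is zero is just one possible convention.

```python
def precision_recall(y_true, scores, threshold):
    """Precision = TP / (TP + FP) and recall = TP / (TP + FN) at a given threshold."""
    tp = fp = fn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
    # Convention chosen for this sketch: report 0.0 when the denominator is zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented labels (1 = churn) and model scores, sorted by score for readability
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.60, 0.55, 0.40, 0.35, 0.20, 0.10]

# Limited budget: contact only the customers the model is most sure about
print(precision_recall(y_true, scores, threshold=0.8))  # (1.0, 0.5)

# No budget restriction: contact many customers to catch every churner
strict_p, loose_r = precision_recall(y_true, scores, threshold=0.3)
print(round(strict_p, 2), loose_r)  # 0.67 1.0
```

With the high threshold every contacted customer is a real churner, but half of the churners are missed. With the low threshold all churners are caught, at the price of some false positives.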
The Problem of Overfitting
Let’s now introduce some nomenclature that will be useful in the following sections. As we already explained, when modeling, we have training data, which we use to train or fit our model, and test data, which we use to evaluate its performance. Accordingly, we can define two types of error:
– In-sample error: Also known as the training error, this is the error of the model on the training data.
– Out-of-sample error: Also known as the generalization error. This metric measures the expected error of the model when making predictions on unseen data, that is, data that has not been used to train the model. We usually approximate this quantity by holding out some of the available data for testing or validation purposes. For example, in the churn case we could train with a random sample of 70% of the data and reserve the remaining 30% for testing.
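A random 70/30 hold-out split can be sketched in plain Python as follows. The function name, the fixed seed and the use of integers as stand-ins for customer records are all assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)      # copy, so the caller's data is not modified
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for 100 customer records
customers = list(range(100))
train, test = train_test_split(customers)

print(len(train), len(test))  # 70 30
```

The test rows are never shown to the model during fitting, so the error measured on them serves as an estimate of the out-of-sample error.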
It’s important to remember that in predictive modeling, our main goal is to minimize the generalization error. Our aim is to correctly predict new unseen observations!
However, because the model is fitted to the training data, the generalization error will in general not be smaller than the training error:
Eout ≥ Ein
Thus, we need to minimize the training error while keeping the generalization error as close as possible to it. A model with a low training error but a much higher generalization error is said to overfit the training data.
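To make the gap between the two errors concrete, here is a deliberately extreme Python sketch: a hypothetical `Memorizer` model that stores every training example and falls back to the majority class for anything it has not seen. The data is synthetic (random labels attached to customer ids), so there is nothing real to learn. The model still reaches a training error of zero, while its error on the held-out data is whatever the majority-class guess happens to give.

```python
import random
from collections import Counter


class Memorizer:
    """Toy model that memorizes the training set (an extreme case of overfitting)."""

    def fit(self, X, y):
        self.table = dict(zip(X, y))                     # exact lookup of seen examples
        self.default = Counter(y).most_common(1)[0][0]   # majority class for unseen ones
        return self

    def predict(self, X):
        return [self.table.get(x, self.default) for x in X]


def error_rate(y_true, y_pred):
    """Fraction of wrong predictions."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)


# Synthetic data: customer ids as the only feature, random labels (about 30% churn)
rng = random.Random(0)
X = list(range(200))
y = [1 if rng.random() < 0.3 else 0 for _ in X]

# 70% for training, 30% held out for testing
X_train, y_train = X[:140], y[:140]
X_test, y_test = X[140:], y[140:]

model = Memorizer().fit(X_train, y_train)
e_in = error_rate(y_train, model.predict(X_train))   # in-sample (training) error
e_out = error_rate(y_test, model.predict(X_test))    # estimate of out-of-sample error

print(f"E_in = {e_in:.2f}, E_out = {e_out:.2f}")
```

A training error of zero says nothing about how the model behaves on new customers, which is why the held-out estimate is the number to watch.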