Selecting the Best Model

In this lesson, you’re expected to:
– learn the factors you need to consider in model selection
– understand the basics of supervised machine learning
– explore the bias-variance tradeoff

Model Selection

When analysts are modeling, they usually do not have a single algorithm but a set of algorithms that a priori* are suitable for the problem. Moreover, each algorithm has many parameters that can be tuned for performance. So how can we choose the algorithm that will perform best at solving our task?

* a priori: relating to or denoting reasoning or knowledge that proceeds from theoretical deduction rather than from observation or experience.

To this end, we rely on the concepts explained in the previous slides. We want a model whose in-sample error is as small as possible, and where the gap between the test and training errors tends to zero. Because the training error is the quantity we can drive down directly, we may be tempted to minimize it too aggressively and overfit the model.

So remember to look for models with a low generalization error. A model with a low training error but a huge gap between it and the test error is not a good solution. When a model is overfitted, it captures not only the underlying patterns but also the noise. Thus, it will have a very low training error, but will fail to correctly predict unseen data.
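
The gap between training and test error can be illustrated with a small sketch. Everything here is an assumption made for illustration: the synthetic data, the 20% label noise, and the two hypothetical models (a "memorizer" that copies the label of the closest training point, and a hand-written single-threshold rule that is not fitted to the data).

```python
import math
import random


def error_rate(predict, X, y):
    """Fraction of observations whose predicted label differs from the true one."""
    return sum(predict(x) != label for x, label in zip(X, y)) / len(y)


def make_data(n, rng, noise=0.2):
    """Points in the unit square; the pattern is 'blue if x > 0.5', plus label noise."""
    X, y = [], []
    for _ in range(n):
        point = (rng.random(), rng.random())
        label = "blue" if point[0] > 0.5 else "green"
        if rng.random() < noise:  # random component: flip the label
            label = "green" if label == "blue" else "blue"
        X.append(point)
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(200, rng)
X_test, y_test = make_data(200, rng)


def memorizer(x):
    """Overfitted model: copy the label of the closest training point, noise included."""
    nearest = min(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    return y_train[nearest]


def simple_rule(x):
    """Simple model (hand-written, not fitted): one threshold on the first coordinate."""
    return "blue" if x[0] > 0.5 else "green"


for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    train_error = error_rate(model, X_train, y_train)
    test_error = error_rate(model, X_test, y_test)
    print(f"{name}: train={train_error:.2f} test={test_error:.2f} "
          f"gap={test_error - train_error:.2f}")
```

The memorizer reproduces every training label, so its training error is zero, yet it has also memorized the flipped labels and pays for that on the test set. The simple rule has a higher training error but a much smaller gap.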

If we take a look at the following figure, Panel A represents a model that is not overfitted, while Panel B represents an overfitted model.

In the figure, we have blue and green points, and the model needs to classify each point into its group based on the values of its two coordinates, x and y.

Why is Panel B overfitted? The model in Panel B is very complex, with a non-smooth frontier. It even has little islands isolating individual points. In the center of the figure we can see an island in the middle of the green dots where the model predicts that a blue observation is more probable. It does so because there is an isolated blue point in that particular space.

This is a perfect example of overfitting. The model has captured not only the patterns but also the noise or random component of the data, as that point is blue just due to chance or randomness. However, if a new observation appears in that area it would be much more probable that it would be a green point. So how do we find the best model and avoid overfitting?

Suppose we have a set of different classifiers and want to select the “best” one. We want to select the one that yields the lowest error rate. To this end, we use cross-validation techniques. There are several kinds of cross-validation, such as leave-one-out or K-fold cross-validation.

– Leave-one-out: given N observations, the model is trained with N−1 of them and tested with the remaining one. This is repeated N times, once per held-out observation, and the results are averaged.

– K-fold cross-validation: the training set is divided into K non-overlapping splits. K−1 splits are used for training and the remaining one is used for evaluating the performance of the model. This process is repeated K times, leaving a different split out each time, and the results are averaged.
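
Both procedures can be sketched with one function, since leave-one-out is K-fold with K equal to the number of observations. This is a minimal illustration, not a reference implementation: the helper names, the toy data, and the hypothetical "majority label" classifier are all assumptions introduced here.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Split the indices 0..n-1 into k non-overlapping folds (requires k <= n)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def cross_val_error(fit, predict, X, y, k, seed=0):
    """Average error rate over k folds; k == len(X) gives leave-one-out."""
    fold_errors = []
    for held_out in k_fold_indices(len(X), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(X)) if i not in held_out_set]
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / k


# Hypothetical toy classifier: always predicts the most common training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    return model


X = [[i] for i in range(10)]
y = ["green"] * 7 + ["blue"] * 3

five_fold_error = cross_val_error(fit_majority, predict_majority, X, y, k=5)
loo_error = cross_val_error(fit_majority, predict_majority, X, y, k=len(X))
print(five_fold_error, loo_error)
```

To compare several candidate classifiers, the same function would be called once per candidate with the same folds, and the candidate with the lowest averaged error would be preferred.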

A good recommendation when modeling is the KISS rule!

So whether we are modeling a business problem or evaluating the models of our data science team, we need to keep in mind that, in general, if a problem can be solved with a simpler model, we should not use a more complex one.

This is the famous KISS principle – Keep It Simple, Stupid!

When we have very little training data, the training error is very small but the test error is large. As the number of observations increases, the training error gets larger and the test error decreases, so the gap between them closes and the two errors converge.

In this picture, the value towards which both errors converge is called the bias, and the difference between this value and the test error is called the variance.

To understand the difference between bias and variance, think about the following analogy. 

Suppose you are playing darts and your aim is to hit the center of the board. If you have no bias and a high variance, your shots are centered on the target but quite dispersed, like the left panel of the figure below. The shots of another player with bias and small variance might instead be shifted to the left because he is left-handed, landing tightly concentrated around number 11. This player is very precise but is hitting the wrong spot, which is what the right panel of the figure shows.

If our algorithm shows high bias, we can:

– Add more predictors. Our model is missing relevant information, thus building new variables that further characterize the observations can help to reduce the bias.

– Use a more sophisticated model. High bias usually means poor performance. Maybe we are using a model that is too simple to properly fit the training data. We can use a more complex family of models or adjust the hyper-parameters of the model to increase its complexity.

On the other hand, if our algorithm shows high variance, we can:

– Use fewer predictor variables. We could use a feature selection or dimensionality reduction technique to reduce the dimension of our data. This can help decrease the overfitting of a model.

– Use a simpler model. High variance usually means we are overfitting the model. We can use a simpler model or adjust the hyper-parameters of the model to decrease its complexity.

– Use more training samples. If no more observations are available, over-sampling techniques are one option.

– Use ensemble techniques. Some ensemble techniques, such as bootstrap aggregation (bagging), are specifically designed to reduce classification variance. Ensemble techniques combine the predictions of several simpler models to make a more robust prediction.

Other concepts to keep in mind when modelling for business

Of course, the main goal of any learning process is to achieve maximum predictive power. Thus, we try to minimize the error of the model. However, there are other important properties that we need to consider:

– Simplicity. How many variables does my model have? How complex is it to build them? Can I think of a simpler model with similar predictive power?

– Speed. How long does it take to train the model and to make predictions? Can I use it in real-time applications?

– Interpretability. Why did the model predict that a customer will leave the carrier? In business it is often desirable to sacrifice some predictive power in order to have an interpretable model, because it is also important to understand why customers decide to leave the company. In the churn case, the commercial department will usually prefer an interpretable model, where we can understand why the model decides whether a customer will leave, over a black box with a higher accuracy rate.

However, accuracy trades off against all three of these properties. So we will need to find a balance and decide on the best option depending on the business needs.

Nearest Neighbor Rule

K-Nearest Neighbors (KNN) belongs to a family of models that make each prediction by evaluating a function of the query point and the training data, rather than by fitting a fixed set of parameters in advance.

Nearest Neighbors is a very simple algorithm that treats each training observation as a solved case. When it needs to predict the value of a new, unseen observation, it assigns the label of the most similar example in the training set. The parameter that can be tuned is k.

K is the number of neighbors that are used to make the prediction. In a 1-NN, the estimation is made just based on the most similar training point. However, in a 3-NN, the prediction is based on the three nearest training data points.
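
The difference between 1-NN and 3-NN can be seen in a short sketch. The function name, the toy points, and the use of Euclidean distance as the similarity measure are assumptions made for illustration; the text does not specify a distance.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance is an assumption; any similarity measure could be used.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(X_train, y_train)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: a green cluster, a blue cluster, and one blue point close to the greens.
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0), (6.0, 6.0), (7.0, 6.5)]
y_train = ["green", "green", "green", "blue", "blue", "blue"]

query = (2.6, 2.6)
label_1nn = knn_predict(X_train, y_train, query, k=1)
label_3nn = knn_predict(X_train, y_train, query, k=3)
print(label_1nn, label_3nn)
```

With this data, 1-NN follows the single closest point and answers blue, while 3-NN is outvoted by the two nearby green points and answers green. A small k reacts to individual points, which is how isolated observations can end up shaping the prediction.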

Decision Trees 

Decision trees are another simple family of models based on the divide and conquer strategy. The basic idea underlying decision trees is to partition the space into regions.

In classification trees, each region is given the value of a label and all data falling in that part of the space will be predicted as such.

The figure below depicts a 2-D space partitioned by a tree model. In this case, the model tries to classify points into blue and green. This model has performed splits or partitions of the space. All observations falling into the blue area will be classified as blue, and the rest as green. Note that in this particular example, the model would misclassify a green point (it is classified as blue).

The figure below shows a decision tree. Each internal node tests one of the predictors. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf.

In this example (predicting survival on the Titanic), the algorithm tries to predict whether someone will survive. The first partition is done by sex: the algorithm predicts that all females survive. For males, it bases its decision on age and sibsp (number of siblings or spouses aboard). According to the tree, all males older than 9.5 are predicted to die.
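
Because a tree is a sequence of nested tests, the path from the root to a leaf can be written as plain conditionals. The sketch below follows the splits described above; the function name is hypothetical, and the sibsp threshold of 2.5 and the outcome on each side of it are assumptions, since the text only says that this branch depends on sibsp.

```python
def predict_survival(sex, age, sibsp):
    """Walk the tree from the root to a leaf and return the predicted outcome."""
    if sex == "female":
        return "survived"  # first split: all females are predicted to survive
    if age > 9.5:
        return "died"  # males older than 9.5 are predicted to die
    # Assumed split: the threshold and direction are not given in the text.
    if sibsp > 2.5:
        return "died"
    return "survived"


passengers = [("female", 30, 0), ("male", 40, 1), ("male", 5, 1), ("male", 5, 4)]
for sex, age, sibsp in passengers:
    print(sex, age, sibsp, "->", predict_survival(sex, age, sibsp))
```

Reading the prediction off a handful of if-statements is what makes a single tree easy to explain to a non-technical audience.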

The Random Forest Technique

This is an ensemble learning algorithm. This kind of algorithm trains a set of weak classifiers and aggregates their results to build a final prediction.

The Random Forest technique constructs a multitude of decision trees at training time and predicts the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

It also randomizes the features selected for building each tree in the ensemble, to improve diversity and reduce variance even further. It has higher predictive power than a single decision tree, but it loses interpretability.
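
The two ingredients, bootstrap samples and randomized features, can be sketched in a few lines. This is a deliberately simplified illustration and not a full Random Forest: each "tree" is a one-split stump restricted to a single randomly chosen feature, and all function names and the toy data are assumptions.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """One-split tree on a single feature: keep the threshold with the fewest errors."""
    majority = Counter(y).most_common(1)[0][0]
    # Fallback (no split): predict the majority label everywhere.
    best = (sum(label != majority for label in y), float("inf"), majority, majority)
    values = sorted({row[feature] for row in X})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [label for row, label in zip(X, y) if row[feature] <= threshold]
        right = [label for row, label in zip(X, y) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return (feature,) + best[1:]


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(X) observations with replacement.
        sample = [rng.randrange(len(X)) for _ in range(len(X))]
        # Randomize which feature this tree is allowed to split on.
        feature = rng.randrange(len(X[0]))
        forest.append(fit_stump([X[i] for i in sample], [y[i] for i in sample], feature))
    return forest


def predict_forest(forest, row):
    """Classification: return the mode of the individual trees' predictions."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


X = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5),
     (6.0, 6.0), (7.0, 6.5), (6.5, 7.0), (8.0, 8.0)]
y = ["green"] * 4 + ["blue"] * 4

forest = fit_forest(X, y)
print(predict_forest(forest, (3.0, 3.0)), predict_forest(forest, (5.5, 5.5)))
```

Each stump sees a slightly different sample and feature, so the stumps disagree on where exactly to split; the vote smooths out those differences. The cost is visible too: the final answer comes from 25 rules rather than one readable tree.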

[Optional] Comparing supervised learning algorithms
[Optional] What is Supervised Learning?