In this lesson, you’re expected to:
– learn the difference between a population and a sample
– understand how and when to use sampling
– discover different types of sampling techniques
Difference between a Population & Sample
A population is the full collection of data that we want to study. For example, a telco company wants to study which clients are more likely to change their cell phone. In this case, the population would be all the clients of the company.
However, if we were only interested in the behavior of clients aged between 20 and 30, our population would be all customers in that age range.
By contrast, a sample is a smaller (but hopefully representative) selection of data from a population used to determine information about that population.
In the previous example, the sample may consist of a randomly chosen 10% of customers, who are offered a new cell phone.
Sampling is useful because collecting data and information about the whole population can be expensive and time-consuming.
However, it is extremely important to make sure that the sample is representative of the whole population.
Why? Because we will use the conclusions derived from the sample to understand the behavior of the whole population.
Below are a few things we need to keep in mind before sampling:
• What is my population of interest, and to whom do I want to generalize my conclusions? All my customers or just those between 20 and 30?
• Can I sample the whole population? Do I have the data and can I process it? Maybe there is no need to sample.
• Will my sample be representative? There are three main factors that will influence it:
– size of the sample
– procedure employed for selecting the sample
– the degree of response/participation
What are the steps we have to follow when sampling?
1) Defining the population
Once we have decided on our study, we need to define the population of interest for that study. For example, all the customers of my company between 20 and 30 years old.
2) Specifying a sampling frame
This is the set of items or events that it is possible to measure. For example, I make them an offer of a new cell phone and record whether they accept or not.
3) Selecting a sampling method and determining the size of the sample
Once you’ve started working on your sample, review the process.
Note that the population from which the sample is drawn may not exactly match the population whose behavior we want to understand.
Usually there is a large overlap between these two groups, but it is not necessarily complete.
For example, returning to the telco: our population consisted of all customers aged between 20 and 30. But if we are going to send the offer of the new cell phone via email in the experiment, we will draw our sample only from customers in that age range whose email address is recorded in the company’s systems.
Selecting the most appropriate technique is not an easy task, and the decision will depend on the type of information we are analyzing and on the available resources and data.
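The gap between the population of interest and the group we can actually draw from can be made concrete with a small sketch. The `Customer` record and the five example customers below are hypothetical and invented for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Customer:
    customer_id: int
    age: int
    email: Optional[str]  # None when no address is on record


# Hypothetical customer records, invented for illustration.
customers = [
    Customer(1, 24, "a@example.com"),
    Customer(2, 27, None),
    Customer(3, 35, "c@example.com"),
    Customer(4, 22, "d@example.com"),
    Customer(5, 29, None),
]

# Population of interest: customers aged 20 to 30.
population = [c for c in customers if 20 <= c.age <= 30]

# Group we can actually draw from: those reachable by email.
reachable = [c for c in population if c.email is not None]

coverage = len(reachable) / len(population)
print(f"population={len(population)}, reachable={len(reachable)}, coverage={coverage:.0%}")
```

If the customers without a recorded email address behave differently from the rest, conclusions drawn from the reachable group may not carry over to the whole population.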
We can distinguish between two classes of techniques for sampling: probability sampling and non-probability sampling.
• Probability sampling is used when we need to ensure that each member of a population has a known, non-zero chance of being chosen.
• These are all methods of sampling that use some form of random selection.
• In some probability sampling methods, all members have the same chance of being chosen; in others, they do not.
• In simple random sampling, each member in the sampling frame is given a number, and then a kind of lottery system is used to determine which units are to be selected. It provides for the greatest number of possible samples.
• The main disadvantages are that it is not feasible for very large data, and that minority sub-groups may not be sufficiently represented in the study.
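The lottery idea can be sketched with Python's standard `random` module. The function name `simple_random_sample` and the list of 1,000 customer IDs are assumptions made for this illustration.

```python
import random


def simple_random_sample(frame, sample_size, seed=None):
    """Draw sample_size distinct units; every unit has the same chance."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(frame, sample_size)  # sampling without replacement


# Hypothetical sampling frame: customers numbered 1 to 1000.
customer_ids = list(range(1, 1001))
sample = simple_random_sample(customer_ids, 100, seed=42)

print(len(sample), "customers drawn, all distinct:", len(set(sample)) == len(sample))
```

Because nothing in the draw looks at group membership, a rare sub-group can end up with very few members in the sample, or none at all.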
• In systematic sampling, we select a random starting point and then select every kth element from there onwards. In this case, k = (population size/sample size).
• The main disadvantage is that the sample could be biased if there is a hidden periodicity in our population.
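A minimal sketch of selecting every kth element follows. The function name `systematic_sample` is hypothetical, and k is rounded down when the population size is not an exact multiple of the sample size, which is one possible convention among several.

```python
import random


def systematic_sample(frame, sample_size, seed=None):
    """Pick a random start in the first interval, then every k-th unit."""
    k = len(frame) // sample_size  # sampling interval, rounded down
    if k < 1:
        raise ValueError("sample_size cannot exceed the size of the frame")
    rng = random.Random(seed)
    start = rng.randrange(k)  # random starting point in [0, k)
    return frame[start::k][:sample_size]


# Hypothetical ordered list of 1000 customer IDs.
customer_ids = list(range(1, 1001))
sample = systematic_sample(customer_ids, 100, seed=3)

print("interval k =", len(customer_ids) // 100, "| first picks:", sample[:5])
```

If the list has a pattern that repeats every k positions (for example, records ordered so that every tenth one is a business account), the sample would contain only one kind of record.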
• In stratified sampling, the population is divided into groups (strata) and a sample is drawn from each one. For example, suppose 30% of the telco customers were men and the other 70% women. We could divide customers by sex. Next, if we want our sample size to be 20% of the population, we will choose a sample of 20% of the women stratum, and another 20% of the men stratum.
• This way we ensure all strata or groups are represented in our sample.
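The 30%/70% example can be sketched as follows. The function name `stratified_sample` and the customer tuples are hypothetical, and the same fraction is drawn from every stratum (proportional allocation), as in the example above.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=None):
    """Draw the same fraction, at random, from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # group units by stratum
    sample = []
    for members in strata.values():
        n = round(len(members) * fraction)
        sample.extend(rng.sample(members, n))
    return sample


# Hypothetical customer base: 300 men and 700 women.
customers = [("M", i) for i in range(300)] + [("F", i) for i in range(700)]
sample = stratified_sample(customers, lambda c: c[0], 0.20, seed=1)

men = sum(1 for c in sample if c[0] == "M")
women = len(sample) - men
print(f"men={men}, women={women}")
```

The sample keeps the 30/70 split of the population by construction, which a simple random draw would only match approximately.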
• In scenarios such as churn prediction, the main problem is that the classes are highly imbalanced. Usually a company has a churn rate of less than 1%. Hence, a predictive model will fail to capture the patterns of the customers who churn, since they represent an extremely low percentage of the total data.
• Oversampling is frequently used to build samples in which the rare class is better represented, for example with 40% of customers who churn and 60% who don’t.
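One common way to reach a 40%/60% split is random oversampling: the rare class is resampled with replacement until it reaches the target share. This is only one possible reading of the technique, and the function name `oversample_minority` and the 1% churn figures below are assumptions for illustration.

```python
import random


def oversample_minority(majority, minority, target_share, seed=None):
    """Resample the minority class with replacement up to target_share."""
    rng = random.Random(seed)
    # Solve m / (m + len(majority)) = target_share for m.
    needed = round(target_share * len(majority) / (1 - target_share))
    resampled = rng.choices(minority, k=needed)  # duplicates are expected
    return majority + resampled


# Hypothetical data: 9900 loyal customers and 100 churners (1% churn).
loyal = [("loyal", i) for i in range(9900)]
churners = [("churn", i) for i in range(100)]

balanced = oversample_minority(loyal, churners, target_share=0.40, seed=0)
churn_share = sum(1 for label, _ in balanced if label == "churn") / len(balanced)
print(f"{len(balanced)} rows, churn share = {churn_share:.0%}")
```

The resampled data no longer reflects the real churn rate, so any estimate of that rate has to come from the original data rather than from the balanced sample.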
• In cluster sampling, once we have randomly selected a number of clusters, there are two possibilities:
– All the members of those clusters are selected for the final sample.
– A subset of elements within those clusters is randomly selected to be included in the sample.
• The main difference between stratified and cluster sampling is that in stratified sampling all groups are represented, while in cluster sampling only the selected groups are represented.
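Both possibilities can be sketched in one function. The name `cluster_sample`, its `within_fraction` parameter, and the ten city clusters are hypothetical choices made for this illustration.

```python
import random


def cluster_sample(clusters, n_clusters, within_fraction=None, seed=None):
    """Select clusters at random, then take all or a random part of each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # pick whole clusters
    sample = []
    for name in chosen:
        members = clusters[name]
        if within_fraction is None:
            sample.extend(members)  # first possibility: keep every member
        else:
            n = max(1, round(len(members) * within_fraction))
            sample.extend(rng.sample(members, n))  # second: random subset
    return chosen, sample


# Hypothetical clusters: 10 cities with 50 customers each.
clusters = {
    f"city_{i}": [f"city_{i}_cust_{j}" for j in range(50)] for i in range(10)
}

chosen, all_members = cluster_sample(clusters, 3, seed=7)
_, subset = cluster_sample(clusters, 3, within_fraction=0.2, seed=7)
print(chosen, len(all_members), len(subset))
```

Only three of the ten cities appear in either sample, which illustrates the contrast with stratified sampling drawn above.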
Non-probability sampling methods can be defined as all sampling methods where some members of the population have no chance of being selected or where the probability of selection can’t be accurately determined.
Since the selection of elements is non-random, non-probability sampling doesn’t allow the estimation of sampling errors. Hence, although it can be easier to obtain such samples, they are not as reliable for drawing conclusions about the whole population of interest.
Suppose we’re interested in knowing the average income of US citizens. Since we cannot get data from all citizens, we decide to take a sample.
Our decision is to randomly select 10,000 phone numbers, call them, and interview the first person who answers the phone. The resulting sample would not be probabilistic, as some people may be more likely to answer the phone than others.
Unemployed people and those who telework are more likely to answer the phone, as they spend more time at home. In the same way, people without a phone will have no chance of answering the poll.
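The effect of unequal answering probabilities can be illustrated with a small calculation. Every number below (group shares, mean incomes, answering probabilities) is invented purely for illustration and is not real income data.

```python
# Hypothetical groups: (share of population, mean income, chance of answering)
groups = {
    "mostly at home": (0.30, 30_000, 0.8),
    "mostly away": (0.65, 60_000, 0.3),
    "no phone": (0.05, 15_000, 0.0),
}

# True average income across the whole population.
true_mean = sum(share * income for share, income, _ in groups.values())

# Expected average among respondents: each group is weighted by
# its population share times its probability of answering.
responding = sum(share * p for share, _, p in groups.values())
survey_mean = (
    sum(share * p * income for share, income, p in groups.values()) / responding
)

print(f"true mean = {true_mean:,.0f}, expected survey mean = {survey_mean:,.0f}")
```

With these made-up figures the survey average differs from the true average, because the groups that answer more often carry more weight in the sample than they do in the population.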