Big Data for Business (1/2)

In this lesson, you’re expected to:
– understand the characteristics of Big Data
– learn about the ‘Four Vs’
– explore the different dimensions of Big Data

Defining Big Data

There is no single definition of Big Data. Hence, in this lesson we will try to explain what is usually understood by Big Data and how it can help to solve business problems.

Wikipedia defines Big Data as “an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications”.

In 2011, McKinsey defined Big Data as the following:

“datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze”

Thus, Big Data is a term that refers to the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis.

However, the relevance and importance of Big Data do not depend on its size, but on its value and on the insights that companies can derive from the data to make better strategic decisions.

[Optional] What is Big Data?
Watch this 2-minute video by Bernard Marr to learn more:
Prevalence of Big Data
Source: McKinsey Global Institute (MGI) Report, 2011
What’s changing in the realm of big data?
[Optional] Big Data: The next frontier for innovation, competition, and productivity
The Four Vs of Big Data
Volume, Velocity and Variety are the three best-known Vs of Big Data; this lesson adds a fourth, Value.
1) Volume

Companies collect feeds of data from a wide variety of sources. These sources include internal company data such as business transactions or customer relationship management (CRM) systems. They also include external data such as social media feeds or Open Data*.

Thus, the volume of data can be huge, and it may not be feasible to store and process it all with traditional systems such as SQL databases. Big Data technologies help to overcome this limit, enabling companies to store and process all of their data.

* Open Data: the idea that some data should be freely available to everyone to use and republish as they wish, without any restrictions.
2) Velocity

Every 60 seconds, retailer Walmart registers 17,000 transactions, Amazon sells products worth over $83,000, and 278,000 tweets are posted worldwide. Today, streams of data are generated at a high pace, and collecting, storing, and processing them in real time is a big challenge.
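To get a feel for what these per-minute figures mean over a full day, here is a minimal sketch in Python. It assumes, purely for illustration, that each rate stays constant around the clock; the dictionary keys and the `per_day` helper are names chosen for this example, not part of any real system.

```python
# Per-minute figures quoted in the lesson text.
# The Amazon figure is stated as "over $83,000", so it is a lower bound.
PER_MINUTE = {
    "Walmart transactions": 17_000,
    "Amazon sales (USD)": 83_000,
    "Tweets posted": 278_000,
}

MINUTES_PER_DAY = 60 * 24  # 1,440 minutes


def per_day(rate_per_minute: int) -> int:
    """Scale a per-minute rate to one day, assuming the rate is constant."""
    return rate_per_minute * MINUTES_PER_DAY


daily = {name: per_day(rate) for name, rate in PER_MINUTE.items()}

for name, total in daily.items():
    print(f"{name}: {total:,} per day")
```

Under that constant-rate assumption, 17,000 transactions per minute already adds up to more than 24 million records per day for a single retailer.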

Online Data in 60 Seconds
More recently, additional Vs have been proposed for the model, including variability (the increase in the range of values typical of a large data set), value (which addresses the need for valuation of enterprise data), and veracity (which refers to the uncertainty of data).
4) Value

This is probably the most important of the Four Vs. Up to this point we have seen that it is not easy to handle Big Data, as it is very voluminous, created at a very high speed, and has different formats.

All of these characteristics refer to the complexity of the data, but the key component is the value that can be extracted from these large datasets. We store and process the data in order to extract value, that is, to gain insights about the business that can help us make smarter decisions. However, extracting value from the data is also the most complex task, as there is no straightforward recipe.

The Four V’s of Big Data
Dimensions of Big Data
When thinking about Big Data, we often just associate the term with data that is big in size: data that occupies petabytes* instead of megabytes.

But this is a very simplistic view of Big Data – it is much more than just lots of data.

* A petabyte (PB) is equivalent to 1,000 terabytes (TB) or 1,000,000 gigabytes (GB).
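As a quick check of the footnote's arithmetic, here is a minimal sketch of the unit conversions in Python. It assumes decimal (SI) units, as the footnote does; the function names are chosen for this example only.

```python
# Decimal (SI) storage units, matching the footnote: 1 PB = 1,000 TB.
MB_PER_GB = 1_000
GB_PER_TB = 1_000
TB_PER_PB = 1_000


def petabytes_to_gigabytes(petabytes: int) -> int:
    """Convert petabytes to gigabytes using decimal units."""
    return petabytes * TB_PER_PB * GB_PER_TB


def petabytes_to_megabytes(petabytes: int) -> int:
    """Convert petabytes to megabytes using decimal units."""
    return petabytes_to_gigabytes(petabytes) * MB_PER_GB


print(f"1 PB = {petabytes_to_gigabytes(1):,} GB")
print(f"1 PB = {petabytes_to_megabytes(1):,} MB")
```

The second line shows the gap mentioned in the text: one petabyte is a billion megabytes.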
So to understand the real value and meaning of Big Data, let’s look at Cesar Hidalgo’s definition of Big Data.

Cesar Hidalgo is an Associate Professor of Media Arts and Sciences at MIT and the Director of the Macro Connections group at The MIT Media Lab.

According to Hidalgo, Big Data needs to be big in three different ways. He describes three dimensions that define Big Data:

1) Big in Size

This is the simplest criterion and the best-known one. Data has to be big in size.

So how does a bank have Big Data in size? Well, a bank has financial data from millions of customers, rather than from a small sample of a few hundred people.

2) Big in Resolution

We may have lots of data from millions of individuals but if this data is very aggregated, it will not be very meaningful and cannot be considered Big Data.

For example, suppose we have financial data from bank customers but only know their balance at the end of the year. This information is not very useful, and the insights we can derive from it will be very limited. However, a bank does have Big Data in resolution, as it records every transaction in fine detail. Each record tells you who made the transaction, at what time, what type of transaction it was, where it occurred, etc.

So if the location of each transaction is recorded as geographic coordinates, this is high resolution. However, if the transactions are aggregated by zip code, the data will have a lower resolution.
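The difference between the two resolutions can be sketched in a few lines of Python. The records, customer names, coordinates, and zip codes below are invented for illustration, and `aggregate_by_zip` is a hypothetical helper, not a description of how any bank stores its data.

```python
from collections import defaultdict

# Hypothetical high-resolution records:
# (customer, timestamp, latitude, longitude, zip_code, amount)
transactions = [
    ("alice", "2024-03-01T08:15", 40.4168, -3.7038, "28013", 4.50),
    ("alice", "2024-03-01T23:40", 40.4180, -3.7050, "28013", 32.00),
    ("bob", "2024-03-02T12:05", 40.4200, -3.7000, "28013", 12.75),
    ("bob", "2024-03-02T19:30", 40.4530, -3.6883, "28036", 60.00),
]


def aggregate_by_zip(records):
    """Collapse fine-grained records into one total per zip code.

    Who paid, when, and the exact coordinates are all discarded,
    which is what "lower resolution" means here.
    """
    totals = defaultdict(float)
    for _customer, _timestamp, _lat, _lon, zip_code, amount in records:
        totals[zip_code] += amount
    return dict(totals)


by_zip = aggregate_by_zip(transactions)

print(len(transactions), "high-resolution records")
print(len(by_zip), "zip-level totals:", by_zip)
```

Once the data is aggregated, questions such as "who pays late at night?" can no longer be answered, because the customer and timestamp fields are gone.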

3) Big in Scope

This is probably the most forgotten dimension, but also the most important one. Big in scope means that the data can yield insights that go beyond the purpose for which it was originally collected.

There are probably many sources of data that are big in resolution and size but not necessarily in scope.

To further understand this dimension, let’s use an example. 

How is banking data big in scope? Well, let’s think about the transactions made by customers. This data was originally recorded because the bank needs to keep track of each customer’s balance, so that it knows how much money there is in each account and can account for it.

However, this data is very meaningful in other ways. From it we can learn the habits of a customer, for example whether they are a runner or a party animal. A customer who tends to pay with a credit card at night could be classified as a party animal, while a customer who pays for a gym membership or sports supplements could be classified as a runner. Moreover, these records can be used to gain insights into which businesses are more profitable.

A bank can also use its credit card records or POS (Point of Sale) terminal records to analyze which businesses are bringing in less money and are therefore more likely to go bankrupt.
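The customer-habit example above can be sketched as a simple rule-based labelling in Python. Everything here is an assumption made for illustration: the sample transactions, the merchant category names, the 22:00 to 05:59 definition of "night", and the `label_customers` helper. A real bank would use far richer data and more careful methods.

```python
from datetime import datetime

# Hypothetical card transactions: (customer, ISO timestamp, merchant category)
transactions = [
    ("alice", "2024-03-01T23:40", "bar"),
    ("alice", "2024-03-02T01:10", "nightclub"),
    ("bob", "2024-03-02T07:30", "gym"),
    ("bob", "2024-03-03T18:00", "sports_supplements"),
    ("carol", "2024-03-03T12:00", "supermarket"),
]

SPORT_CATEGORIES = {"gym", "sports_supplements"}  # assumed category names


def is_night(timestamp: str) -> bool:
    """Treat 22:00 to 05:59 as night; the cut-off hours are an assumption."""
    hour = datetime.fromisoformat(timestamp).hour
    return hour >= 22 or hour < 6


def label_customers(records):
    """Label each customer by whichever signal occurs more often:
    sport-related purchases or night-time card payments."""
    counts = {}
    for customer, timestamp, category in records:
        tally = counts.setdefault(customer, {"night": 0, "sport": 0})
        if category in SPORT_CATEGORIES:
            tally["sport"] += 1
        elif is_night(timestamp):
            tally["night"] += 1

    labels = {}
    for customer, tally in counts.items():
        if tally["sport"] > tally["night"]:
            labels[customer] = "runner"
        elif tally["night"] > tally["sport"]:
            labels[customer] = "party animal"
        else:
            labels[customer] = "unclassified"
    return labels


labels = label_customers(transactions)
print(labels)
```

Note that the labels are derived from records that were originally kept only to track account balances, which is exactly what makes the data big in scope.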

[Optional] How to Transform Big Data into Knowledge
Check out the full interview with Dr. Cesar A. Hidalgo, assistant professor at the MIT Media Lab:
[Optional] 4 Ways Big Data Will Change Every Business