Big Data Technologies (1/2)

Big Data Technologies (1/2)

In this lesson, you’re expected to: 
– learn the basic technology involved in storing and processing Big Data
– understand what Hadoop is

Traditional relational databases have two main limitations: 

• They are not designed to store and process extremely large volumes of data – they’re not optimized to handle petabytes of data.

• They are usually optimized to handle structured data but they are not appropriate to explore unstructured data.

The problem of data storage

Fortunately, the relation between size and cost has not increased. The size of storage has rapidly increased over the last years, but the price of the disks has also decreased significantly. Hence, the cost of storing data is not a big issue.

The speed of disks has also increased over recent years. However, network transfer speed has not increased proportionally.

The amount of data a company had to store in 2004 was approximately 200 GB, and increased up to 3 TB in 2012. This means that storage capacity has multiplied by 10.

The transfer rate in 2004 was 56.5 mb/s, while in 2012, this speed increased to 210 mb/s. Thus, speed has multiplied by four.

So the increase in capacity has been a lot larger than the increase in transfer rate. Hence, in 2004 we could read the full data in one hour while in 2012, we needed up to 4 hours.

Data Analysis

We cannot process or analyze data until the data has been read. Though the transfer rate is now faster than it was 10 years ago, access to the full data set is significantly slower, as we have lots more data. In order to read 3 TB of data, we need 4 hours.

Hence, the transfer rate is the current bottleneck.

To combat the limitations faced by a traditional relational database, there are other alternatives to managing and analyzing data.

One of these solutions is Hadoop – an open source software framework specifically built to handle large volumes of structured and semi-structured data.

Why Hadoop?
Hadoop exists to solve two main problems when working with large datasets:

• Data Storage
• Data Analysis

Big Data Technologies
Source: TechRadar: Big Data, Forrester Research, 2016
Enlarged Version:
[Optional] Top 10 Hot Big Data Technologies
Distributed Systems

When a single computer is not able to solve the task, instead of using a bigger computer, it’s a better idea to use multiple machines for the job.

There is a great quote from Grace Hopper, who is considered a pioneer of modern day computer science:

“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.”

This is what has been done. However, the task is not easy as it involves a significant amount of programming complexity since we need to keep data and processing in synchronization.

Moreover, there are other problems:
• finite bandwidth
• the amount of data that can be pushed through a pipeline is limited.

And what happens when one of these machines goes down? A partial failure of the system should proportionally affect processing total power.

How did Traditional Systems overcome these limitations?

A strategy to overcome this problem was to process the streams of incoming data filtering and aggregating it, so that we just stored relevant data for the business in the data warehouse and the remaining data (unprocessed raw data and significantly larger in size) was recorded on tapes. This approach has several limitations, as we have to rely on our previous knowledge to decide which data we will store and have access to in order to make decisions.

So what if an analyst wants to analyze the logs of the web to explore whether users navigating through different browsers exhibit different buying habits? Well if this was not planned in advance and that piece of information of the logs was not processed and stored in the databases, we will not be able to conduct this analysis.

What is Hadoop?
The Hadoop Approach

Hadoop is a ”software framework” for distributed storage, processing and analysis of big data sets using the MapReduce programming model. It is:• Scalable
• Fault-tolerant
• Open Source

On Hadoop, data is stored and distributed in the cluster. The main advantage is that data will be processed where it is stored. Hence, data does not need to be transferred through the network.

When a new file is ingested* on Hadoop, this file is divided into pieces and each piece is stored on a different node** of the cluster. Hence, when processing this file, each node will process a different chunk, and no node will have to read the entire file.

Data ingestion is the process of obtaining and importing data for immediate use or storage in a database.

** Node: A node is a point of intersection/connection within a network.

Hadoop is fault-tolerant

“Failure is the defining difference between distributed and local programming” 
– Ken Arnold, ‘CORBA’ designer *

* Common Object Request Broker Architecture (CORBA) was designed to facilitate the communication of systems that are deployed on diverse platforms. CORBA enables collaboration between systems on different operating systems, programming languages, and computing hardware.

In a distributed system, failure is inevitable. In fact, failure is part of the definition of any distributed system. When we design distributed systems, we have to design them with the expectation of failure. Failure is not a probability but a certainty.

So if the mean time of failure of a disk is approximately three years and we have a single node, we will not have to worry too much about it being fault-tolerant as a failure will rarely happen and it can be fixed when it occurs.

Now suppose I have a 100 machine cluster with 10 hard drives per computer. Thus, I have 1000 disks. In this case, you can expect a disk failure per day.

If disk failure occurs on average after a year of use, I will have almost one disk failure per day. Hence, in this situation we need a fault tolerant system, since disk failure is an inherent property of large distributed systems. However, an ideal distributed system should handle failure in a transparent way.

An ideal distributed system should have the following properties:

• Automatic
• Transparent
• Graceful
• Recoverable
• Consistent

Enlarged version:

Advantages of Hadoop

1) Horizontal scalability

The communication between nodes is minimum. If you have more data, you can add extra nodes to the cluster. If an extra node is added, at the same time, the storage capacity and processing power must be increased.

One of the main advantages for businesses is that you can start with a small cluster, and if you need to process more data, can increase the size of the cluster instead of having to buy a new one. With traditional systems, when the machine could not deal with the data, a new one needed to be bought.

2) It is inexpensive
Hadoop can be implemented with standard hardware. You do not need very expensive machines.3) MapReduce
We can process distributed data using the MapReduce paradigm. This makes it easier for programmers who just need to focus on solving business problems and not distributing the code.

4) A distributed storage file system (HDFS)
It has a file system storage that enables us to store data distributed through multiple machines.

[Optional] Hadoop – What is it and why does it matter?
Companies, products, and technologies included in the Big Data Landscape:
Enlarged version:
Jim Rohn Sứ mệnh khởi nghiệp