Big Data Technologies (2/2)

In this lesson, you’re expected to:
– learn about the core components of Hadoop
– use real-world examples to understand how Hadoop can solve common business problems

Hadoop has a wide ecosystem with a large variety of components. However, the core of Hadoop is made up of two basic components: HDFS and MapReduce.

– HDFS (the Hadoop Distributed File System): the component that stores the data, distributed across the cluster.
– MapReduce: the component that processes the stored data in parallel across the cluster.

1) HDFS

HDFS is a file system: a way of storing information across the cluster. It sits on top of the native file system of the operating system you are using.

It is optimized for massive amounts of data, it is fault-tolerant because data is replicated across multiple nodes, and it supports distributed processing through MapReduce.
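To make this more concrete, the sketch below uses Hadoop's Java FileSystem API to copy a local file into HDFS and then list the replication factor of each stored file. It is only an illustrative sketch: the file names, the /data/calls directory, and the assumption that core-site.xml already points fs.defaultFS at the cluster's namenode are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Connect to the default file system; this is HDFS when fs.defaultFS
        // (in core-site.xml) points at the cluster's namenode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; HDFS splits it into blocks and
        // replicates each block across several datanodes for fault tolerance.
        Path local = new Path("calls.txt");              // hypothetical local file
        Path remote = new Path("/data/calls/calls.txt"); // hypothetical HDFS path
        fs.copyFromLocalFile(local, remote);

        // List the directory and show each file's replication factor.
        for (FileStatus status : fs.listStatus(new Path("/data/calls"))) {
            System.out.println(status.getPath() + "  replication=" + status.getReplication());
        }

        fs.close();
    }
}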

[Optional] What is the Hadoop Distributed File System (HDFS)?
[Optional] How Does HDFS Work? 
Watch this 2-minute video to learn more: https://www.youtube.com/watch?v=vdkx2xasGlM

2) MapReduce

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a Map() procedure that performs filtering and sorting and a Reduce() procedure that performs a summary operation.

A MapReduce job usually splits the input dataset into independent chunks which are processed by the map tasks in a completely parallel and independent manner.

Next, the framework sorts the map outputs, which then become the input to the reduce tasks.
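To see the map, shuffle/sort, and reduce phases end to end, here is a minimal, self-contained Java sketch that simulates them in memory, without a Hadoop cluster. The toy job finds the maximum temperature recorded per year; the class name, the sample records, and the two input "splits" are all illustrative assumptions.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    public static void main(String[] args) {
        // Two "splits": independent chunks that map tasks could process in parallel.
        List<List<String>> splits = List.of(
            List.of("1949,111", "1950,22"),
            List.of("1949,78", "1950,0", "1950,-11"));

        // Map phase: each "year,temperature" record becomes an intermediate (key, value) pair.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (List<String> split : splits) {            // each split is handled independently
            for (String record : split) {
                String[] parts = record.split(",");
                intermediate.add(Map.entry(parts[0], Integer.parseInt(parts[1])));
            }
        }

        // Shuffle/sort phase: group intermediate values by key (TreeMap keeps keys sorted).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce phase: produce one summary value per key (here, the maximum temperature).
        grouped.forEach((year, temps) ->
            System.out.println(year + "\t" + temps.stream().mapToInt(Integer::intValue).max().getAsInt()));
    }
}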

[Optional] Hadoop “Hello World” Example

This MapReduce example computes the count (frequency) of each word in a given text.

Read this article to learn more:
https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/hadoop-hello-world-example/

To better understand these concepts, think about the following example.

Suppose we have a large amount of text, like all the transcribed conversations from a company's call center. The manager might be interested in knowing which words are repeated most often, in order to understand the most common customer complaints.

This represents a high volume of plain-text data that cannot easily be processed in a relational database, so the data scientists will write a MapReduce job to solve the problem. This MapReduce job is divided into two main steps: the mappers and the reducers.

1) Each mapper performs an operation on a single line of the text file. The mapper splits the line into separate words and transforms each word into a tuple* in which the first element is the word (the key) and the second element is a 1 (the value).

Thus, the input of the mappers is lines of text, and the output is key-value tuples in which the keys are the words and the values are all 1. This operation can be performed in parallel, as the task of each mapper is independent of the operations performed by the remaining mappers.

* A tuple is a finite ordered list of elements. Tuples are sequences, just like lists.

2) Next, the reducers combine the outputs of the mappers, merging them by key and summing their values.

Thus, the input of the reducers is key-value pairs in which the key is a word and the value is always 1, and the output is key-value pairs in which the key is a distinct word and the value is the number of times that word appears in the whole text. A code sketch of both steps is shown below.
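The word-count job just described corresponds to the classic WordCount example from the Apache Hadoop MapReduce tutorial (the same example covered in the article linked above). The sketch below follows that pattern: the mapper emits (word, 1) pairs, the reducer sums the values per word, and the driver wires the job together. Class names and input/output paths are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: turn each line of text into (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit (word, 1)
            }
        }
    }

    // Reduce step: for each distinct word, sum the 1s to get its frequency.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // emit (word, total count)
        }
    }

    // Driver: configure and submit the job; args[0] is the input path in HDFS,
    // args[1] is the (not yet existing) output directory.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

On a cluster, this would typically be packaged into a jar and launched with something like "hadoop jar wordcount.jar WordCount /input /output", where the paths are hypothetical HDFS locations.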

[Figure] How does MapReduce work? (Source: blog.sqlauthority.com)
[Optional] MapReduce – Distributed Work
Watch this 2-minute video to learn more: https://www.youtube.com/watch?v=LZfCPgQmeRU

Common Business Hadoopable Problems

How do we know if we need Hadoop to solve our problem? 

As we have already seen, the nature of the data that enterprises capture, store, and analyze is complex. It is not necessarily structured into the rows and columns of a table, and it comes from a wide variety of sources. In this new scenario, Hadoop is a suitable solution, as it is able to store and analyze all kinds of data, regardless of size or format.

In the past, many companies were forced to discard valuable data because storing and processing it was too expensive, or simply not feasible. Hadoop, however, runs on commodity servers, which enables companies to store and process large amounts of data at low cost.

Now, let’s look at some real-world Hadoop use cases from Cloudera (the leading provider of Hadoop-based software and services) to better understand business problems:
Source: blog.cloudera.com
1) Risk Modeling
How can banks better understand customers and markets?

Summary of the Problem
A large bank took separate data warehouses from multiple departments and combined them into a single global repository in Hadoop for analysis. The bank used the Hadoop cluster to construct a new and more accurate score of the risk in its customer portfolios. The more accurate score allowed the bank to manage its exposure better and to offer each customer better products and advice. Hadoop increased revenue and improved customer satisfaction.

The Challenge

A very large bank with several consumer lines of business needed to analyze customer activity across multiple products to predict credit risk with greater accuracy. Over the years, the bank had acquired a number of regional banks. Each of those banks had a checking and savings business, a home mortgage business, credit card offerings and other financial products. Those applications generally ran in separate silos—each used its own database and application software.

As a result, over the years the bank had built up a large number of independent systems that could not share data easily. With the economic downturn of 2008, the bank had significant exposure in its mortgage business to defaults by its borrowers. Understanding that risk required the bank to build a comprehensive picture of its customers. A customer whose direct deposits to checking had stopped, and who was buying more on credit cards, was likely to have lost a job recently. That customer was at higher risk of default on outstanding loans as a result.

The Solution

The bank set up a single Hadoop cluster containing more than a petabyte of data collected from multiple enterprise data warehouses. With all of the information in one place, the bank added new sources of data, including customer call center recordings, chat sessions, emails to the customer service desk and others.

Pattern-matching techniques recognized the same customer across the different sources, even when there were discrepancies in the stored identifying information. The bank applied techniques like text processing, sentiment analysis, graph creation, and automatic pattern matching to combine, digest, and analyze the data.

The result of this analysis is a very clear picture of a customer's financial situation, their risk of default or late payment, and their satisfaction with the bank and its services. The bank has demonstrated not just a cost reduction over the existing system, but improved revenue from better risk management and customer retention. While this application was specific to retail banking services, the techniques described (collecting and combining structured and complex data from multiple silos, and applying powerful analytics that merge the data and look for patterns) apply broadly.

A company with several lines of business often has only a fragmentary, incomplete picture of its customers; combining those views can improve revenues.

2) Customer Churn Analysis 
Why do companies really lose customers?

Summary of the Problem

A large telecommunications provider analyzed call logs and complex data from multiple sources. It used sophisticated predictive models across that data to predict the likelihood that any particular customer would leave. Hadoop helped the telco build more valuable customer relationships and reduce churn.

The Challenge 

A large mobile carrier needed to analyze multiple data sources to understand how and why customers decided to terminate their service contracts.

Were customers actually leaving, or were they merely trading one service plan for another? Were they leaving the company entirely and moving to a competitor? Were pricing, coverage gaps, or device issues a factor? What other issues were important, and how could the provider improve satisfaction and retain customers?

The Solution

The company used Hadoop to combine traditional transactional and event data with social network data. By examining call logs to see who spoke with whom, creating a graph of that social network, and analyzing it, the company was able to show that if people in the customer’s social network were leaving, then the customer was more likely to depart as well.
