Hadoop Analytics 101

Apache Hadoop by itself does not do analytics, but it provides a platform and data structures upon which you can build analytics models. To do that you need to understand MapReduce, so you can transform input data into the format the analytics algorithms require. We explain that here, and then explore some analytics functions.

Hadoop was the first and most popular big data platform. Products that came later, hoping to leverage its success, were built to work with it. That includes Spark, HBase, Flink, and Cassandra. Spark, however, is really seen as a Hadoop replacement: it has what Hadoop does not, a native machine learning library (Spark ML), and it operates much faster than Hadoop because it does its processing in memory.

Regarding analytics packages that work natively with Hadoop, those are limited to Flink and Mahout. Mahout is on the way out, so you should not use it. Your best options are to use Flink, either with Hadoop or with Flink tables, or to use the Spark ML (machine learning) library with data stored in Hadoop or elsewhere, and then store the results in either Spark or Hadoop.

Analytics defined

In order to understand analytics you need to understand basic descriptive statistics (which we explain below) as well as linear algebra: matrices, vectors, and polynomials, the material taught in college science degrees. Without any understanding of that you will never understand what a K-Means classification model or even linear regression means. This is one reason regular programmers cannot always do data science. Analytics is data science.

Analytics is the application of mathematics, statistics, and machine learning (a branch of artificial intelligence) to big data. These are mathematical functions, and it is important to understand the logical reasoning behind each algorithm so you can correctly interpret the results. Otherwise you could draw incorrect conclusions.

Analytics requires a distributed scalable architecture because it uses matrices and linear algebra. The multiplication of even two large matrices can consume all the memory on a single machine. So the task to do that has to be divided into smaller tasks.
A matrix is a structure like:

[a11, a12, a13, …, a1n
 a21, a22, a23, …, a2n
 …
 am1, am2, am3, …, amn]

These matrices are coefficients to a formula the analyst hopes will solve some business or science problem. For example, the optimal price for their product p might be p = ax + by + c, where x and y are some inputs to manufacturing and sales and c is a constant.

But that is a small example. The set of variables the analyst has to work with is usually much larger. Plus the analyst often has to solve multiple equations like this at the same time. This is why we use matrices.
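To make that concrete, here is a minimal plain-Java sketch (the coefficients are made up for illustration): each row of the matrix holds the coefficients (a, b, c) of one price equation, and multiplying the matrix by the input vector (x, y, 1) evaluates all of the equations at once.

```java
public class PriceMatrix {

    // Multiply an m x n coefficient matrix by an n-element input vector.
    // Element i of the result is the value of equation i.
    static double[] multiply(double[][] m, double[] v) {
        double[] result = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            double sum = 0;
            for (int j = 0; j < v.length; j++) {
                sum += m[i][j] * v[j];
            }
            result[i] = sum;
        }
        return result;
    }

    public static void main(String[] args) {
        // Two price equations p = ax + by + c, one per row: (a, b, c).
        double[][] coefficients = {
            {2.0, 3.0, 1.0},   // p1 = 2x + 3y + 1
            {0.5, 1.0, 4.0}    // p2 = 0.5x + y + 4
        };
        // Inputs x = 10, y = 5; the trailing 1 picks up the constant c.
        double[] inputs = {10.0, 5.0, 1.0};
        double[] prices = multiply(coefficients, inputs);
        System.out.println(prices[0] + " " + prices[1]); // 36.0 14.0
    }
}
```

On a cluster, the same multiplication is split into smaller row-by-row tasks; the arithmetic per task is exactly what this loop does.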

Matrices are fed into ML algorithms using Scala, Python, R, Java, and other programming languages. The elements can be objects of any kind. For example, you can mix and match structures:


Tuple(int, float, complex number, class Object)

Array("employees": "Fred", "Joe", "John", "William")

So to use Hadoop to do analytics you have to know how to convert data in Hadoop to different data structures. That means you need to understand MapReduce.


MapReduce is divided into 2 steps: map and reduce. (Some say combine is a third step, but it is really part of the reduce step.)

You do not always need both map and reduce, especially when you just want to convert one set of input data into a format that fits an ML algorithm. Map runs over the values you feed into it and returns the same number of values, transformed into a new output format. The reduce operation is designed to return one value per key. That is a simplification, but for the most part true. (Reduce can return more than one output; for example, it does in the sample program shown below.)
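To make the two steps concrete before the real Hadoop program, here is a hypothetical plain-Java simulation of map and reduce, with no Hadoop required: map emits one (word, 1) pair per word, and reduce groups the pairs by key and sums each group.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {

    // Map step: emit one (word, 1) pair for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce step: group the pairs by key and sum the values in each group.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(map("Hello World Bye World"));
        System.out.println(counts); // {Bye=1, Hello=1, World=2}
    }
}
```

Hadoop does the same thing, except the pairs are shuffled across many machines between the two steps.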

MapReduce example program

Here is the Word Count program copied directly from the Apache Hadoop website. Below we show how to run it and explain what each section means.

First you need to create a text file and copy it to Hadoop. Below shows its contents:

hadoop fs -cat /analytics/input.txt

Hello World Bye World

The program will take each word in that line and then create these key->value maps:

(Hello, 1)
(World, 1)
(Bye, 1)
(World, 1)

And then reduce them by summing each value after grouping them by key to produce these key->value maps:

(Hello, 1)
(World, 2)
(Bye, 1)

Here is the Java code.

package com.bmc.hadoop.tutorials;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    // Text is an ordinary text field, except it is serializable by Hadoop.
    private Text word = new Text();

    // Hadoop calls the map operation for each line in the input file. This file
    // only has 1 line, but map splits the words using the StringTokenizer as a
    // first step to creating key->value pairs.
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        // This writes the output as the key->value map (word, 1).
        context.write(word, one);
      }
    }
  }

  // This reduction code is copied directly from the Hadoop source code so you
  // can read it here. The reduce writes its output as the key->value pair
  // (key, result).
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);

    // Here we tell Hadoop what classes to use for the map, combine, and reduce steps.
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

To run this you need to create the /analytics folder in Hadoop first:

hadoop fs -mkdir /analytics

Run the code like this:

yarn jar /tmp/WordCount.jar com.bmc.hadoop.tutorials.WordCount /analytics/input.txt /analytics/out.txt

Then check the output. First use hadoop fs -ls /analytics/out.txt to get the file name, as the results are saved as a part file inside the /analytics/out.txt folder:

hadoop fs -cat  /analytics/out.txt/part-r-00000

Here is the resulting count.

Bye	1
Hello	1
World	2

Descriptive statistics

In order to understand these analytics models it is necessary to understand descriptive statistics: the mean (μ, also called the average), variance (σ²), and standard deviation (σ) taught in college or high school. In statistics, the normal distribution is used to calculate the probability of an event. The taller the curve and the closer the points are together, the smaller the variance and thus the standard deviation.
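As a quick refresher, these three quantities can be sketched in a few lines of plain Java (using the population forms of variance and standard deviation):

```java
public class Stats {

    // Mean: the sum of the values divided by how many there are.
    static double mean(double[] x) {
        double sum = 0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    // Population variance: the average squared distance from the mean.
    static double variance(double[] x) {
        double mu = mean(x), sum = 0;
        for (double v : x) sum += (v - mu) * (v - mu);
        return sum / x.length;
    }

    // Standard deviation: the square root of the variance.
    static double stdDev(double[] x) {
        return Math.sqrt(variance(x));
    }

    public static void main(String[] args) {
        double[] sample = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(mean(sample));     // 5.0
        System.out.println(variance(sample)); // 4.0
        System.out.println(stdDev(sample));   // 2.0
    }
}
```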

The normal curve is a graphical presentation of probability: p(x). The probability that a value is less than or equal to x is p(x).

Linear regression

Now we discuss in general terms how some of the analytics algorithms work.

Linear regression is a predictive model. In terms of machine learning it is the simplest. Learning it first is a good idea, because its basic format and linear relationship remains the same for more complicated ML algorithms.

In its simplest form, linear regression has one independent variable x and one dependent variable y, related by y = mx + b.

Graphed, the fitted line y = mx + b is the one for which the distance from the data points to the line is shortest.

In Apache Spark, the line is calculated with the Spark ML LinearRegression algorithm.
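To see the arithmetic behind such a fit, here is a hypothetical plain-Java sketch of the closed-form least-squares solution for one variable (Spark solves the same problem, but distributed and at scale):

```java
public class LeastSquares {

    // Closed-form ordinary least squares for y = m*x + b with one variable:
    // m = sum((x - meanX)(y - meanY)) / sum((x - meanX)^2),  b = meanY - m*meanX.
    static double[] fit(double[] x, double[] y) {
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - meanX) * (y[i] - meanY);
            den += (x[i] - meanX) * (x[i] - meanX);
        }
        double m = num / den;
        double b = meanY - m * meanX;
        return new double[]{m, b};
    }

    public static void main(String[] args) {
        // Points that lie exactly on y = 2x + 1.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        double[] line = fit(x, y);
        System.out.println("m = " + line[0] + ", b = " + line[1]); // m = 2.0, b = 1.0
    }
}
```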


Logistic regression

In linear regression we look for the line that most nearly fits the data points. In logistic regression we turn that into a probability. The outcome of logistic regression, the logit, is one of two values (0, 1). This is associated with a probability function p(x): by convention, when p(x) > 0.5 then logit(p(x)) = 1, and 0 otherwise.

The logistic regression is the cumulative distribution function shown below, where e is the constant e, μ is the mean, and s is a scale parameter:

1 / (1 + e**(-(x - μ) / s))

If that looks strange, do not worry; further down we will see the familiar linear functions we have been using.

To understand that logistic regression represents the cumulative distribution under the normal curve, consider the point where x = μ. At that point x − μ = 0, so 1 / (1 + e**(-(x − μ)/s)) = 1 / (1 + e**0) = 1 / (1 + 1) = 1/2 = 0.5. That is the area under the curve to the left of the mean: there is a 50% chance of a value being less than the mean.

The logistic regression algorithm requires a LabeledPoint structure in Spark ML:

(label, features)

which we plug into the LogisticRegressionWithLBFGS algorithm. The label must be 1 or 0. The features can be a collection of any values, like shown below.

(1: {2,3,4})

The data we feed in is called the training set. In other words, it is a sampling of actual data taken from past history. This lets the computer find the equation that best fits the data, and then it can make predictions. So if:

model = LogisticRegressionWithLBFGS.train(LabeledPoint values)

Then we can use the predict function to calculate the predicted outcome, which is 1 or 0:

model.predict(1,2,3) = 1 or 0.


Logistic regression is difficult to understand, but it becomes easier if you look at it in terms of odds. People who bet on horse races and football certainly understand odds.

odds = (probability of success / probability of failure)

If you roll a die, the probability of a given number is 1/6, so the odds are (1/6) / (5/6) = 1/5. In logistic regression the outcome is either 0 or 1, so the odds are:

odds (y) = p(y=1)/ p(y=0) =  p(y=1)/ (1 - p(y=1))

Logistic regression models the logarithm of the odds. The log of the odds comes from a linear function:

log(odds(x)) = a1x1 + b1x2 + c1x3 + d

We undo the logarithm by raising both sides to e, giving the odds formula:

odds(x) = e**(a1x1 + b1x2 + c1x3 + d)
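Putting those pieces together, here is a hypothetical plain-Java sketch (the coefficients are made up, not a trained model): the linear score is the log of the odds, raising e to it gives the odds, and odds / (1 + odds) converts the odds back into a probability.

```java
public class LogisticSketch {

    // Log-odds: a linear function of the inputs (hypothetical coefficients).
    static double logOdds(double x1, double x2, double x3) {
        double a1 = 0.8, b1 = -0.5, c1 = 0.3, d = -1.0;
        return a1 * x1 + b1 * x2 + c1 * x3 + d;
    }

    // Undo the logarithm: odds = e**(log-odds).
    static double odds(double logOdds) {
        return Math.exp(logOdds);
    }

    // Convert odds back to a probability; this is the logistic function.
    static double probability(double odds) {
        return odds / (1 + odds);
    }

    // The model predicts 1 when p(x) > 0.5, i.e. when the odds exceed 1.
    static int predict(double x1, double x2, double x3) {
        return probability(odds(logOdds(x1, x2, x3))) > 0.5 ? 1 : 0;
    }

    public static void main(String[] args) {
        double z = logOdds(2.0, 1.0, 1.0); // 0.8*2 - 0.5 + 0.3 - 1.0 = 0.4
        System.out.println(probability(odds(z)));  // about 0.6, so class 1
        System.out.println(predict(0.0, 0.0, 0.0)); // 0 (log-odds = -1.0)
    }
}
```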

If that is difficult to see, consider the situation of a heart disease patient. We have a linear formula derived from years of studying such patients. We can say that the log odds of getting heart disease is some function of:

(a1 * cholesterol) + (b1 * smoking) + (c1 * genetic factors) + constant

That is a linear equation.

Support vector machines

Here we step up one level in complexity. The two algorithms we looked at were predictive, supervised models, meaning we know the input variables, like cholesterol, blood sugar, etc., and we give the algorithm a labeled training set. A Support Vector Machine (SVM) is also a supervised classification model: it learns from labeled training data how to separate data points into classes.

A classification model looks for a pattern among the data points and divides them into classes. SVM outputs a binary result: yes or no, 1 or 0. So we could use it, for example, to look at temperature and rain readings over time and determine whether we have a drought or a rainy season.

Since we are dealing with more than one simple equation, we are dealing with more than two or even three dimensions, so we have to think in terms of n dimensions. Humans cannot visualize 4, 5, 6, … dimensions, but we can visualize a two-dimensional plane, so we use that.

We use the same approach as linear regression, which is to find the line that best fits the data, but in this case we are looking for a plane, called a hyperplane. We use the same basic y = mx + b formula, but here m is a vector, so we write y = Mx + b.

We find the plane whose distance from the nearest data points is maximal rather than minimal; maximizing that margin minimizes the classification error. The data points closest to the plane are called support vectors.

In the example below, points where W**Tx + b >= 0 are above the line and points where W**Tx + b < 0 are below it. (W is a vector; the T means transpose.) Positive values belong to Class 1, negative ones to Class 2.
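That decision rule is easy to sketch in plain Java (the weights below are hypothetical, not a trained SVM):

```java
public class HyperplaneClassifier {

    // Decision rule: Class 1 if w.x + b >= 0, Class 2 otherwise.
    static int classify(double[] w, double b, double[] x) {
        double score = b;
        for (int i = 0; i < w.length; i++) {
            score += w[i] * x[i]; // the dot product (W transpose times x)
        }
        return score >= 0 ? 1 : 2;
    }

    public static void main(String[] args) {
        double[] w = {1.0, -1.0}; // hypothetical weights defining the plane
        double b = -0.5;
        System.out.println(classify(w, b, new double[]{3.0, 1.0})); // 1 (score = 1.5)
        System.out.println(classify(w, b, new double[]{1.0, 3.0})); // 2 (score = -2.5)
    }
}
```

Training an SVM means finding the w and b that put the widest possible margin between the two classes; once found, classifying a new point is just this dot product and sign check.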

Similarly, this approach of using distance to find the best line or plane is also used in K-Means clustering and other algorithms. For K-Means we find the cluster centers that minimize the distance to their points: we assign each point to the nearest center, recompute the centers, and repeat until we reach convergence, in other words until the centers stop moving.
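The K-Means loop just described can be sketched for one-dimensional points and k = 2 (a toy version, not a production implementation):

```java
import java.util.Arrays;

public class KMeansSketch {

    // One-dimensional k-means for k = 2: assign each point to the nearest
    // center, recompute each center as the mean of its points, and repeat
    // until the centers stop moving (convergence).
    static double[] cluster(double[] points, double c1, double c2) {
        while (true) {
            double sum1 = 0, sum2 = 0;
            int n1 = 0, n2 = 0;
            for (double p : points) {
                if (Math.abs(p - c1) <= Math.abs(p - c2)) { sum1 += p; n1++; }
                else { sum2 += p; n2++; }
            }
            double new1 = n1 > 0 ? sum1 / n1 : c1;
            double new2 = n2 > 0 ? sum2 / n2 : c2;
            if (new1 == c1 && new2 == c2) {
                return new double[]{c1, c2}; // converged
            }
            c1 = new1;
            c2 = new2;
        }
    }

    public static void main(String[] args) {
        // Two obvious groups: one around 1-2, one around 10-11.
        double[] points = {1.0, 1.5, 2.0, 10.0, 10.5, 11.0};
        double[] centers = cluster(points, 0.0, 5.0);
        System.out.println(Arrays.toString(centers)); // [1.5, 10.5]
    }
}
```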

So that is a basic start on how you can use Hadoop to do analytics. As you can see, this is a complex topic that requires much follow-on study. You could start with linear regression and then move on to more complex topics.

Last updated: 04/25/2017

These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

See an error or have a suggestion? Please let us know by emailing blogs@bmc.com.

About the author

Walker Rowe

Walker Rowe is an American freelance tech writer and programmer living in Tunisia. He specializes in big data, analytics, and programming languages. Find him on LinkedIn or Upwork.