Hadoop vs Kubernetes: Will K8s & Cloud Native End Hadoop?

Apache Hadoop is one of the leading solutions for distributed data analytics and data storage. However, with the introduction of other distributed computing solutions directly aimed at data analytics and general computing needs, Hadoop’s usefulness has been called into question.

There are many debates on the internet: is Hadoop still relevant? Or, is it dead altogether?

In reality, Apache Hadoop is not dead, and many organizations are still using it as a robust data analytics solution. One key indicator is that all major cloud providers are actively supporting Apache Hadoop clusters in their respective platforms.

Google Trends shows how interest in Hadoop reached its peak popularity from 2014 to 2017. After that, we see a clear decline in searches for Hadoop. However, this alone is not a good measurement of Hadoop’s usage in the current landscape. After all, Hadoop can be integrated into other platforms to form a complete analytics solution.


In this article, we will learn more about Hadoop, its usability, and whether it will be replaced by rapidly evolving technologies like Kubernetes and Cloud-Native development.

(This article is part of our Hadoop Guide. Use the right-hand menu to navigate.)

What is Hadoop?

Hadoop is an open-source framework that is used to store and process massive datasets efficiently. It is a reliable and scalable distributed computing platform that can be used on commodity hardware.

Hadoop distributes its data storage and analytics workloads across multiple nodes (computers) to handle the work in parallel. This leads to faster, highly efficient, and low-cost data analytics capabilities.

Hadoop modules

Hadoop consists of four main modules that power its functionality:

  • HDFS. Hadoop Distributed File System is a file system that can run on low-end hardware while providing better throughput than traditional file systems. Additionally, it has built-in fault tolerance and the ability to handle large datasets.
  • YARN. “Yet Another Resource Negotiator” is used for task management, scheduling jobs, and resource management of the cluster.
  • MapReduce. MapReduce is a big data processing engine that supports the parallel computation of large data sets. It is the default processing engine available on Hadoop. Currently, however, Hadoop also provides support for other engines such as Apache Tez and Apache Spark.
  • Hadoop Common. Hadoop Common provides a common set of libraries that can be used across all the other Hadoop modules.

Hadoop benefits

Now, let’s look at some top reasons behind the popularity of Apache Hadoop.

  • Processing power. Hadoop’s distributed computing model lets it handle a virtually unlimited number of concurrent tasks.
  • Data safety. Hadoop automatically replicates data blocks across nodes, so you can recover your data from another replica if a node fails.
  • Cost. Hadoop’s ability to run on commodity hardware enables organizations to easily deploy a data analytics platform using it. It also eliminates the need for expensive and specialized hardware.
  • Availability. Hadoop is designed to handle failures at the application layer—which means it provides high availability without relying on hardware.

With its flexibility and scalability, Hadoop quickly gained the favor of both individual data engineers/analysts and corporations. This flexibility extends to the types of data Hadoop can collect: structured, semi-structured, and unstructured data can all be ingested side by side.

Hadoop can then examine all these data sets and determine the usefulness of each, without first having to convert everything into a single format.

Another feature that elevates Hadoop is its storage capability.

Once a large data set is accumulated and the required data is extracted, we can simply keep the unprocessed data in Hadoop indefinitely. This makes it easy to reference older data later, and storage costs stay low because Hadoop runs on commodity hardware.


Drawbacks of Hadoop

Apache Hadoop clusters gained prominence thanks to all the above features.

However, as technology advances, new options have emerged, challenging Hadoop and even surpassing it in certain aspects. This, along with the inherent limitations of Hadoop, means it has indeed lost its market lead.

So, what are some drawbacks of Hadoop?

Inefficient for small data sets

Hadoop is designed for processing big data composed of huge data sets. It is very inefficient when processing smaller data sets, and it becomes ill-suited and cost-prohibitive when you need quick analytics on small amounts of data.

Another reason: Although Hadoop can combine, process, and transform data, it does not provide an easy way to output the necessary data. This limits the options available to business intelligence teams for visualizing and reporting on the processed data sets.

Security concerns

Hadoop ships with lax security enforcement by default and does not implement encryption at the storage or network levels. Kerberos, which is difficult to maintain in its own right, is the only authentication mechanism Hadoop officially supports.

In each Hadoop configuration, users need to manually enable security options or use third-party tools to configure secure clusters.

Lack of user friendliness

Hadoop is developed using Java, one of the leading programming languages with a large developer base. However, Java is not the best language for data analytics, and it can be complex for new users.

This can lead to complications in configuration and usage: the user must have thorough knowledge of both Java and Hadoop to properly use and debug the cluster.

Not suitable for real-time analytics

Hadoop is designed with excellent support for batch processing. However, with its limitations in processing smaller data sets and not providing native support for real-time analytics, Hadoop is ill-suited for quick real-time analytics.

Hadoop alternatives

So, what other options to Hadoop are available? While there is no single solution to replace Hadoop outright, there are newer technologies that can reduce or eliminate the need for Hadoop.

Apache Spark

Apache Spark is one solution, provided by the Apache team itself, to replace MapReduce, Hadoop’s default data processing engine. Spark is the new data processing engine developed to address the limitations of MapReduce.

The Apache project claims that Spark can run workloads up to 100 times faster than MapReduce because it supports in-memory computation. Moreover, it supports near-real-time processing by creating micro-batches of data and processing them.

The support of Spark for modern languages enables you to interact using your preferred programming languages. Spark offers excellent support for data analytics using languages such as:

  • Scala
  • Python
  • Spark SQL

(Explore our Apache Spark Guide.)

Apache Flink

Another available solution is Apache Flink. Flink is another processing engine with the same benefits as Spark. Flink offers even higher performance in some workloads as it is designed to handle stateful computation over unbounded and bounded data streams.

Will Kubernetes & cloud-native replace Hadoop?

Even with newer and faster data processing engines, Hadoop still ties users to its own Java-based tools and technologies, such as HDFS and YARN. But what if you need to integrate other tools and platforms to get the best fit for your specific data storage and analytics needs?

The solution is using Kubernetes as the orchestration engine to manage your cluster.

With the ever-growing popularity of containerized cloud-native applications, Kubernetes has become the leading orchestration platform to manage any containerized application. It offers features such as:

  • Convenient management
  • Networking
  • Scaling
  • High availability

(Explore our comprehensive Kubernetes Guide.)

Consider this scenario: you want to move to cheap cloud storage options like Amazon S3 buckets and managed data warehouses like Amazon Redshift, Google BigQuery, Panoply. This is not possible with Hadoop.

Kubernetes, meanwhile, can easily plug these storage services into its clusters so that containers can access them. Likewise, Kubernetes clusters gain practically limitless storage with reduced maintenance responsibilities, since cloud providers handle the day-to-day maintenance and availability of the data.

With storage sorted, Kubernetes can host the data processing services themselves, such as Apache Spark and other analytics frameworks.

This gives you the freedom to use any tools, frameworks, or programming languages you’re already familiar with or the one that’s most suitable for your use case—you’re no longer limited to Java.

(See exactly how containers & K8s work together.)

Portability of Kubernetes

Another factor in Kubernetes’ favor is its portability. Kubernetes can easily be configured to run across many locations and multiple cloud environments. With containerized applications, users can easily move between development and production environments, enabling data analytics in any location without major modifications.

By combining Kubernetes with rapid DevOps and CI/CD pipelines, developers can easily create, test, and deploy data analytics, ML, and AI applications virtually anywhere.

Support of Kubernetes for Serverless Computing

Kubernetes has further eliminated the need to manage infrastructure separately with the support for serverless computing. Serverless computing is a rising technology where the cloud platform automatically manages and scales the hardware resources according to the needs of the application.

Some container-native, open-source, and function-as-a-service computing platforms like fn, Apache OpenWhisk, and nuclio can be easily integrated with Kubernetes to run serverless applications—eliminating the need for technologies like Hadoop.

Some frameworks, like nuclio, are specifically aimed at automating data science pipelines with serverless functions.

With all the above-mentioned advantages, Kubernetes is gradually becoming the perfect choice for managing any big data workloads.

Hadoop handles large data sets cheaply

Like any other technology, Hadoop is also designed to address a specific need—handling large datasets efficiently using commodity hardware.

However, evolving technology trends have given rise to new requirements and use cases. Hadoop is not dead, yet other technologies, like Kubernetes and serverless computing, offer much more flexible and efficient options.

So, like any technology, it’s up to you to identify and utilize the correct technology stack for your needs.


Hadoop Interview Questions

(This article is part of our Hadoop Guide. Use the right-hand menu to navigate.)

Q: Is Hadoop a database?
A: No. Hadoop is a distributed file system with write-once, read-many semantics, not a database. But other products like Hive and HBase provide a SQL-like interface to Hadoop for storing data in RDBMS-like database structures.

Q: What commands do you use to start Hadoop?
A: start-dfs.sh and start-yarn.sh

Q: What does Apache Pig do?
A: It is a way to write MapReduce jobs using a far simpler, SQL-like syntax than using Java, which is very wordy.

Q: How do you copy a local file to the HDFS
A: hadoop fs -put filename /(hadoop directory)

Q: What is the Hadoop machine learning library called?
A: Apache Mahout.

Q: How is Spark different than Hadoop?
A: Spark stores data in memory, so it runs MapReduce-style operations much faster than Hadoop, which stores data on disk. It also has command line interfaces in Scala, Python, and R. And it includes a machine learning library, Spark ML, that is developed by the Spark project itself rather than separately, like Mahout.

Q: What do map and reduce mean in MapReduce?
A: Map takes an input data file and transforms each record into (key->value) pairs, tuples like (a,b,c,d), or another iterable structure. Reduce then takes adjacent items and iterates over them to produce one final result.

Q: What does safemode in Hadoop mean?
A: It means the datanodes are not yet ready to receive data. This usually occurs on startup.

Q: How do you take Hadoop out of safemode?
A: hdfs dfsadmin -safemode leave

Q: What is the difference between a namenode and a datanode?
A: Hadoop is a master-slave model. The namenode is the master. The slaves are the datanodes. The namenode partitions MapReduce jobs and hands off each piece to different datanodes. Datanodes are responsible for writing data to disk.

Q: What role does Yarn play in Hadoop?
A: It is a resource manager. It keeps track of available resources (memory, CPU, storage) across the cluster (meaning the machines where Hadoop is running). Each application asks the ResourceManager what resources are available, and the ResourceManager allocates them accordingly. It runs two daemons to do this: the Scheduler and the ApplicationsManager.

Q: How do you add a datanode?
A: You copy the whole Hadoop $HADOOP_HOME folder to the new server. Then you set up ssh keys so that the Hadoop user can ssh to that server without having to enter a password. Then you add the name of that server to $HADOOP_HOME/etc/hadoop/slaves. Finally, you run hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode on the new data node.

Q: How do you see what Hadoop services are running? Name them.
A: Run jps. You should see DataNode and NodeManager on the datanodes, and NameNode, SecondaryNameNode, ResourceManager, and (optionally) the JobHistoryServer on the namenode.

Q: How do you start the Hadoop Job History Server?
A: $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh --config $HADOOP_HOME/etc/hadoop start historyserver

Q: What is linear regression?
A: This is a technique used to find a function that most nearly matches a set of data points. For example, if you have one independent variable x and one dependent variable y, then linear regression will calculate y = mx + b, where m is the slope and b is the y-intercept. This is used in predictive data models and to find correlations between variables, for example whether studying more (x) increases student grades (y).

Q: What do you think “Bring the computing to the data instead of bringing the data to the computing” means?
A: This is the whole idea behind Hadoop. It means using the computing power of commodity virtual machines to process pieces of a dataset instead of having one central, powerful computer process the whole dataset in one place. This model also lets Hadoop operate over datasets of almost unlimited size.

Q: What is Apache Cassandra?
A: This is a noSQL column-oriented database that works in a ring topology. That means it has no central controlling server. It looks like a regular row-and-column RDBMS database, like MySQL, in that it supports a SQL-like syntax. But it groups data by columns for fast retrieval and writes and not rows. And when it writes an item it writes one row-column combination at a time. That means unlike RDBMS there can be rows that omit certain columns.

Q: What are the main Hadoop config files?
A: hdfs-site.xml and core-site.xml

Q: How does Hadoop replication work? What does rack aware mean?
A: The datanodes write data in data blocks, just like a regular disk drive would. Hadoop writes a copy of each block to other datanodes, depending on the replication factor. To say that datanodes are rack aware means Hadoop does not write all replicas to the same rack in the data center, where a power failure or other outage would take out both the data and its copies.

Q: What are some of the Hadoop CLI options?
A: The full list is: appendToFile cat checksum chgrp chmod chown copyFromLocal copyToLocal count cp createSnapshot deleteSnapshot df du dus expunge find get getfacl getfattr getmerge help ls lsr mkdir moveFromLocal moveToLocal mv put renameSnapshot rm rmdir rmr setfacl setfattr setrep stat tail test text touchz truncate usage

Q: What file types can Hadoop use to store its data?
A: Avro, Parquet, Sequence Files, and Plain Text

Q: What is a NameSpace?
A: It is an abstraction of a directory and file across the cluster. In other words /directory/file is a namespace that represents some file in some directory. But it is not local. It is on the Hadoop cluster, meaning it is stored across the data nodes.

Q: What does Apache Hive use a SQL database like Derby or MySQL for?
A: It stores the schema there while it stores the data in Hadoop.

Q: How can you call an external program from Hive, like a Python one?
A: Use TRANSFORM, like SELECT TRANSFORM (fields) USING 'python programName.py' AS (fields) FROM table;

Q: What is Apache Flume?
A: It is a way to write streaming data to Hadoop.

Q: What is Hadoop High Availability?
A: That means configuring a second namenode to work as a hot standby in case the primary namenode crashes.

Q: What does the fsck command do?
A: It checks for bad blocks (i.e., corrupt files) and problems with replication.

Q: What kind of security does Hadoop have? How can you add authentication?
A: Hadoop by default only has file permissions security, like a regular UNIX file system. You change permissions using chown and chmod and the regular Linux account or LDAP account. But if you want a higher level of authentication, you would enable Kerberos, which is the authentication system used by Windows and optionally used by Linux. Then the datanodes would need to authenticate to connect to other nodes.

Hadoop Clusters: An Introduction

Hadoop clusters 101

In talking about Hadoop clusters, first we need to define two terms: cluster and node. A cluster is a collection of nodes. A node is a process running on a virtual or physical machine or in a container. We say process because the same machine could be running other programs besides Hadoop.

When Hadoop is not running in cluster mode, it is said to be running in local mode. That would be suitable for, say, installing Hadoop on one machine just to learn it. When you run Hadoop in local mode it writes data to the local file system instead of HDFS (Hadoop Distributed File System).

Hadoop is a master-slave model, with one master (albeit with an optional High Availability hot standby) coordinating the role of many slaves. Yarn is the resource manager that coordinates what task runs where, keeping in mind available CPU, memory, network bandwidth, and storage.

One can scale out a Hadoop cluster, which means adding more nodes. Hadoop is said to be linearly scalable: for every node you add, you get a corresponding boost in throughput. More generally, if you have n nodes, adding one node gives you roughly (1/n) additional computing power. That type of distributed computing is a major shift from the days of scaling up a single server, where adding memory and CPUs produces only a marginal increase in throughput.

 

(This article is part of our Hadoop Guide. Use the right-hand menu to navigate.)

Datanode and Namenode

The NameNode is the Hadoop master. It coordinates with the DataNodes in the cluster when copying data or running MapReduce operations. It is this design that lets a user copy a very large file onto a Hadoop mount point like /data. Files copied to /data exist as blocks on different DataNodes in the cluster. The collection of DataNodes is what we call HDFS.

This basic idea is illustrated below.

Yarn

Apache Yarn is a part of Hadoop that can also be used outside of Hadoop as a standalone resource manager. Yarn consists of two pieces: the ResourceManager and the NodeManager. The NodeManager reports CPU, memory, disk, and network usage to the ResourceManager so that the ResourceManager can decide where to direct new tasks; the ResourceManager makes those decisions using its Scheduler and ApplicationsManager. The NodeManager then takes instructions from the Yarn scheduler and runs tasks on its node.

Adding nodes to the cluster

Adding nodes to a Hadoop cluster is as easy as adding the server name to the $HADOOP_HOME/etc/hadoop/slaves file and then starting the DataNode daemon on the new node.

Communicating between nodes

When you install Hadoop, you enable ssh and create ssh keys for the Hadoop user. This lets Hadoop communicate between the nodes using RPC (remote procedure calls) without having to enter a password. Formally, this abstraction on top of the TCP protocol is called the Client Protocol and the DataNode Protocol. The DataNodes send a heartbeat to the NameNode to let it know that they are still working.
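As a rough sketch of that setup (the hostname hadoop-slave-1 and the hadoop user are placeholders; adjust for your environment), passwordless ssh is typically configured like this:

# On the master, as the hadoop user, generate a key pair with no passphrase
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Copy the public key to each node in the cluster (hostname is a placeholder)
ssh-copy-id hadoop@hadoop-slave-1

# Verify that the master can reach the node without a password prompt
ssh hadoop@hadoop-slave-1 hostname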

Hadoop nodes configuration

Hadoop configuration is fairly simple in that you do the configuration on the master and then copy both that and the Hadoop software directly onto the data nodes, without needing to maintain a different configuration on each.

The main Hadoop configuration files are core-site.xml and hdfs-site.xml. This is where you set the port on which Hadoop files can be reached, the replication factor (i.e., the number of replicas, or copies, of data blocks to keep), the location of the FSImage (which keeps track of changes to the data files), and so on. You can also configure authentication there to add security to the Hadoop cluster, which by default has none.
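As an illustrative sketch only (the values below are placeholders, not recommendations), entries in those two files look like this:

<!-- core-site.xml: where clients find HDFS -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://0.0.0.0:9000/</value>
</property>

<!-- hdfs-site.xml: replication factor and NameNode metadata directory (placeholder path) -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///usr/local/hadoop/hdfs/namenode</value>
</property>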

Cluster management

Hadoop has a command line interface as well as an API. But there is no real built-in tool for orchestration, meaning managing and monitoring the cluster and installing new machines.

There are some options for that. One is Apache Ambari. It is used and promoted by certain Hadoop distributions, like Hortonworks.

Here is a view of the Ambari dashboard from Hortonworks:

As you can see, it offers many metrics and tools not found in the basic, rather simple, Hadoop and Yarn web interfaces.

It exposes its services as REST web APIs, so other vendors have added it to their operations platforms, such as Microsoft System Center and Teradata.

With Ambari, instead of typing stop-dfs.sh on each data node, you can use the rolling restarts feature to reboot each machine when you want to implement some kind of change. As you can imagine, if you have more than a handful of machines, doing this from the command line would be time consuming.

Ambari also helps manage more than one cluster at the same time. To build out each one, you can use the Ambari Blueprint wizard to lay out where you want NameNodes and DataNodes and to provide configuration details. This is also useful for building development or test clusters and automating those builds: you run the wizard one time, and it saves the layout so you can script the creation of new clusters through the API.

An Introduction to Hadoop Administration

Here we explain some of the most common Hadoop administrative tasks. There are many, so we only talk about some of the main ones. The reader is encouraged to consult the Apache Hadoop documentation to dig more deeply into each topic.

As you work through some admin commands and tasks, you should know that each version of Hadoop is slightly different. They tend to change some of the command script names. In this example we are using Hadoop 2.7.3.

You will need a Hadoop cluster setup to work through this material. Follow our instructions here on how to set up a cluster. It is not enough to run a local-only Hadoop installation if you want to learn some admin tasks.

(This article is part of our Hadoop Guide. Use the right-hand menu to navigate.)

Common admin tasks

Here are some of common admin tasks:

  • Monitor health of cluster
  • Add new data nodes as needed
  • Optionally turn on security
  • Optionally turn on encryption
  • Recommended, but optional, to turn on high availability
  • Optional to turn on MapReduce Job History Tracking Server
  • Fix corrupt data blocks when necessary
  • Tune performance

We discuss some of these tasks below.

Turn on security

By default Hadoop is set up with no security. To run Hadoop in secure mode, each user and service authenticates with Kerberos. Kerberos is built into Windows and is easily added to Linux.

As for Hadoop itself, the nodes use RPC (remote procedure calls) to execute commands on other servers. You can set dfs.encrypt.data.transfer and hadoop.rpc.protection to encrypt data transfers and remote procedure calls.
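For illustration (a sketch only; check the Hadoop security documentation for the related keystore and SASL settings), those two properties are set like this:

<!-- hdfs-site.xml: encrypt the block data transfer protocol -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>

<!-- core-site.xml: "privacy" adds encryption on top of authentication and integrity -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>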

To encrypt data at rest, the admin would need to set up an encryption key, an HDFS encryption zone, and a key management service such as Ranger KMS, together with the corresponding users and roles.

Hadoop web interface URLs

The most common URLs you use with Hadoop are:

NameNode http://localhost:50070
Yarn Resource Manager http://localhost:8088
MapReduce JobHistory Server http://localhost:19888

These screens are shown below.

NameNode Main Screen

Yarn Resource Manager

MapReduce Job History Server

Configure high availability

High Availability sets up two redundant NameNodes in an active/passive configuration with a hot standby. Without this, if the NameNode crashes, the cluster cannot be used until the NameNode is recovered. With HA, the administrator can fail over to the second NameNode in the case of a failure.

Note that the SecondaryNameNode that runs on the cluster master is not an HA NameNode server. The primary and secondary NameNodes work together, so the secondary cannot be used as a failover mechanism.

To set up HA, you set dfs.nameservices and dfs.ha.namenodes.[nameservice ID] in hdfs-site.xml, along with their IP addresses and ports, and mount an NFS directory between the machines so that they can share a common folder.
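A skeletal hdfs-site.xml for that setup might look like the following (the nameservice ID, NameNode IDs, hostnames, and shared edits path are all placeholders):

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/namenode-shared</value>
</property>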

You run administrative commands using the CLI:

hdfs haadmin

MapReduce job history server

The MapReduce job history server is not installed by default. The configuration, and how to start it, are shown below.

cat /usr/local/hadoop/hadoop-2.7.3//etc/hadoop/mapred-site.xml

<configuration>
<property> 
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
<property> 
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value> 
</property>
</configuration>

Start the MapReduce job history server with the following command:

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh --config 
$HADOOP_HOME/etc/hadoop start historyserver
starting historyserver, logging to 
/usr/local/hadoop/hadoop-2.7.3//logs/mapred-hadoop-historyserver-hp.out

And query it like this:

curl http://localhost:19888/ws/v1/history/info

{"historyInfo":{"startedOn":1492004745263,"hadoopVersion":"2.7.3","hadoopBuildVersion":"2.7.3 from baa91f7c6bc9cb92be5982de4719c1c8af91ccff by root source checksum 2e4ce5f957ea4db193bce3734ff29ff4","hadoopVersionBuiltOn":"2016-08-18T01:41Z"}}

Or just login to the webpage.

Add datanode

You can add a datanode without having to stop Hadoop.

The basic steps are to create the Hadoop user and then configure ssh keys with no passphrase so that the user can ssh from one server to another without entering a password. Update the /etc/hosts file on all the machines in the cluster to add the new hostname. Then you zip up the entire $HADOOP_HOME directory on the master and copy it to the new machine, into the same directory path.

Then you add the new datanode to $HADOOP_HOME/etc/hadoop/slaves.

Then run this command on the new datanode:

hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode

Now you should be able to see it show up when you print the topology:

hdfs dfsadmin -printTopology

192.168.1.83:50010 (hadoop-slave-1)
192.168.1.85:50010 (hadoop-slave-2)

Run Pig Mapreduce job

Here is a Pig script you can run to generate a MapReduce job so that you can have a job to track. (If you do not have pig installed you can refer to https://www.bmc.com/blogs/hadoop-apache-pig/)

First create this file sales.csv

Dallas,Jane,20000
Houston,Jim,75000
Houston,Bob,65000
New York,Earl,40000
Dallas,Fred,40000
Dallas,Jane,20000
Houston,Jim,75000

You can copy the file onto itself multiple times to create a very large file so you will have a job that will run for a few minutes.
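One rough way to do that from the shell (a sketch; each pass doubles the file, so raise the iteration count for a bigger file):

for i in 1 2 3 4 5 6 7 8 9 10
do
    # double the file on each pass
    cat sales.csv sales.csv > sales_tmp.csv
    mv sales_tmp.csv sales.csv
done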

Then copy it from the local file system to Hadoop.

hadoop fs -copyFromLocal sales.csv /data/

Check that it is there:

hadoop fs -cat /data/sales.csv

Run Pig. Pig with no command line options runs Pig in cluster (aka MapReduce) mode.

Paste in this script:

a = LOAD '/data/sales.csv' USING PigStorage(',') AS (shop:chararray,employee:chararray,sales:int);

DUMP a;
DESCRIBE a;

b = GROUP a BY employee;

results = FOREACH b GENERATE group AS employee, SUM(a.sales) AS total;
DUMP results;

Then you can check the different screens for job data.

Common CLI commands

Stop and start Hadoop: start-dfs.sh and start-yarn.sh
Format HDFS: $HADOOP_HOME/bin/hdfs namenode -format
Turn off safe mode: hdfs dfsadmin -safemode leave
List processes: jps

On the namenode, jps shows jobs like:

26961 RunJar
28916 SecondaryNameNode
24121 JobHistoryServer
29403 Jps
28687 NameNode
29135 ResourceManager

On a datanode:

4231 Jps
3929 DataNode
4077 NodeManager

Find missing or corrupt data blocks: hdfs fsck /

Monitoring health of nodemanagers

yarn.nodemanager.health-checker.script.path: health check script path and filename.
yarn.nodemanager.health-checker.script.opts: command line options passed to the script.
yarn.nodemanager.health-checker.interval-ms: how often the script runs.
yarn.nodemanager.health-checker.script.timeout-ms: script timeout.

Other common admin tasks

  • Set up log aggregation.
  • Configure rack awareness.
  • Configure load balancing between datanodes.
  • Upgrade to newer versions.
  • Use cacheadmin to manage the Hadoop centralized cache.
  • Take snapshots.
  • Configure user permissions and access control.

Common problems

It is not recommended to use localhost in the URL for the Hadoop file system. That causes Hadoop to bind to 127.0.0.1 instead of the machine’s routable IP address, and in Pig you will then get this error:

pig java.net.connectexception connection refused localhost:9000

So set the bind IP address to 0.0.0.0 in etc/hadoop/core-site.xml:

<property> 
<name>fs.defaultFS</name> 
<value>hdfs://0.0.0.0:9000/</value> 
</property>

WebAppProxy server

Setting up the WebAppProxy server is a security measure. You can use it to set up a proxy server between masters and slaves; it blocks users from using the Yarn web URL as an attack vector. The Yarn user has elevated privileges, which is why this is a risk. The proxy shows a warning if someone accesses it, and it strips cookies that could be used in an attack.

Where to go from here

The reader is encouraged to dig further into the topics mentioned in this doc, in particular the Other Common Admin Tasks section, as that is where you will find the tuning and maintenance tools and issues that will certainly come up as you work to maintain a production system and fix the associated problems.

Introduction to Apache Spark

Apache Spark 101

Apache Spark does the same basic thing as Hadoop, which is run calculations on data and store the results across a distributed file system. Spark has either moved ahead of or reached parity with Hadoop in terms of projects and users. A major reason for this is that Spark operates much faster than Hadoop because it processes MapReduce-style operations in memory. One recent study said half of big data consulting dollars went to Hadoop projects but that Spark had more installations; since the software is free, adoption is difficult to measure precisely.

Spark also has added Spark Streaming to give it the same ability to read streaming data as LinkedIn’s Kafka and Twitter’s Apache Storm.

Spark has items that Hadoop does not. For example, Spark has an interactive shell. That means you can walk through datasets and code one line at a time, fixing errors as you go, which is an easy way to do development. Spark has shells in Scala, Python, and R. But you can also code Spark programs in Java. There is just no REPL (read-eval-print-loop) command line interpreter for Java.

Spark also has a machine learning library, Spark ML. Data scientists writing algorithms in Python probably use scikit-learn, and R programmers use packages from CRAN. But those do not do what Spark ML does, which is work across a distributed architecture. Instead, they work only on datasets that fit on a local server, so they would not scale without limit as Spark ML does.

Let’s get started with Apache Spark by introducing some concepts and then writing code.

(This article is part of our Hadoop Guide. Use the right-hand menu to navigate.)

RDD basics

The main Spark data structures are the RDD (resilient distributed dataset) and Data Frames.

Let’s explain RDDs with an example.

Spark is written in Scala, which is a functional programming language. Programmers who are not familiar with that compact notation will find it strange and often difficult to understand. But here we will not use the functional-programming coding style in order to make the Scala code easier to understand. If you only know Python, that does not matter as the Scala syntax is simple enough. What is important are the Spark functions. Those are the same in any language.

What Scala does well is work with Java. It runs on the Java JVM and lets programmers instantiate Java objects. So you can use Java objects in your code.

First, let’s open the Spark Scala shell:

spark-shell

When you do that it starts up Spark and establishes a SparkContext, which is the basic object upon which all others are based. The SparkContext knows the details of the Spark installation and configuration, like which ports processes are running on. If you were writing stand-alone jobs instead of using the command line shell, you would instantiate org.apache.spark.SparkContext in code and have to start Spark manually.

Here we use Scala to create a List and make an RDD from that. First make a List of Ints.

var a = List(1,2,3,4,5,6)

Now we make an RDD:

var rddA = sc.parallelize(a)

Note: You can leave off the Scala semicolon in most commands. But if you get stuck in the shell, you will have to enter one to make it quit prompting for values. Do not hit Ctrl-C, as that would exit the spark-shell.

Parallelize just means to create the object across the distributed architecture so that it can be worked on in parallel. In other words, it becomes an RDD. You need to keep that in mind as items that you create in, say, alphabetical order, will get out of alphabetical order as they are worked on in parallel. This can produce some strange and unwanted results. But there are ways to handle that.

Now we run the map function. Map just means run some operation over every element in an iterable object. A List is definitely iterable. So we can multiply each element in the List by 2 like this:

var mapA = a.map(n => n*2)

Note: The notation (x => y) means declare a variable x and run operation y on it. You can make up any name for the variables. x just stands for some element in the list. If the element was an Array of 2 elements you would write ((a,b) => function(a,b)).

Spark echoes the results:

mapA: List[Int] = List(2, 4, 6, 8, 10, 12)

Now we can sum the items in the list using reduce:

var reduceA = mapA.reduce( (a,b) => a + b)

Spark answers:

reduceA: Int = 42

The whole purpose of reduce is to return one single value, unlike map, which creates a new collection. The reduce operation works on elements in pairs. So (a,b) => a + b first adds 2 + 4 = 6, then 6 + 6 = 12, then 12 + 8 = 20, and so on until we get to 42.

Printing RDDs

There are several ways to print items. Here is one:

rddA.collect.foreach(println);
1
2
3
4
5
6

The thing to notice here is the collect command. Spark is a distributed architecture; collect causes Spark to reach out to all nodes in the cluster and retrieve the data to the machine where you are running the spark-shell. If you did that with a really large dataset, it would overload the memory of that machine.

So, when debugging against a really large dataset, it is better to print just a few elements using take:

rddA.take(5).foreach(println);

Data frames and SQL and reading from a text file

Data Frames are the next main Spark data structure. Suppose we have this comma-delimited data that shows the fishing catch in kilos by boat and species for a fishing fleet:

species,vessel,kilos
mackerel,Sue,800
mackerel,Ellen,750
tuna,Mary,650
tuna,Jane,160
flounder,Sue,30
flounder,Ellen,40

Delete the first line and then read in the comma-delimited file like shown below.

Note: Spark version 2.0 adds the Databricks spark-csv module to make working with CSV files easier, including those with headers. We do not use that here because we want to illustrate basic functions.

var fishCSV = sc.textFile("/home/walker/Downloads/fish.csv").map(_.split(","));

Above we ran the map function over the collection created by textFile. We read each line and split it into an Array of strings using the _.split(",") function. The _ is a placeholder in Scala. We could have written map(l => l.split(",")) instead.

Here is the first element. Note that Data Frames use different commands to print out their results than RDDs.

fishCSV.first

res4: Array[String] = Array(mackerel, Sue, 800)

Now, fishCSV has no column names, so it cannot have a SQL schema. So create a class to contain that.

case class Catch(species: String, vessel: String, amount: Int);

Then map through the collection of Arrays and pass the 3 elements species, vessel, and kilos to the Catch constructor:

val f = fishCSV.map(w => Catch(w(0),w(1),w(2).toInt));

We have:

f.first

res7: Catch = Catch(mackerel,Sue,800)

Now create a SQLContext.

val sqlContext = new org.apache.spark.sql.SQLContext(sc);

Then we use the createDataFrame method to create a data frame from the RDD.

val catchDF = sqlContext.createDataFrame(f);

Now we can use show to display the results:

catchDF.show

species   vessel  amount
mackerel  Sue     800
mackerel  Ellen   750
tuna      Mary    650
tuna      Jane    160
flounder  Sue     30
flounder  Ellen   40

To query that with SQL first we have to registerTempTable:

catchDF.registerTempTable("catchDF")

Notice the quote marks around catchDF. Now we can query the columns just as if we were working with a relational database. We want to list which boats caught tuna:

val tunaCatch = sqlContext.sql("select vessel from catchDF where species = 'tuna'");

tunaCatch.show
vessel
Mary
Jane

Broadcasters and accumulators

When you first learn Spark or Pig or any other language used to work with big data, the first program you usually learn is the word count program. That iterates over a text file and then uses map and reduce to count the occurrences of each word.

But that calculation can be incorrect when you are running in cluster mode instead of local mode because of the distributed nature of Spark. As we showed above, you can use collect to bring the data all back to one place and put it back into the proper order. But collect is not something you would use with a large amount of data. Instead we have broadcast variables and accumulators.

A broadcast variable is a static variable that is broadcast to all the nodes, so each node has a read-only copy of some data that it needs for further calculations. The main reason for doing this is efficiency: the data is stored in serialized format to make transferring it faster. For example, if each node needs a copy of a price list, then calculate that once and send it out to each node instead of having each node build its own.

You create a broadcast variable like this:

val broadcast = sc.broadcast(Array(1,2,3));
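To use the broadcast data on the workers, you read its value field. Here is a small sketch (the price figures are made up purely for illustration) that reuses the fishing example from above:

// broadcast a made-up price-per-kilo lookup table to every node
val prices = sc.broadcast(Map("mackerel" -> 2, "tuna" -> 5))

// each worker reads the shared map through prices.value
val revenue = sc.parallelize(List(("mackerel", 800), ("tuna", 650)))
  .map { case (species, kilos) => kilos * prices.value(species) }

revenue.collect()   // Array(1600, 3250)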

An accumulator, as the name implies, gathers the results of addition operations across the cluster at the main node. This works because addition is associative and commutative (a + b = b + a), so it does not matter in what order items are added.

The accumulator is given some initial value and the nodes in the cluster update that as they are running. So it’s one way to keep track of the progress of calculations running across the cluster. Of course, that is just one use case. Also know that programmers can use accumulators on any object for which they have defined the + method.

You declare an accumulator like this:

val accum = sc.accumulator(0)
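Continuing from that declaration, here is a brief sketch of how it is used: the workers add to the accumulator inside an action, and the driver reads the total. (This uses the older sc.accumulator API shown above; newer Spark versions provide sc.longAccumulator instead.)

// each element is added to the accumulator on whichever node processes it
sc.parallelize(List(1, 2, 3, 4)).foreach(x => accum += x)

// the driver reads the combined result
accum.value   // Int = 10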

Transformations

Spark has different set operations, including flatMap, union, intersection, groupByKey, reduceByKey, and others. Below we show reduceByKey.

This is similar to the regular reduce operation, except it runs a reduce function on elements that share a common key. So the elements have to be in K,V (key, value) format. Here we have the Array e, where the key "c" appears twice with the value 1. So if we sum by key we expect to see (c,2), which we do:

var e = Array(("a",1), ("b",1), ("c",1), ("c",1));
    var er = sc.parallelize(e);
val d = er.reduceByKey((x, y) => x + y);
d.collect();

res59: Array[(String, Int)] = Array((a,1), (b,1), (c,2))

Save data

Spark is designed to be resilient. That means it will preserve data in memory even when nodes crash, which they will do when they run out of memory. Spark keeps track of how data was derived and recalculates datasets as needed so as not to lose them. You can persist data permanently to storage using:

  • SaveAsTextFile—write the data as text to local the file system or Hadoop.
  • SaveAsObjectFile—store as serialized Java objects. In other words, preserve the object as a type, e.g. java.util.Arrays, but store it in an efficient byte format so that it need not be converted back to a Scala object when read back into memory. Remember that Scala and Java are pretty much the same here, since Scala runs on the Java JVM. (Because of that, a few of these Spark commands are not available in Python.)
  • SaveAsSequenceFile—write as a Hadoop sequence file.

Persist data

Persist() or cache() will keep objects in memory, available to the node, so they do not have to be recomputed if any of the nodes crash, or if the partition is used for something else or otherwise lost.

MEMORY_ONLY—store data as Java objects in memory.

MEMORY_AND_DISK—store in memory. What does not fit save to disk.

MEMORY_ONLY_SER—store as serialized Java objects, meaning keep the data in memory in a compact byte format rather than as full Java objects.
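A short sketch of setting one of these storage levels on the fishCSV RDD from earlier:

import org.apache.spark.storage.StorageLevel

// mark the RDD to be kept in memory, spilling to disk if it does not fit
fishCSV.persist(StorageLevel.MEMORY_AND_DISK)
fishCSV.count()       // the first action computes the RDD and caches it
fishCSV.unpersist()   // release the cached data when it is no longer needed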

So those are the basic Spark concepts to get you started. As an exercise you could rewrite the Scala code here in Python, if you prefer to use Python. And for further reading you could read about Spark Streaming and Spark ML (machine learning).

An Introduction to Hive

Overview

Hive is very similar to Apache Pig. It lets you create tables and load external files into those tables using SQL. Behind the scenes it generates MapReduce jobs in Java; because Java is a very wordy language, using Pig or Hive is simpler.

Some have said that Hive is a data warehouse tool (Bluntly put, that means an RDBMS used to do analytics before Hadoop was invented.). In fact you can use Apache Sqoop to load data into Hive or Hadoop from a relational database.

 

(This article is part of our Hadoop Guide. Use the right-hand menu to navigate.)

Installation

In this document we will introduce Hive mainly by using examples. You need to install Apache Hadoop first. Then you can install MySQL to store metadata. (Or you could use Derby. Here we use MySQL). Hive will create data in Hadoop.

So first install Hadoop. Then execute these steps:

apt-get install mysql-server
apt-get install libmysql-java

Download and extract Hive to /usr/local/hive. Then in .bashrc set:

export HIVE_HOME=/usr/local/hive/apache-hive-2.0.1-bin
export PATH=$PATH:$HIVE_HOME/bin

Make this softlink:

ln -s /usr/share/java/mysql-connector-java.jar $HIVE_HOME/lib/mysql-connector-java.jar

Then open MySQL to create the schema, etc.:

mysql -u root -p
mysql> CREATE DATABASE metastore;
mysql> USE metastore;

Here use the schema that most nearly matches the version you are using. (In this tutorial we are using Hive 2.0.1, but there is no 2.0.1 schema, so we used 2.0.0.)

mysql> SOURCE /usr/local/hive/apache-hive-2.0.1-bin/scripts/metastore/upgrade/mysql/hive-schema-2.0.0.mysql.sql

mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'password';

mysql> GRANT all on *.* to 'hiveuser'@localhost identified by 'password';
mysql>  flush privileges;

Now edit $HIVE_HOME/conf/hive-site.xml:

    
<configuration>
   <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
      <description>metadata is stored in a MySQL server</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
      <description>MySQL JDBC driver class</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
      <description>user name for connecting to mysql server</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>password</value>
      <description>password for connecting to mysql server</description>
   </property>
</configuration>

If you have set up Hadoop as a cluster, you might need to run the next command. You will know that is the case if you get an error about safe mode.

$HADOOP_HOME/bin/hadoop dfsadmin -safemode leave

Also, if you have installed Spark, you can execute this to get rid of a warning message:

unset SPARK_HOME

Now run the Hive shell:

hive

Load external CSV File

Much of what you do with Hive is load external files stored in Hadoop so that you can use SQL to work with them. So here is an example.

Download this file (it is an ssh log), then open it with Google Sheets or Excel, or use sed or vi, to replace the spaces with commas so that it is comma-delimited.

Then the top of the file looks like this:

1331901012,CTHcOo3BARDOPDjYue,192.168.202.68,53633,192.168.28.254,22,failure,INBOUND,SSH-2.0-OpenSSH_5.0,SSH-1.99-Cisco-1.25,-,-,-,-,-
1331901030,CBHpSz2Zi3rdKbAvwd,192.168.202.68,35820,192.168.23.254,22,failure,INBOUND,SSH-2.0-OpenSSH_5.0,SSH-1.99-Cisco-1.25,-,-,-,-,-

Now copy the file to Hadoop:

hadoop fs -put /home/walker/Downloads/ssh.csv /data/ssh/

Notice that you put the directory name and not the name of the file in Hadoop.
Then we load the file and create the schema all in one step:

CREATE EXTERNAL TABLE IF NOT EXISTS sshlog(
sshtime STRING,
sshkey STRING,
sourceIP STRING,
socket STRING,
targetIP STRING,
port STRING,
status STRING,
direction STRING,
software STRING,
device STRING,
junk1 STRING,
junk2 STRING,
junk3 STRING,
Junk4 STRING,
Junk5 STRING )
   COMMENT 'ssh log'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    location '/data/ssh/';

Python User Defined Function (UDF)

Before we show some SQL commands (Technically they are called HQL, Hive Query Language.) we will show how to write a Python UDF:

Create the Python program ssh.py. Below we explain how it works.

import sys
import re

# convert the first three octets of a dotted IP address to binary strings
def bi(s):
    g = re.search(r'(\d+)\.(\d+)\.(\d+)\.(\d+)', s)
    if g:
        return bin(int(g.group(1))) + bin(int(g.group(2))) + bin(int(g.group(3)))

# Hive streams rows in on stdin as tab-separated fields: time, source IP, target IP
for line in sys.stdin:
    x, y, z = line.strip().split('\t')
    print x, bi(y), bi(z)

Add the file to make it available to Hive by executing this command in the Hive shell:

add file /home/walker/Downloads/ssh.py

Then run this transform operation calling the Python UDF:

SELECT TRANSFORM (sshtime, sourceip, targetip) USING 'python /home/walker/Downloads/ssh.py' as (sshtime, sourceIP, targetip) FROM sshlog;

1332017778 0b110000000b101010000b11001010 0b110000000b101010000b10101	NULL	NULL
1332017793 0b110000000b101010000b11001010 0b110000000b101010000b10101	NULL	NULL

The Python function works off stdin and stdout. We split the input into a time and then two IP address fields. Then we run a function to convert the IP address to bit format. The TRANSFORM statement means pass those three values to the UDF.

Exercise: why does it return 2 null columns?

Create a table and insert data

Now we go back and do something simpler: create a table and insert some data into it to illustrate HQL functions.

First create a table. People who know SQL will see that it is almost the same syntax.

create table students (student string, age int);

Then add some data into it:

insert into table students values('Walker', 33);
insert into table students values('Sam', 33);
insert into table students values('Sally', 33);
insert into table students values('Sue', 72);
insert into table students values('George', 56);
insert into table students values('William', 64);
insert into table students values('Ellen', 24);
insert into table students values('Jose', 72);
insert into table students values('Li', 56);
insert into table students values('Chris', 64);
insert into table students values('Ellen', 24);
insert into table students values('Sue', 72);
insert into table students values('Ricardo', 56);
insert into table students values('Wolfgang', 64);
insert into table students values('Melanie', 24);
insert into table students values('Monica', 36);

Select

Now we can run regular SQL commands over that. Remember this is not a relational database. Hadoop as you will recall does not allow updating files: only adding and deleting them. So it creates new files for every operation.

select count(*) from students;

Here we can find all students whose age is a multiple of 3 by using the modulo (remainder) function. (Hive has many math functions.)

select age,pmod(age,3) from students where pmod(age,3) = 0;

Now, without discriminating against older people, we create two new tables: one with people older than 45 and one with people 45 or younger:

create table old as select * from students where age > 45;

create table young as select * from students where age <= 45;

Join

Now we can join the two tables by showing students from the old and young tables whose names are 3 letters long:

select young.student, old.student from young join old on (length(young.student) = 3) and (length(old.student) = 3);

It responds:

OK
Sam	Sue
Sam	Sue

Maps

Hive supports maps, structs, and arrays as complex types. But it does not yet support adding data to those directly with SQL. So we show the somewhat awkward way of doing that below.

First, create a table with a map column, meaning a (key->value) column:

create table prices(product map<string,int>);

Now we use the students table we created above as a proxy for this insert operation. You can use any table for that.

insert into prices select map("abc", 1) from students limit 1;

Now you can see the data we just inserted:

select * from prices;
OK
{"abc":1}

Where to go next

From here there are many areas where you could focus your learning, as Hive has many features. For example, you can learn about partitions, the decimal data type, working with dates, using Hive on Amazon, and using Hive with Apache Spark. You could learn about Beeline, which is a newer Hive command line interface. (Beeline will replace the Hive CLI in the future.) And you can dig into architecture-level topics like SerDe, which is the Hive serializer/deserializer, and Hive file storage formats.

An Introduction to Hadoop Analytics

Hadoop Analytics 101

Apache Hadoop by itself does not do analytics. But it provides a platform and data structure upon which one can build analytics models. In order to do that one needs to understand MapReduce functions so they can create and put the input data into the format needed by the analytics algorithms. So we explain that here as well as explore some analytics functions.

Hadoop was the first and most popular big data platform. Products that came later, hoping to leverage the success of Hadoop, made their products work with it. That includes Spark, HBase, Flink, and Cassandra. However, Spark is really seen as a Hadoop replacement. It has what Hadoop does not: a native machine learning library, Spark ML. Plus it operates much faster than Hadoop since it processes data in memory.

Regarding analytics packages that work natively with Hadoop, those are limited to Flink and Mahout. Mahout is on the way out, so you should not use it. So your best options are to use Flink, either with Hadoop or with Flink tables, or to use the Spark ML (machine learning) library with data stored in Hadoop or elsewhere, and then store the results either in Spark or Hadoop.

 

(This article is part of our Hadoop Guide. Use the right-hand menu to navigate.)

Analytics defined

In order to understand analytics you need to understand basic descriptive statistics (which we explain below) as well as linear algebra: matrices, vectors, and polynomials, the material taught in college science degrees. Without any understanding of that, you will never understand what a K-Means classification model or even linear regression means. This is one reason regular programmers cannot always do data science. Analytics is data science.

Analytics is the application of mathematics, statistics, and artificial intelligence to big data. Machine learning is the branch of artificial intelligence most often used here. These are mathematical functions, and it is important to understand the logical reasoning behind each algorithm so you can correctly interpret the results. Otherwise you could draw incorrect conclusions.

Analytics requires a distributed scalable architecture because it uses matrices and linear algebra. The multiplication of even two large matrices can consume all the memory on a single machine. So the task to do that has to be divided into smaller tasks.
A matrix is a structure like:

[ a11  a12  a13  ...  a1n
  a21  a22  a23  ...  a2n
  ...
  am1  am2  am3  ...  amn ]

These matrices are coefficients to a formula the analyst hopes will solve some business or science problem. For example, the optimal price for their product p might be p = ax + by + c, where x and y are some inputs to manufacturing and sales and c is a constant.

But that is a small example. The set of variables the analyst has to work with is usually much larger. Plus the analyst often has to solve multiple equations like this at the same time. This is why we use matrices.
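As a deliberately tiny illustration (with arbitrary numbers), two such pricing equations can be folded into one matrix equation:

2x + 3y = 12
 x + 4y = 10

A = [ 2  3 ]     v = [ x ]     b = [ 12 ]
    [ 1  4 ]         [ y ]         [ 10 ]

so the whole system is written as A v = b, and solving for v solves both equations at once.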

Matrices are fed into ML algorithms using Scala, Python, R, Java, and other programming languages. These objects can be anything. For example, you can mix and match structures:

Tuple(a,b,a)

Tuple(int, float, complex number, class Object)

Array("employees": "Fred", "Joe", "John", "William")

So to use Hadoop to do analytics you have to know how to convert data in Hadoop to different data structures. That means you need to understand MapReduce.

MapReduce

MapReduce is divided into 2 steps: map and reduce. (Some say combine is a third step, but it is really part of the reduce step.)

You do not always need to do both map and reduce, especially when you just want to convert one set of input data into a format that will fit into an ML algorithm. Map runs over the values you feed into it and returns the same number of values, transformed into a new output format. The reduce operation is designed to collapse those values, typically down to one value per key. That is a simplification, but for the most part true. (Reduce can emit more than one output; for example, the sample program shown below emits one count per word.)
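As a rough analogy, here is a short Java sketch (using plain Java streams, not the Hadoop API) of the map and reduce ideas: map produces one transformed output per input, and reduce collapses all of them into a single value. The word list is made up for illustration.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceIdea {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("Hello", "World", "Bye", "World");

        // "Map": one output per input, converted to a new format (here, word length).
        List<Integer> lengths = words.stream().map(String::length).collect(Collectors.toList());
        System.out.println(lengths);   // [5, 5, 3, 5]

        // "Reduce": collapse all the mapped values into a single result.
        int total = lengths.stream().reduce(0, Integer::sum);
        System.out.println(total);     // 18
    }
}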

MapReduce example program

Here is the Word Count program copied directly from the Apache Hadoop website. Below we show how to run it and explain what each section means.

First you need to create a text file and copy it to Hadoop. Below shows its contents:

hadoop fs -cat /analytics/input.txt

Hello World Bye World

The program will take each word in that line and then create these key->value maps:

(Hello, 1)
(World, 1)
(Bye, 1)
(World, 1)

And then reduce them by summing each value after grouping them by key to produce these key->value maps:

(Hello, 1)
(World, 2)
(Bye, 1)

Here is the Java code.

package com.bmc.hadoop.tutorials;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);

// Text is an ordinary text field, except it is serializable by Hadoop.

private Text word = new Text();

// Hadoop calls the map operation for each line in the input file. This file only has 1 line,
// but it splits the words using the StringTokenizer as a first step to creating key->value
// pairs.

public void map(Object key, Text value, Context context
 ) throws IOException, InterruptedException {
 StringTokenizer itr = new StringTokenizer(value.toString());
 while (itr.hasMoreTokens()) {
word.set(itr.nextToken());

// This writes the output as the key->value map (word, 1).

context.write(word, one);
      }
    }
  }

// This reduction code is copied directly from the Hadoop source code. It is reproduced here
// so you can read it in place. This reduce writes its output as the key->value pair (key, result).

public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);

// Here we tell what class to use for mapping, combining, and reduce functions.

job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

To run this you need to create the /analytics folder in Hadoop first:

hadoop fs -mkdir /analytics

Run the code like this:

yarn jar /tmp/WordCount.jar com.bmc.hadoop.tutorials.WordCount /analytics/input.txt /analytics/out.txt

Then check the output. First use hadoop fs -ls /analytics/out.txt to get the file name, as the results are saved inside the /analytics/out.txt folder:

hadoop fs -cat  /analytics/out.txt/part-r-00000

Here is the resulting count.

Bye	1
Hello	1
World	2

Descriptive statistics

In order to understand these analytics models it is necessary to understand descriptive statistics. This is the mean (μ), variance (σ²), and standard deviation (σ) taught in college or high school. In statistics, the normal distribution is used to calculate the probability of an event. The taller the curve and the closer the points are together, the smaller the variance and thus the standard deviation.

The normal curve is a graphical presentation of probability, p(x). For example, the probability that a value is less than or equal to x is the area under the curve to the left of x.
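For reference, here is a minimal Java sketch of those descriptive statistics. The sample values are invented; the sketch computes the population mean, variance, and standard deviation in the straightforward way.

public class DescriptiveStats {
    public static void main(String[] args) {
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9};   // made-up sample

        double sum = 0;
        for (double v : values) sum += v;
        double mean = sum / values.length;            // μ

        double squaredDiffs = 0;
        for (double v : values) squaredDiffs += (v - mean) * (v - mean);
        double variance = squaredDiffs / values.length;   // σ² (population variance)
        double stdDev = Math.sqrt(variance);              // σ

        System.out.println("mean = " + mean + ", variance = " + variance + ", std dev = " + stdDev);
    }
}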

Linear regression

Now we talk in general terms about how some of the analytics algorithms work.

Linear regression is a predictive model. In terms of machine learning it is the simplest one. Learning it first is a good idea, because its basic format and linear relationship remain the same in more complicated ML algorithms.

In its simplest form, linear regression has one independent variable x and one dependent variable y, related by y = mx + b.

Graphed, the regression line y = mx + b is the line for which the total distance from the data points to the line is the shortest.

The algorithm for calculating the line in Apache Spark is:

org.apache.spark.ml.regression.LinearRegression
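Here is a rough sketch of how that class is typically used from Java. The file path and the libsvm input format are assumptions for illustration only; Spark ML expects the training data to expose a label column and a features vector column.

import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LinearRegressionSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("lr-sketch").getOrCreate();

        // libsvm is one format Spark can read directly into (label, features) rows;
        // the path is hypothetical.
        Dataset<Row> training = spark.read().format("libsvm").load("hdfs:///data/training.libsvm");

        LinearRegression lr = new LinearRegression().setMaxIter(10);
        LinearRegressionModel model = lr.fit(training);

        // The m (coefficients) and b (intercept) of y = mx + b.
        System.out.println("coefficients: " + model.coefficients());
        System.out.println("intercept: " + model.intercept());
        spark.stop();
    }
}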

Logistic regression

In a linear regression we look for the line that most nearly fits the data points. With logistic regression we turn that into a probability. The outcome of logistic regression is one of two values (0 or 1), and it is associated with a probability function p(x). By convention, when p(x) > 0.5 the predicted outcome is 1, and 0 otherwise.

The logistic regression is the cumulative distribution function shown below, where e is the constant e, μ is the mean, and s is the scale parameter:

1 / (1 + e**-((x - μ) / s))

If that looks strange do not worry as further down we will see the familiar linear functions we have been using.

To understand that logistic regression represents the cumulative distribution under the normal curve, consider the point where x = μ. At that point x - μ = 0, so 1 / (1 + e**-((x - μ) / s)) = 1 / (1 + e**-(0 / s)) = 1 / (1 + e**0) = 1 / (1 + 1) = 1/2 = 0.5. That is the area under the curve to the left of the mean. In other words, there is a 50% chance of a value being less than the mean.
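A tiny Java sketch makes the point: plugging x = μ into the cumulative logistic function gives exactly 0.5, while values far above the mean approach 1. The mean and scale values below are arbitrary.

public class LogisticCurveSketch {
    public static void main(String[] args) {
        double mu = 100, s = 15;   // hypothetical mean and scale

        System.out.println(logistic(mu, mu, s));          // 0.5: half the values fall below the mean
        System.out.println(logistic(mu + 4 * s, mu, s));  // far above the mean: close to 1
    }

    static double logistic(double x, double mu, double s) {
        return 1.0 / (1.0 + Math.exp(-((x - mu) / s)));
    }
}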

The logistic regression algorithm requires a LabeledPoint structure in Spark MLlib, which is a pair:

(label, features)

We plug those into the LogisticRegressionWithLBFGS algorithm. The labels must be 1 or 0. The features can be a collection of values, as shown below.

(1: {2,3,4})
(0:{4,5,6})

The data we feed in is called the training set. In other words, it is a sample of actual data taken from past history. This lets the computer find the equation that best fits the data, so that it can then make predictions. So if:

model = LogisticRegressionWithLBFGS(LabeledPoints of training values)

Then we can use the predict function to calculate the predicted class, which is 1 or 0:

model.predict(1,2,3) = 1 or 0.
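Here is a rough Java sketch of that flow, assuming a local Spark context and using the two toy (label: {features}) pairs above as the training set. A real training set would of course be far larger; the sketch only shows the shape of the API.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

public class LogisticRegressionSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("logit-sketch").setMaster("local[*]"));

        // Training set: each LabeledPoint is (label, features); the label must be 0 or 1.
        JavaRDD<LabeledPoint> training = sc.parallelize(Arrays.asList(
                new LabeledPoint(1.0, Vectors.dense(2, 3, 4)),
                new LabeledPoint(0.0, Vectors.dense(4, 5, 6))));

        LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
                .setNumClasses(2)
                .run(training.rdd());

        // Predict the class (0.0 or 1.0) for a new feature vector.
        System.out.println("predicted class = " + model.predict(Vectors.dense(1, 2, 3)));
        sc.stop();
    }
}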

Odds

Logistic regression is difficult to understand at first, but it becomes easier if you look at it in terms of odds. People who bet on horse races and football certainly understand odds.

odds =(probability of successes / probability of failure)

If you roll a die, the probability of getting a particular number is 1/6 and the probability of not getting it is 5/6, so the odds are (1/6)/(5/6) = 1/5. In logistic regression the outcome is either 0 or 1, so the odds are:

odds (y) = p(y=1)/ p(y=0) =  p(y=1)/ (1 - p(y=1))

Logistic regression models the logarithm of the odds (the log-odds, or logit) as a linear function:

log(odds(x)) = a1x1 + b1x2 + c1x3 + d

We undo the logarithm by raising both sides to the power of e, giving the odds formula:

odds(x) = e**(a1x1 + b1x2 + c1x3 + d)

If that is difficult to see, then consider the situation of a heart disease patient. We have a linear formula derived from years of studying such patients. We can say that the chance of getting heart disease is some function of:

(a1 * cholesterol) + (b1 * smoking) + (c1 * genetic factors) + constant

That is a linear equation.
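To make the arithmetic concrete, here is a small Java sketch that turns such a linear score (the log-odds) into odds and then into a probability. Every coefficient and input value below is made up for illustration only.

public class OddsSketch {
    public static void main(String[] args) {
        double cholesterol = 240, smoking = 1, geneticFactors = 0.5;   // hypothetical inputs
        double a1 = 0.01, b1 = 0.9, c1 = 0.6, constant = -4.0;         // hypothetical coefficients

        // The linear score is the logarithm of the odds (the logit).
        double logOdds = a1 * cholesterol + b1 * smoking + c1 * geneticFactors + constant;

        double odds = Math.exp(logOdds);          // odds = e**(log-odds)
        double probability = odds / (1 + odds);   // same as 1 / (1 + e**-logOdds)

        System.out.println("odds = " + odds + ", probability = " + probability);
    }
}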

Support vector machines

With a Support Vector Machine (SVM) we are interested in assigning a set of data points to one of two possible outcomes: -1 or 1. (If you are thinking that sounds the same as logistic regression, you would be just about correct. But there are some nuanced differences.)

How is this useful? One example is classifying cancer into two different types based upon its characteristics. Another is handwriting recognition or textual sentiment analysis, meaning asking whether, for example, a customer comment is positive or negative.

Of course an SVM would not be too useful if we could only pick between two outcomes a or b. But we can expand it to classify data among n possible outcomes by using pairwise comparisons.

To illustrate, if we have the possible outcomes a, b, c, and d, we can check for each of these four values in pairwise steps. First we look pairwise at (a,b), then (a,c), then (a,d), and so forth. Or we can look at one versus all, as in (a, (b,c,d)), (b, (a,c,d)), etc.

As with logistic and linear regression, and even neural networks, the basic approach is to find the weights m and bias b that make the formulae m·x + b = 1 and m·x + b = -1 correct for the training set x. Here m and x are vectors, and m·x is their dot product.

For example, if m is (m1,m2,m3,m4,m5) and x is (x1,x2,x3,x4,x5), then m·x + b is (m1x1 + m2x2 + m3x3 + m4x4 + m5x5) + b.
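As a quick Java sketch of that dot product (the vector values below are arbitrary):

public class DotProductSketch {
    public static void main(String[] args) {
        double[] m = {0.5, -1.0, 2.0, 0.0, 1.5};
        double[] x = {1.0, 2.0, 3.0, 4.0, 5.0};
        double b = 0.25;

        double result = b;
        for (int i = 0; i < m.length; i++) {
            result += m[i] * x[i];   // m1x1 + m2x2 + ... + m5x5
        }
        System.out.println("m·x + b = " + result);
    }
}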

For linear regression we used the least squared error approach. With SVM we do something similar: we find something called a hyperplane. This threads the needle and finds a plane or line that separates the data points such that every point falls on one side of the hyperplane or the other. That classifies the data into one category or the other.

In two-dimensional space the hyperplane is simply a line: the line m·x + b = 0. It is placed at the maximum distance between m·x + b = 1 and m·x + b = -1, which separates the data into outcomes 1 and -1.

In higher order dimensions the dots look like an indiscernible blob. You can’t easily see a curved hyperplane that separates those neatly, so we map these n dimensional points onto a higher dimension n+1 using some transformation function called a kernel. Then it looks like a flat two-dimensional space again, which is easy to visualize. Then we draw a straight line between those points. Then we collapse the points back down to a space with n dimensions to find our hyperplane. Done.

Sound simple? Maybe not but it provides a mechanical process to carefully weave a dividing line between a set of data in multiple dimensions in order to classify that into one set or another.

]]>
Introduction to Apache Pig https://www.bmc.com/blogs/hadoop-apache-pig/ Tue, 25 Apr 2017 00:00:33 +0000 https://www.bmc.com/blogs/?p=14290 Apache Pig 101 Apache Pig, developed at Yahoo, was written to make it easier to work with Hadoop. Pig lets programmers work with Hadoop datasets using a syntax that is similar to SQL. Without Pig, programmers most commonly would use Java, the language Hadoop is written in. But Java code […]]]>

Apache Pig 101

Apache Pig, developed at Yahoo, was written to make it easier to work with Hadoop. (Hadoop itself also originated at Yahoo, based on papers published by Google.) Pig lets programmers work with Hadoop datasets using a syntax that is similar to SQL. Without Pig, programmers would most commonly use Java, the language Hadoop is written in. But Java code is inherently wordy. It would be nicer to have an easier and much shorter way to do Hadoop MapReduce operations. That is what Pig does.

But Pig is not exactly SQL. SQL programmers, in fact, will find some of its data structures a little strange.

Data in Pig is represented in data structures called tuples, with all the other Pig data structures being some variation on that. In its most basic form, a tuple is a comma delimited set of values:

(1,2,3,4,5)
(1,6,7,8,9)

And when you join two tuples, in this case on the first element, it represents the data like this:

(1,2,3,4,5,1,6,7,8,9)

Or this:

(1, {(1,2,3,4,5), (1,6,7,8,9)})

In the first case Pig has joined all the elements of two tuples into one. In the second it has put the join criteria in the first element and created a bag in the second. A bag is a collection of tuples. And individual elements are called atoms. Pig also supports maps in the format (key#value).

These odd structures leave the programmer scratching their head wondering how to unwind all of that so they can query individual atoms. We explain some of those operations below.

 

(This article is part of our Hadoop Guide. Use the right-hand menu to navigate.)

Using Pig and how it works

What Pig does is run MapReduce operations across datasets. MapReduce is the fundamental concept behind Hadoop and big data in general, but it means something quite different in Hadoop than in, for example, Apache Spark or the Scala programming language. In Hadoop, the map operation means to split datasets into pieces and work on those pieces in parallel, and reduce means to put them all back together to deliver the desired dataset. In Spark and Scala, map means to run some operation on every element of a list, and reduce means to calculate a single final value from those results.

You start Pig in local mode using:

pig -x local

Instead of just Pig:

pig

Which causes it to run in cluster (aka MapReduce) mode. Local mode simulates the distributed architecture on a single machine. In cluster mode, MapReduce uses Apache Yarn to run jobs on the cluster (i.e., a network of machines) and stores the resulting data in HDFS (Hadoop Distributed File System).

Common mistakes

Pig, you could say, has some quirks that make it difficult to debug errors, plus the people who maintain Pig keep changing the language slightly. It does not usually give simple error messages, but shows a lengthy, hard-to-read Java stack trace in most cases. (It will save those as a tmp* file in the current directory.) So, before we start looking at some code examples, let's look at some common mistakes that might cause you problems and frustration right up front:

  • You need to check what port HDFS is running on in order to open files there. Look at the
    fs.default.name property in $HADOOP_HOME/etc/hadoop/core-site.xml (e.g., hdfs://localhost:9000)
    to get the port number.
  • Be careful copying and pasting curly left and right quote marks from Google Docs or another word processor into the Pig shell. Pig will throw an error unless you use the proper single quote ' (ASCII 39, hex 0x27).
  • Either run Pig as root or spend time sorting through permissions issues.
  • Don't press Ctrl-C when Pig prompts for more input after an incomplete command; that will exit the shell. Just type "," or ";" until it abandons the current command. Sometimes that can take quite a few tries.
  • Type commands in a text editor and paste them in Pig as Pig does not let you edit previous commands easily. For example you cannot use the up cursor and then edit the previous line if it wraps around. You can see the history of commands by typing history.
  • If you have loaded Hadoop and Pig on your laptop, in order to learn it, run pig in local mode by typing pig -x local if you get any java.net.ConnectException errors. That means Yarn or some service is probably not running or configured correctly.
  • Even though we called Pig easy, it can be frustrating. For example, in writing this document we found that store is a reserved keyword and cannot be used as a column name. It took some time to figure that out while writing the sales report below, because Pig did not give any friendly error message like "store is a reserved keyword."

Load a file

Suppose we have a data file sales.csv like this, with sales by employee by shop. (As we just said, we cannot use store for a column name as it is a reserved word, so we use shop.)


Shop,Employee,Sales
Dallas,Sam,30000
Dallas,Fred,40000
Dallas,Jane,20000
Houston,Jim,75000
Houston,Bob,65000
New York,Earl,40000

We can use Pig to calculate sales by shop. First delete the first line of the file, which contains the header; otherwise it will become a row in our data set. Then copy the file to the Hadoop Distributed File System like this:

hdfs dfs -put sales.csv  /user/hadoop/

Now we load it to dataset a. In Pig, datasets are called relations. Here we tell it that the input file is comma-delimited and we assign a schema to the data set using AS. We also give the field types: chararray and int.

a = LOAD 'hdfs://localhost:9000/user/hadoop/sales.csv' USING PigStorage(',') AS (shop:chararray,employee:chararray,sales:int);

Then we have these tuples, which we can see by typing:

dump a

(Dallas,Sam,30000)
(Dallas,Fred,40000)
(Dallas,Jane,20000)
(Houston,Jim,75000)
(Houston,Bob,65000)
(New York,Earl,40000)

Note that we could have left off the schema in the relation. In that case you would refer to fields by their relative position: $0, $1, $2, … Note also that Pig is said to be lazy. That means it does not actually run the MapReduce job until the result is needed, such as when printing out the data with dump.

We ask Pig to show us the schema of the data set using describe:

describe a

a: {shop: chararray,employee: chararray,sales: int}

Now we group the data by shop.

b = GROUP a BY shop;

Then we run a group operation, like count, sum, or average:

c = FOREACH b GENERATE group as shop, SUM (a.sales);

This gives us sales total by each shop:

(Dallas,90000)
(Houston,140000)
(New York,40000)

Notice that we referred to the sales column as a.sales even after we created the b relation.

ForEach

Pig lets you define user-defined functions (UDFs) to run over elements in a tuple in the FOREACH construct. You can code those in Java, Python, Ruby, and other languages. And you can use the so-called Piggy Bank functions that other people have written and contributed, or use DataFu. Pig does not have a lot of built-in functions, which is why people are adding their own.

But it does simple arithmetic. For example, here we multiply each sales figure by 2 creating the new dataset b. Notice that we also dropped one column and only took two fields from the original dataset a to create the dataset b.

b = FOREACH a GENERATE shop, sales*2;

Flatten

Suppose we load another dataset:

a = LOAD 'hdfs://localhost:9000/user/hadoop/election/health.csv'  USING PigStorage(',') As (procedure, provider, name, address, city, state, zipcode, stateName, discharges, aveCharges, avePayments, aveMedicarePayments);

That took that comma-delimited file and made this simple tuple structure.

describe a
a: {procedure: bytearray,provider: bytearray,name: bytearray,address: bytearray,city: bytearray,state: bytearray,zipcode: bytearray,stateName: bytearray,discharges:
bytearray,aveCharges: bytearray,avePayments: bytearray,aveMedicarePayments: bytearray}

Notice also that it made each field of type bytearray, which is the default if you do not define that explicitly.

Now we run a function: STRSPLIT, which generates two values. Here we tell it to split the aveCharges field on the "$" sign (hex 24). That field is a currency string such as $99999, and we cannot do math with that, so we split off the $ sign. STRSPLIT returns a tuple with a field on the left and one on the right. In this case the field on the left is empty, since the $ sign is the first character, and the field on the right is the numeric part of the string.

b = FOREACH a GENERATE (procedure),  STRSPLIT(aveCharges, '\\u0024');

Now we have a tuple that itself contains tuples, plus the field procedure.

dump b
((948 - SIGNS & SYMPTOMS W/O MCC,$34774.21),(,34774.21))

So we flatten the second tuple:

c = FOREACH b GENERATE procedure, FLATTEN($1);

This generates a flat tuple of elements, which is easier to work with:

(948 - SIGNS & SYMPTOMS W/O MCC,,15042.00)

Filter

Going back to our original dataset:

(Dallas,Sam,30000)
(Dallas,Fred,40000)
(Dallas,Jane,20000)
(Houston,Jim,75000)
(Houston,Bob,65000)
(New York,Earl,40000)

We select only employees who have sold more than $40,000:

x = filter a by sales > 40000;

Yields:

(Houston,Jim,75000)
(Houston,Bob,65000)

Join and other set operations

There is no intersection operation in Pig, but there is join. There are also union, cogroup, cross, and a few others. Here, for example, is how join works. Suppose we have a list of students and the class each is taking, and then a list of classes and what each costs.

students = LOAD '/root/Downloads/people.txt' USING PigStorage(',') AS (name,class);

(John,German)
(Jane,German)
(Bob,French)

class = LOAD '/root/Downloads/class.txt' USING PigStorage(',') AS (class,fee);

(German,$100)
(French,$200)

Then join them by some element that is common to both tuples.

tuition = join students by class, class by class;

(Bob,French,French,$200)
(Jane,German,German,$100)
(John,German,German,$100)

That shows the class and the cost all in one structure.

Data types

Pig has a small set of data types: int, long, float, double, boolean, chararray (string), bytearray, datetime, and the complex types tuple, bag, and map. Often it is necessary to cast a data type, like this: (int) x, or to declare its type, like this: x:int.

Save relations

If you shut down the Pig shell (use quit or Ctrl-C to do that) you lose your datasets. But Pig saves the command history so that you can recreate your work. To save your results permanently use:

store x into 'someName';

The log shown on the screen will say something like:

Successfully stored 2 records in: "file:///root/Downloads/someName"

Which you can view like this:

cat someName

Houston    Jim    75000
Houston    Bob    65000

Which when we load back into Pig is a tuple again:

(Houston,Jim,75000)
(Houston,Bob,65000)

So Pig is a great tool for doing ETL (extract, transform, load) operations on Hadoop data. Pig programs are much shorter than Java code, and its basic principles are not complicated to learn. Simplicity, in fact, is what Yahoo was looking for when it created the language for its own in-house use.

]]>
Using Hadoop with Apache Cassandra https://www.bmc.com/blogs/hadoop-cassandra/ Tue, 25 Apr 2017 00:00:17 +0000 https://www.bmc.com/blogs/?p=14313 Overview of Cassandra Cassandra is a NoSQL open source database. It was developed at Facebook to handle its need to process enormous amounts of data. To say that it is NoSQL does not mean it is unstructured. Data in Cassandra is stored in the familiar row-and-column datasets of a regular SQL database. But there are no […]]]>

Overview of Cassandra

Cassandra is a NoSQL open source database. It was developed at Facebook to handle its need to process enormous amounts of data.

To say that it is NoSQL does not mean it is unstructured. Data in Cassandra is stored in the familiar row-and-column datasets of a regular SQL database. But there are no relations between the tables. And you can query and write data in Cassandra using CQL (Cassandra Query Language), which is very similar to regular SQL.

But just because it supports SQL-style syntax does not mean you can do all SQL operations on it. In particular there are no JOIN or GROUP operations, or anything else that would require extensive disk searching and calculation. Instead, you are supposed to store data in Cassandra the same way that you would like it presented. That shift in thinking is a complete 180-degree turnaround from what people have traditionally been taught about RDBMS (relational database management systems).

 

(This article is part of our Hadoop Guide. Use the right-hand menu to navigate.)

Not-Normal

To illustrate what we mean, we need to discuss what it means to normalize data. Data in Cassandra is supposed to be not-normal and flattened. Let’s illustrate that.

Suppose we have this student data:

student number, name, city, state, zipcode

Someone designing an Oracle database would make two tables out of that:

student number, name, zipcode
zipcode, city, state

The common element between the two tables is zipcode. When you know the zipcode where someone lives, you know the city and the state. So why should you repeat that data on every single row? You don't. You make the data normal. But not with Cassandra.

Cassandra is not concerned with using lots of disk space, which is one reason you don't normalize. (Yet its compression algorithm can reduce storage by as much as 80%.) Instead Cassandra is all about speed and scale.

Architecture

There is no central hub or master-slave topology with Cassandra. Instead it is designed as a ring of nodes, with every node having the same role. The nodes communicate with each other in what is called gossip. Nodes can be added and taken down as needed. The user sets the replication level to indicate how many extra copies of the data to keep to maintain redundancy.

What makes Cassandra different from a regular RDBMS also is that one has to keep in mind how it physically stores and sorts data in order to properly use and fully understand it. That means understanding the CommitLog, Memtable, SSTable, Partitions, and Nodes.

To use an analogy, think of how waves come ashore at the beach. One large wave comes in followed by several smaller ones. Then one giant wave comes ashore and shifts everything around. You can think of the waves as Cassandra memory and the beach sand as Cassandra permanent disk storage.

Cassandra, like Hadoop, is designed to use low-cost commodity storage to deliver a distributed architecture. That means using the hard drives that are attached to the virtual and physical machines in the data center instead of some kind of storage array. Cassandra data is stored on disk as SSTables. Writes are first appended to a CommitLog, so that Cassandra can keep track of what changes it needs to apply, and are cached in memory structures called Memtables. This cache builds up in size until it comes crashing ashore like a wave as Cassandra flushes its changes to disk. This is why writes are very fast, while reads may have to consult both the Memtables and several SSTables.

There are several of these Memtables in memory at any one moment, so the structure of the underlying physical SSTables will differ from the Memtables at any given time. SSTables are written in waves, much as Linux writes pages of memory out to swap when memory fills up. That is different from a regular RDBMS, which updates a table each time there is an INSERT or other operation.

Partitions and Nodes

Now, think of a primary key on a database table. The primary key is some unique value coming from one or more fields. In Cassandra the first of these fields denotes the partition key. The other fields in the primary key indicate how data is sorted within that partition. Partitions indicate where data is physically stored (i.e., the node).

For example, you might have this data:

Primary Key
Vehicle ID Make
1 Ford
2 Ford
3 Chevrolet
4 Chevrolet
5 Chevrolet

Vehicle ID is the partition key. Make is a clustering column. This makes data retrieval very efficient if all the Ford vehicles are stored next to each other as it is sorted that way. In this example the vehicle ID is unique so it is not clear how that storage mechanism helps with efficiency. But consider that you can have a composite partition key. So suppose the vehicle ID is really the plant where the vehicle was made plus some number. So we would have all these vehicles made at the same plant and painted the same color stored close together to speed retrieval:

Plant Vehicle ID Color
Mexico 1 red
Mexico 1 red
USA 1 blue

Now, let’s do some hands-on work with Cassandra to further illustrate these concepts. First we install the software.

Installation

Here is the installation on CentOS. One point to notice is that the Apache Cassandra project makes its software available as a download and publishes it to Ubuntu repositories, but not to a Yum repository. So DataStax provides one.

Note: The company DataStax has written most of the Cassandra documentation available on the internet. They used to make their own distribution of Cassandra but no longer do that. Now someone needs to step into the void and finish the documentation, as the official Cassandra documentation is filled with plenty of To-Do comments noting which pages still need to be completed. In fact, the project is even looking for people who want to help finish the documentation. So use the DataStax documentation for now.

First create a repo file with this content:

cat /etc/yum.repos.d/datastax.repo
[datastax] 
name = DataStax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0

Then install these two items:

yum install dsc30
yum install cassandra30-tools

Now open the Cassandra shell:

cqlsh
 
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.0.9 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
cqlsh>

Now let's create a table and put some data in it.

KeySpace

The first object to create is a KeySpace. This is the container that controls replication. Here we make a KeySpace called Library, as we are going to make a table to keep track of which books a person has checked out of the library in our little example.

The class and replication_factor determine how data is replicated. There are many options for that, as there are for partition dispersal. Here we tell it to make 3 copies of each piece of data.

CREATE KEYSPACE Library
      WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

Tables

Now we create two tables: book and patron.

CREATE TABLE Library.book (       
ISBN text, 
copy int, 
title text,  
PRIMARY KEY (ISBN, copy)
 );

CREATE TABLE  Library.patron (      
ssn int PRIMARY KEY,  
checkedOut set <text> 
);

There are some items to notice here:

  • We use the field types text and int. Cassandra has the usual column types plus allows user-defined ones.
  • The first table has 2 fields in the primary key. The second table has 1.
  • The second table has a set collection column. A set is a collection of unique elements, like {1,2,3}. Cassandra also supports a list, whose elements do not need to be unique, like {1,2,2,3} and a map, where the key has to be unique, like {key->value}.
  • Other than that the syntax looks exactly like regular Oracle SQL.

Add Data

Let’s add some data. The commands look just like regular SQL. The table name is prefixed by the Keyspace, just like in a regular SQL database the table is denoted database.table.

INSERT INTO  Library.book (ISBN, copy, title) VALUES('1234',1, 'Bible');
INSERT INTO  Library.book (ISBN, copy, title) VALUES('1234',2, 'Bible');
INSERT INTO  Library.book (ISBN, copy, title) VALUES('1234',3, 'Bible');
INSERT INTO  Library.book (ISBN, copy, title) VALUES('5678',1, 'Koran');
INSERT INTO  Library.book (ISBN, copy, title) VALUES('5678',2, 'Koran');

Now list the values:

select * from Library.book;
 

 isbn | copy | title
------+------+-------
 5678 |    1 | Koran
 5678 |    2 | Koran
 1234 |    1 | Bible
 1234 |    2 | Bible
 1234 |    3 | Bible

Next we add data to the patron table. The books that the library patron has checked out are stored as a set: {'1234','5678'}.

In a normalized database, the books and the library patron would be kept in separate tables. But here we flatten everything into one structure to avoid having to do SQL JOINs and other operations that would take time. So even though we put details about the books in the book table, we could have added the title, page count, etc. to the patron table too, in perhaps a tuple or other data structure. Again, that is completely the opposite of what programmers have traditionally been taught when they design database tables.

So add some data and print it out:

INSERT INTO Library.patron (ssn, checkedOut) values (123,{'1234','5678'});
cqlsh> select * from Library.patron;
 

 ssn | checkedout
-----+------------------
 123 | {'1234', '5678'}


Now, Cassandra is all about performance, so you cannot query fields in the collection column without first making an index:

create index on Library.patron (checkedOut);

Now we can show which patrons have checked out book number 1234 with a query. Note that we use the contains operator.

select ssn from Library.patron where checkedOut contains '1234';
 
 ssn
-----
 123

This is an overview of Cassandra to get you started. Next you might want to investigate some of the APIs, as there are Cassandra drivers for many programming languages. You could also read use cases; because there are so many big data databases now, it is helpful to see which type people have used in which situations.

]]>
An Introduction to Hadoop Architecture https://www.bmc.com/blogs/hadoop-architecture/ Tue, 25 Apr 2017 00:00:02 +0000 https://www.bmc.com/blogs/?p=14315 Overview Hadoop is a distributed file system and batch processing system for running MapReduce jobs. That means it is designed to store data in local storage across a network of commodity machines, i.e., the same PCs used to run virtual machines in a data center. MapReduce is actually two programs. Map means to take items […]]]>

Overview

Hadoop is a distributed file system and batch processing system for running MapReduce jobs. That means it is designed to store data in local storage across a network of commodity machines, i.e., the same PCs used to run virtual machines in a data center. MapReduce is actually two programs. Map means to take items like a string from a csv file and run an operation over every line in the file, like to split it into a list of fields. Those become (key->value) pairs. Reduce groups these (key->value) pairs and runs an operation to, for example, concatenate them into one string or sum them like (key->sum).

Reduce works with operations that are associative (and, in practice, commutative). Associativity is the principle of mathematics that (a + b) + c = a + (b + c); commutativity is that a + b = b + a. And this works on more than numbers: any programming object can implement addition and multiplication methods with these properties. (Subtraction and division do not qualify; 1/2 <> 2/1.) These properties are a requirement of a parallel processing system, because Hadoop will get items out of order as it divides them into separate chunks to work on them.
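Here is a small Java sketch of why that matters: summing partial chunks in any order gives the same total, so a reducer can safely combine partial results produced by different nodes. The numbers are arbitrary.

import java.util.Arrays;
import java.util.List;

public class AssociativitySketch {
    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(4, 8, 15, 16, 23, 42);

        // Pretend one node sums the first chunk and another node sums the second chunk.
        int chunk1 = values.subList(0, 3).stream().mapToInt(Integer::intValue).sum();   // 27
        int chunk2 = values.subList(3, 6).stream().mapToInt(Integer::intValue).sum();   // 81

        // The reducer can combine the partial sums in either order and get the same answer.
        System.out.println(chunk1 + chunk2);   // 108
        System.out.println(chunk2 + chunk1);   // 108
        System.out.println(values.stream().mapToInt(Integer::intValue).sum());   // 108
    }
}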

Hadoop is designed to be fault tolerant. You tell it how many times to replicate data; then, when a DataNode crashes, data is not lost.

Hadoop uses a master-slave architecture. The basic premise of its design is to bring the computing to the data instead of the data to the computing. That makes sense: it stores data files that are too large to fit on one server across multiple servers. Then, when it does map and reduce operations, it further divides those and lets each node do the computing. So each node is a computer and not just a disk drive with no computing ability.

 

(This article is part of our Hadoop Guide. Use the right-hand menu to navigate.)

Architecture diagram

Here are the main components of Hadoop.

  • Namenode—controls operation of the data jobs.
  • Datanode—this writes data in blocks to local storage. And it replicates data blocks to other datanodes. DataNodes are also rack-aware. You would not want to replicate all your data to the same rack of servers, as an outage there would cause you to lose all your data.
  • SecondaryNameNode—despite the name, this does not take over when the primary NameNode goes offline. Instead it periodically merges the NameNode's edit log into the filesystem image checkpoint.
  • JobTracker—sends MapReduce jobs to nodes in the cluster.
  • TaskTracker—accepts tasks from the Job Tracker.
  • Yarn—runs the Yarn components ResourceManager and NodeManager. This is a resource manager that can also run as a stand-alone component to provide other applications the ability to run in a distributed architecture. For example you can use Apache Spark with Yarn. You could also write your own program to use Yarn. But that is complicated.
  • Client Application—this is whatever program you have written or some other client like Apache Pig. Apache Pig is an easy-to-use shell that takes SQL-like commands and translates them to Java MapReduce programs and runs them on Hadoop.
  • Application Master—runs shell commands in a container as directed by Yarn.

Cluster versus single node

When you first install Hadoop, such as to learn it, it runs in single-node mode. But in production you would set it up to run in cluster mode, meaning data nodes are assigned to run on different machines. The whole set of machines is called the cluster. A Hadoop cluster can scale immensely to store petabytes of data.

You can see the status of your cluster here:

http://localhost:50070/dfshealth.html#tab-datanode

MapReduce example: Calculate e

Here we show a sample Java MapReduce program that we run against a Hadoop cluster. We will calculate the value of the mathematical constant e.

e is the sum of the infinite series Σ n = 0 to ∞ of (1 / n!). Which is:

e = 1 + (1 / 1) + (1 / (1 * 2)) + (1 / (1 * 2 * 3)) + (1 / (1 * 2 * 3 * 4)) + …

The further out we calculate n the closer we get to the true value of e.
So we list these 9 values of n in a text file in.txt:

in.txt
1
2
3
4
5
6
7
8
9

In the map operation we will create these key->value pairs:

(x, 1/1!)
(x, 1/2!)
…
(x, 1/9!)

We use the string x as the key for each pair so that the reduce step will collapse them all to one key (x, e).

Then, in the reduce step, we compute the running sum Σ (1 / n!) and add 1 to the result to yield our approximation of e.

First we copy the file in.txt to the Hadoop file system. (You first need to format HDFS and then create the directory.)

hadoop fs -put /home/walker/Downloads/in.txt /data/in.txt

Below is the source code for CalculateE.java: In order to compile and run it you need to:

export  HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
hadoop com.sun.tools.javac.Main CalculateE.java
jar cf e.jar CalculateE*.class
hadoop jar e.jar CalculateE /data/in.txt /data/out.txt

Code:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CalculateE {

public static class Factorial extends Mapper<Object, Text, Text, DoubleWritable> {
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      int n = Integer.parseInt(value.toString());
      double f = factorial(n);
      double x = (1 / f);
      DoubleWritable y = new DoubleWritable(x);
      Text ky = new Text("x");
      System.out.println("key=" + ky + " y= " + y);
      context.write(ky, y);
    }
  }

  public static int factorial(int f) {
    return ((f == 0) ? 1 : f * factorial(f - 1));
  }

  public static class DoubleSumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private DoubleWritable x = new DoubleWritable(0);

    public void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
      double sum = 0;
      for (DoubleWritable val : values) {
        sum += val.get();
        System.out.println("val=" + val + " sum=" + sum);
      }
      sum += 1;
      System.out.println("key=" + key.toString() + " e = " + sum);
      x.set(sum);
      context.write(key, x);
    }
  }

  public static void main(String[] args) throws Exception {
    Path inputPath = new Path(args[0]);
    Path outputPath = new Path(args[1]);

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "calculate e");

    FileInputFormat.setInputPaths(job, inputPath);
    FileOutputFormat.setOutputPath(job, outputPath);

    job.setJarByClass(CalculateE.class);
    job.setMapperClass(Factorial.class);
    job.setReducerClass(DoubleSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

When the program runs it saves the results in the folder /data/out.txt in the file part-r-00000. You can see the final results like this:

hadoop fs -ls /data/out.txt
hadoop fs -cat /data/out.txt/part-r-00000
 x 2.7182815255731922

That is pretty close to the actual value of e, 2.71828. To help debug the program as you write it, you can look at the stdout log, or you can turn on the job history tracking server and look at it with a browser. The stdout output includes many Hadoop messages, including our debug printouts:

key=x y= 1.0
key=x y= 0.5
key=x y= 0.16666666666666666
key=x y= 0.041666666666666664
key=x y= 0.008333333333333333
key=x y= 0.001388888888888889
key=x y= 1.984126984126984E-4
key=x y= 2.48015873015873E-5
key=x y= 2.7557319223985893E-6


val=2.7557319223985893E-6 sum=2.7557319223985893E-6
val=2.48015873015873E-5 sum=2.755731922398589E-5
val=1.984126984126984E-4 sum=2.259700176366843E-4
val=0.001388888888888889 sum=0.0016148589065255732
val=0.008333333333333333 sum=0.009948192239858907
val=0.041666666666666664 sum=0.05161485890652557
val=0.16666666666666666 sum=0.21828152557319222
val=0.5 sum=0.7182815255731922
val=1.0 sum=1.7182815255731922
key=x e = 2.7182815255731922

Hadoop storage mechanisms

Hadoop can store its data in multiple file formats, mainly so that it can work with different cloud vendors and products. Here are some:

Hadoop—these have the hdfs:// file prefix in their name. For example, hdfs://masterServer:9000/folder/file.txt is a valid file name.

S3—Amazon S3.  The file prefix is s3n://

file—file:// causes Hadoop to use the local file system.

Hadoop also supports Windows Azure Storage Blobs (WASB), MapR, FTP, and others.

Writing to data files

Hadoop is not a database. That means there is no random access to data and you cannot insert rows into tables or lines in the middle of files. Instead Hadoop can only write and delete files, although you can truncate and append to them, but that is not commonly done.

Hadoop as a platform for other products

There are plenty of systems that make Hadoop easier to use and to provide a SQL-like interface. For example, Pig, Hive, and HBase.

Many other products use Hadoop for part of their infrastructure too. This includes Pig, Hive, HBase, Phoenix, Spark, ZooKeeper, Cloudera Impala, Flume, Oozie, and Storm.

And then there are plenty of products that have written Hadoop connectors to let their product read and write data there, like ElasticSearch.

Hadoop processes and properties

Type jps to see what processes are running. It should list NameNode and SecondaryNameNode on the master and the backup master. DataNodes run on the data nodes.

When you first install it there is no need to change any config. But there are many configuration options to tune it and set up a cluster. Here are a few.

Process or property | Port commonly used | Config file | Property name or notes
NameNode | 50070 | hdfs-site.xml | dfs.namenode.http-address
SecondaryNameNode | 50090 | hdfs-site.xml | dfs.namenode.secondary.http-address
fs.defaultFS | 9000 | core-site.xml | The URL used for file access, like hdfs://master:9000.
dfs.namenode.name.dir | – | hdfs-site.xml | Where the NameNode stores its metadata, e.g., file:///usr/local/hadoop
dfs.datanode.data.dir | – | hdfs-site.xml | Where DataNodes store data blocks, e.g., file:///usr/local/hadoop
dfs.replication | – | hdfs-site.xml | Sets the replication value. 3 is commonly used, meaning Hadoop keeps 3 copies of each block it writes.

Hadoop analytics

Hadoop does not do analytics, contrary to popular belief: there is no linear regression, linear algebra, k-means clustering, decision trees, or other data science tooling built into Hadoop. Instead it does only the extract and transformation steps of analytics.

For data science analytics you need to use Spark ML (machine learning library), scikit-learn for Python, or CRAN packages for the R language. But only the first one runs on a distributed architecture; for the other two you would have to copy the input data to a single file on a single server. There is also the Mahout analytics platform, but its authors say they are not going to develop it any more.

]]>