How to write a Hive User Defined Function (UDF) in Java

Here we show how to write a user-defined function (UDF) in Java and call it from Hive. You can then use the UDF in Hive SQL statements. It runs on each value you pass to it and returns a result, so you could write one to format strings or to do something far more complex. In this example, we use this … [Read more...]

What is Apache HCatalog? HCatalog Explained

Here we explain what HCatalog is and why it is useful to Hadoop programmers. Basically, HCatalog provides a consistent interface between Apache Hive, Apache Pig, and MapReduce. Since it ships with Hive, you could consider it an extension of Hive. (We have written tutorials here on Apache Pig, MapReduce, and Hive.) Why this Matters To … [Read more...]

Apache Hive Beeline Client, Import CSV File into Hive

Beeline has replaced the Hive CLI from what was formerly called HiveServer1. The server is now called HiveServer2, and the new, improved CLI is Beeline. Apache Hive says, “HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline. HiveCLI is now deprecated in favor of Beeline, as it lacks the multi-user, security, and other capabilities … [Read more...]

Graphing Spark Data with HighCharts

Here we look at how to use HighCharts with Spark. HighCharts is a charting framework written in JavaScript. It works with both static and streaming data, so you can make live charts with it. Its chart collection is a beautiful set of designs, enlarged by an annual competition that HighCharts holds. HighCharts is free for non-commercial use. It … [Read more...]

Basics of Graphing Streaming Big Data

Imagine creating a live chart that updates as data flows in. With this you could watch currency fluctuations, streaming IoT data, application performance, cybersecurity events, or other data in real time. Creating Spark Streaming data is not hard. We give an example below. But creating any graphs more elaborate than simple SQL … [Read more...]

K-means Clustering with Apache Spark

Here we show a simple example of how to use k-means clustering. We will look at crime statistics from different states in the USA to show which are the most and least dangerous. We get our data from here. The data looks like this. The columns are state, cluster, murder rate, assault, population, and rape. It already includes a cluster column … [Read more...]

Apache Spark: Working with Streams

In the last two posts we wrote, we explained how to read data streaming from Twitter into Apache Spark by way of Kafka. Here we look at a simpler example of reading a text file into Spark as a stream. We make a simple stock ticker that looks like the screen below when we run the code in Zeppelin. Working with streaming data is quite … [Read more...]

Reading Streaming Twitter feeds into Apache Spark

In part 1 of this blog post we explained how to read Tweets streaming off Twitter into Apache Kafka. Here we explain how to read that data from Kafka into Apache Spark. We broke this document into two pieces, because this second piece is considerably more complicated. Prerequisites First install Kafka as shown in Part 1 to verify that you can … [Read more...]

Working with Streaming Twitter Data Using Kafka

Here we show how to read messages streaming from Twitter and store them in Kafka. In Part 2 we will show how to retrieve those messages from Kafka and read them into Spark Streaming. Overview People use Twitter data for all kinds of business purposes, like monitoring brand awareness. Twitter, unlike Facebook, provides this data freely. So you … [Read more...]