Using Hive Advanced User Defined Functions with Generic and Complex Data Types

Previously we showed how to write user defined functions that can be called from Hive. You can write these in Java or Scala. (Python does not work for UDFs per se; instead, you can use Python scripts with the Hive TRANSFORM operation.)

Programs that extend org.apache.hadoop.hive.ql.exec.UDF are for primitive data types, such as int and string. If you want to process complex types you need to use org.apache.hadoop.hive.ql.udf.generic.GenericUDF. Complex types are array, map, struct, and uniontype.

Generic functions extend org.apache.hadoop.hive.ql.udf.generic.GenericUDF and implement the three methods shown below.

class MapUpper extends GenericUDF {

  override def initialize(args: Array[ObjectInspector]): ObjectInspector = {
  }

  override def getDisplayString(arg0: Array[String]): String = { return "silly me"; }

  override def evaluate(args: Array[DeferredObject]): Object = {}
}

This is the same as the simple UDF code, except there are two additional functions: initialize and getDisplayString. Those set up an ObjectInspector and provide the string Hive displays if something goes wrong.

initialize looks at the arguments passed from the Hive SQL statement to the function. There you check the argument count and types. The evaluate function then works with those typeless arguments, each of which is wrapped in an org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject.

As you can see from the Scala code above, evaluate returns a value of type Object, meaning there is no type declaration and no way for the compiler to catch errors; any type problems will only show up at runtime.

The initialize function also returns an ObjectInspector that tells Hive what type evaluate will return.

Create Some Hive Map Data

We do not write a complete code example here. Instead we explain how you would set up to write a GenericUDF with a Map data type and give the general code outline above.

First, we create some data of Hive Map type. Run Hive and then execute:

create table students (student map<string,string>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '-' MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';

This creates a table with one column: a map.

That will then let you parse a line of text like this:

name:Walker-class:algebra-grade:B-teacher:Newton

Then you can load that into Hive like this:

load data inpath '/home/walker/Documents/hive/students.txt' into table students;

Running a select will then produce this output:

select * from students;
OK
{"name":"Walker","class":"algebra","grade":"B","teacher":"Newton"}

As you can see, each column is a (key->value) map.

Note that loading a file like that is essentially the only way to get data into a Map column. The Hive documentation makes clear that you cannot add values to a Map using SQL:

“Hive does not support literals for complex types (array, map, struct, union), so it is not possible to use them in INSERT INTO…VALUES clauses. This means that the user cannot insert data into a complex datatype column using the INSERT INTO…VALUES clause.”

Run Program in Hive

To run a program like this in Hive you first need to compile it, making the Hive and Hadoop jar files available by setting the classpath:

export CLASSPATH=/usr/local/hive/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar:/usr/hadoop/hadoop-2.8.1/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.8.1.jar:/home/walker/Documents/bmc/hadoop-common-2.8.1.jar

All of those are contained in Hive and Hadoop lib folders, except for hadoop-common, which you download from Maven Central.

Add Jar to Hive

After you have written and compiled your program, you put it in a jar file. Then in Hive you make it available with the commands below, where MapUpper is the name of the example class we use here:

add jar /home/walker/Documents/bmc/udf/target/scala-2.12/mapupper_2.12-1.0.jar;

create temporary function MapUpper as 'MapUpper';

Then you can run this command to execute the MapUpper function against the student column in the students table.

select MapUpper(student) from students;

This will run some operation on the keys or values and return a new map. Or it could return a primitive type if that is what you need.
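
To make that concrete, here is a minimal Java sketch of what a MapUpper GenericUDF could look like. This is an illustration rather than the exact code behind the example above: it assumes a single map<string,string> argument, upper-cases the values, and lives in the default package so that create temporary function MapUpper as 'MapUpper' still matches.

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Illustrative sketch: takes a map<string,string> and returns it with upper-cased values.
public class MapUpper extends GenericUDF {

  private MapObjectInspector mapInspector;

  @Override
  public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    // Check the argument count and type, and remember the inspector for evaluate().
    if (args.length != 1 || !(args[0] instanceof MapObjectInspector)) {
      throw new UDFArgumentException("MapUpper takes a single map argument");
    }
    mapInspector = (MapObjectInspector) args[0];
    // Tell Hive that we return a map<string,string>.
    return ObjectInspectorFactory.getStandardMapObjectInspector(
        PrimitiveObjectInspectorFactory.javaStringObjectInspector,
        PrimitiveObjectInspectorFactory.javaStringObjectInspector);
  }

  @Override
  public Object evaluate(DeferredObject[] args) throws HiveException {
    Object raw = args[0].get();
    if (raw == null) {
      return null;
    }
    // getMap() hands back the entries in a type-erased form.
    Map<String, String> result = new HashMap<String, String>();
    for (Map.Entry<?, ?> e : mapInspector.getMap(raw).entrySet()) {
      result.put(e.getKey().toString(), e.getValue().toString().toUpperCase());
    }
    return result;
  }

  @Override
  public String getDisplayString(String[] children) {
    return "MapUpper(" + (children.length > 0 ? children[0] : "") + ")";
  }
}

Compile it against the hive-exec jar from the classpath above, put it in a jar, and register it with the add jar and create temporary function commands already shown.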

How to write a Hive User Defined Function (UDF) in Java

Here we show how to write a user defined function (UDF) in Java and call it from Hive. You can then use the UDF in Hive SQL statements. It runs over whatever element you send it and returns a result. So you might write a function to format strings or do something far more complex.

In this example, we use this package:

com.example.hive.udf

A UDF that extends org.apache.hadoop.hive.ql.exec.UDF works with Hive primitives, meaning a string, integer, or other basic field type. It is very simple: you only have to write one function, evaluate(). In the example below we return a string, but you can return any primitive, such as:

public int evaluate();
public int evaluate(int a);
public double evaluate(int a, double b);
public String evaluate(String a, int b, Text c);
public Text evaluate(String a);
public String evaluate(List<Integer> a);

If you want to use complex Hive types, meaning maps, arrays, structs, or unions, you use:

org.apache.hadoop.hive.ql.udf.generic.GenericUDF

There you have to implement two additional methods:

public ObjectInspector initialize(ObjectInspector[] arguments)
public String getDisplayString(String[] children)

And if you want to write user defined aggregation functions, you use:

org.apache.hadoop.hive.ql.exec.UDAF

Aggregation functions are those that run over sets of data. For example, average runs over a set of data.

Technically, the code for aggregate functions is not much more complicated to write. However, the execution model is more complex, as this is a type of MapReduce function. It runs in parallel in a cluster, with pieces of the calculation coming together at different times: Hive calls your iterate() function once per row, combines partial results with terminatePartial() and merge(), and produces the final answer with terminate().
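
To show the shape of that contract, here is a hypothetical sketch of a UDAF that computes a mean using the classic UDAF/UDAFEvaluator API; the class, fields, and method bodies are our own illustration, not code from this article:

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

// Illustrative only: a mean (average) aggregation using the classic UDAF API.
public final class Mean extends UDAF {

  // Partial result shipped between map and reduce tasks.
  public static class PartialResult {
    double sum;
    long count;
  }

  public static class MeanEvaluator implements UDAFEvaluator {
    private PartialResult partial;

    // Called to reset state before an aggregation starts.
    public void init() {
      partial = null;
    }

    // Called once for every row in the group.
    public boolean iterate(Double value) {
      if (value == null) {
        return true;
      }
      if (partial == null) {
        partial = new PartialResult();
      }
      partial.sum += value;
      partial.count++;
      return true;
    }

    // Hive may ask for a partial result to combine with results from other tasks.
    public PartialResult terminatePartial() {
      return partial;
    }

    // Combine a partial result computed elsewhere into this one.
    public boolean merge(PartialResult other) {
      if (other == null) {
        return true;
      }
      if (partial == null) {
        partial = new PartialResult();
      }
      partial.sum += other.sum;
      partial.count += other.count;
      return true;
    }

    // Called at the end to produce the final answer.
    public Double terminate() {
      return partial == null ? null : partial.sum / partial.count;
    }
  }
}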

Sample Java Program

First, install Hive following these instructions.

Here we use hive as the CLI, although Beeline is the newer one that the Hive developers want users to switch to. Beeline includes all the same hive commands; it is only architecturally different. I prefer hive.

(Note: For the example below, it is easier to run everything as root. Otherwise you will get stuck on permissions issues and will spend time sorting those out. For example, the Hive process might not be able to read the Hadoop data because of file permissions. Adding further complexity, the userid you are using might be different than the user who is running Hive. You could, of course, work through those issues if you want.)

First, we need some data. We will use the Google daily year-to-date stock price. We will use this data in more than one blog post. Download it from here.

Shorten the long filename downloaded to google.csv. Delete the first line (headers) and the last two (includes summary info). Then you will have data like this:

"Feb 16, 2017","842.17","838.50","842.69","837.26","1.01M","0.58"
"Feb 15, 2017","837.32","838.81","841.77","836.22","1.36M","-0.32"
"Feb 14, 2017","840.03","839.77","842.00","835.83","1.36M","0.13"
"Feb 13, 2017","838.96","837.70","841.74","836.25","1.30M","0.49"
"Feb 10, 2017","834.85","832.95","837.15","830.51","1.42M","0.58"
"Feb 09, 2017","830.06","831.73","831.98","826.50","1.19M","0.02"
"Feb 08, 2017","829.88","830.53","834.25","825.11","1.30M","0.08"

As you can see, those quote marks are not helpful when working with numbers. So we will write a simple program to remove them.

First, this is Hive, so it will look to Hadoop for the data and not the local file system. So copy it to Hadoop.

hadoop fs -mkdir /data/stocks
hadoop fs -put /home/walker/Documents/bmc/google.csv /data/stocks

Here I have:

HADOOP_HOME=/usr/hadoop/hadoop-2.8.1
HIVE_HOME=/usr/local/hive/apache-hive-2.3.0-bin

We need jar files from there plus hadoop-common-2.8.1.jar, which you can download here.

So download that jar, then set the classpath:

export CLASSPATH=/usr/local/hive/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar:/usr/hadoop/hadoop-2.8.1/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.8.1.jar:/home/walker/Documents/bmc/hadoop-common-2.8.1.jar

Now, you do not need Maven or anything complicated like that. Just pick a folder where you want to write your Java file. Since we put the file in this package:

package com.example.hive.udf;

Java will expect that the .class files be in this folder structure:

com/example/hive/udf

So compile the code, copy the resulting .class file to the package-name folder, then create the JAR file:

javac Quotes.java
mv Quotes.class com/example/hive/udf/
jar cvf Quotes.jar com/example/hive/udf/Quotes.class

Code

Here is the code:

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Quotes extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) { return null; }
    // Strip a double quote at the start and/or end of the field.
    return new Text(s.toString().replaceAll("^\"|\"$", ""));
  }
}
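
If you want to sanity-check that regular expression outside of Hive first, a throwaway main like the one below (purely illustrative, not part of the article's code) shows what it strips:

public class QuotesCheck {
  public static void main(String[] args) {
    String[] samples = { "\"842.17\"", "\"Feb 16, 2017\"", "842.17" };
    for (String s : samples) {
      // Same regex as the UDF: remove a double quote at the start or end of the string.
      System.out.println(s + " -> " + s.replaceAll("^\"|\"$", ""));
    }
  }
}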

Now, run hive and add the JAR file. (Note that you add it from the local file system and not Hadoop.)

add jar /home/walker/Documents/bmc/Quotes.jar;

And to show that it is there:

list jars

Now create this table:

CREATE TABLE IF NOT EXISTS google(
tradeDate STRING,
price STRING,
open STRING,
high STRING,
low STRING,
vol STRING,
change STRING
)
COMMENT 'google stocks'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

And load the data into it from the location in HDFS where we copied it:

load data inpath '/data/stocks/google.csv' into table google;

Now register it as a function. Note that the class name is the package name plus the name of the Quotes class that we wrote:

create temporary function noQuotes as 'com.example.hive.udf.Quotes';

Then you can use the Hive UDF like this:

select noQuotes(price), price from google;

It will show:

2017 2017"
2017 2017"
2017 2017"
2017 2017"
2017 2017"
2017 2017"
2017 2017"
2017 2017"
2017 2017"

Note: The added JAR and the temporary function disappear when you exit Hive. So, for an exercise, see how you can make Hive store that permanently.

Apache Hive Beeline Client, Import CSV File into Hive

Beeline has replaced the Hive CLI that shipped with what was formerly called HiveServer1. The server is now HiveServer2, and the new, improved CLI is Beeline.

Apache Hive says, “HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline. HiveCLI is now deprecated in favor of Beeline, as it lacks the multi-user, security, and other capabilities of HiveServer2.”

Here we are going to show how to start HiveServer2 and load a CSV file into it. The syntax is much different from HiveServer1, which we wrote about here.

The data we load is weather data downloaded from https://www.ncdc.noaa.gov/cdo-web/results. There is a lot of data there, so we picked "daily summaries" for Atlanta, Georgia, and just one year, 2017.

The data looks like this after the header line has been removed (use a text editor to do that):

GHCND:US1GADK0001,TUCKER 1.3 ENE GA US,20170101,0.30,-9999,-9999
GHCND:US1GADK0001,TUCKER 1.3 ENE GA US,20170102,0.76,-9999,-9999

Now install Hive using the instructions from the Apache web site, and then set $HIVE_HOME. Start Hadoop using start-dfs.sh for a local installation, or start-dfs.sh and then start-yarn.sh for a cluster installation.

How Beeline differs from Hive CLI

One big difference you will notice going from HiveServer1 to HiveServer2 is that you do not have to create the metastore schema in MySQL or Derby by hand. Instead, with HiveServer2, install MySQL and then run this command to create the schema:

$HIVE_HOME/bin/schematool -dbType mysql -initSchema

Start Hive Server

Then start the Hive server:

sudo nohup $HIVE_HOME/bin/hiveserver2 &

Beeline Hive Shell

Here is how to open the Beeline Hive shell:

$HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000

Now, the user root will get this error when they run commands:

User: root is not allowed to impersonate anonymous

So when you set up the config files, give root permission to impersonate other users if you want to log in as user root.

Notice below that we specifically give the root user the rights to execute Hive commands. You can switch the name root to your userid or read the instructions to get more details on the other options.

cat $HIVE_HOME/conf/core-site.xml
<configuration>
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
</configuration>

The hive2 connection string used above is a standard JDBC URL, which means you can connect to Hive from Java programs and SQL clients such as SQuirreL. Hive's own configuration file, hive-site.xml, is where the metastore connection is defined; here it points at a local MySQL database:

cat $HIVE_HOME/conf/hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
<description>password for connecting to mysql server</description>
</property>
</configuration>

Now create the weather table:

CREATE table weather(
STATION STRING,
STATION_NAME STRING,
WDATE STRING,
PRCP FLOAT,
WIND INT,
SNOW INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Then load the data from the local file system.

LOAD DATA LOCAL INPATH '/root/Documents/1031764.csv' OVERWRITE INTO TABLE weather;

Now show that there is some data in it:

select count(*) from weather;
_c0
4235
1 row selected (1.914 seconds)
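
Because HiveServer2 exposes a standard JDBC endpoint (the same jdbc:hive2://localhost:10000 URL Beeline connects to), you can run that query from a small Java program as well. This is a minimal sketch, assuming the hive-jdbc driver jar is on the classpath and that the connecting user is allowed to run queries:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class WeatherCount {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; shipped in the hive-jdbc jar.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection con = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "root", "");
         Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery("select count(*) from weather")) {
      if (rs.next()) {
        System.out.println("rows: " + rs.getLong(1));
      }
    }
  }
}

Desktop SQL clients such as SQuirreL connect the same way, using the same driver class and URL.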

Using SQL With Hive

The SQL language Reference manual for Hive is here. It will be useful to follow along.

Let's see which day of the month is the rainiest, summed across all months of the year.

First let's show the schema of our table:

describe weather;

col_name data_type comment
station string
station_name string
wdate string
prcp float
wind int
snow int

Now, Hive will not let us group on the alias of a calculated column, and we would rather not create intermediate tables or views. That means we cannot divide this into simpler, concrete steps; all we can do is write nested SQL statements. In this case we do three:

  1. Turn the date field into a day of the month.
  2. Using that query, total the rainfall per day of the month.
  3. Finally, print the results sorted by the amount of rain per day of the month.

Here is the complete query.

Select b.rain, b.day from
(select sum(prcp) as rain, a.day as day from
(select prcp, dayofmonth(to_date(from_unixtime(unix_timestamp(wdate, "yyyyMMdd")))) as day from weather where prcp > 0) a
Group by a.day) b
Order by b.rain;

Looking at the innermost query first, we parse the date string with unix_timestamp(), turn it into a date with from_unixtime() and to_date(), and then use dayofmonth() to get the day of the month.

select prcp,
dayofmonth(to_date(from_unixtime(unix_timestamp(wdate, "yyyyMMdd")))) as day from weather where prcp > 0

That result lacks the sum of precipitation. After all, we cannot sum the precipitation by day of month until we have grouped the data by day of month. And remember, we said we cannot order by calculated values, so we need another step for that.

This next SQL shows the rainfall by day. So we need one step more after this to order it by total rainfall and not day.

select sum(prcp) as rain, a.day as day from
(select prcp, dayofmonth(to_date(from_unixtime(unix_timestamp(wdate, "yyyyMMdd")))) as day from weather where prcp > 0) a
Group by a.day

Finally, running the complete query above shows that the 6th day of the month is the rainiest:

b.rain b.day
0.9000000040978193 27
2.240000018849969 17
3.7899999916553497 29
5.250000001862645 9
5.830000016838312 13
5.970000043511391 11
7.929999986663461 25
9.790000105276704 16
10.079999938607216 10
10.439999951049685 14
10.459999982267618 18
11.820000018924475 26
14.159999972209334 28
14.389999927952886 12
15.17000013589859 19
18.720000019297004 15
20.27999988757074 1
20.329999884590507 4
21.089999973773956 31
21.169999765232205 7
26.680000007152557 20
35.399999901652336 30
35.699999986216426 8
36.230000007897615 2
37.28000004403293 24
44.0399998370558 23
47.20999996364117 3
63.980000307783484 5
77.87000086344779 21
82.32000004313886 22
82.78000020980835 6