Using Hive Advanced User Defined Functions with Generic and Complex Data Types

Previously we showed how to write user-defined functions that can be called from Hive. You can write these in Java or Scala. (Python does not work for UDFs per se; instead, you use Python scripts with the Hive TRANSFORM operation.)

Programs that extend org.apache.hadoop.hive.ql.exec.UDF handle primitive data types, e.g., int, string, etc. If you want to process complex types, you need to use org.apache.hadoop.hive.ql.udf.generic.GenericUDF instead. The complex types are array, map, struct, and uniontype.
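
For contrast, here is a minimal sketch of a simple UDF over a primitive type; the class name ToUpper and its body are our own illustration, not code from the earlier post:

import org.apache.hadoop.hive.ql.exec.UDF

// Hive finds evaluate() by reflection, so plain Java/Scala
// types work for both the arguments and the return value.
class ToUpper extends UDF {
  def evaluate(s: String): String =
    if (s == null) null else s.toUpperCase
}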

Generic functions extend org.apache.hadoop.hive.ql.udf.generic.GenericUDF and override the three methods shown below.

import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector

class MapUpper extends GenericUDF {

  // Check the argument count and types; return an ObjectInspector
  // describing the type of value that evaluate() will return.
  override def initialize(args: Array[ObjectInspector]): ObjectInspector = ???

  // Do the actual work on the (lazily evaluated) arguments.
  override def evaluate(args: Array[DeferredObject]): Object = ???

  // Text displayed in error messages and EXPLAIN output.
  override def getDisplayString(args: Array[String]): String = "silly me"
}

This is the same as the simple UDF code, except there are two additional methods: initialize and getDisplayString. The first sets up an ObjectInspector; the second supplies the text displayed if there is an error.

initialize looks at the arguments passed from Hive SQL to the function. There you check the argument count and types, and work out what type of value was passed in. The evaluate function then operates on those typeless arguments, each wrapped in an org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject.

As you can see from the Scala code above, evaluate returns a value of type Object, meaning there is no type declaration and no way for the compiler to catch type errors; those only show up at runtime.

The initialize function returns an ObjectInspector that tells Hive the type of value evaluate will return.
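
To make that concrete, here is a minimal sketch of what a complete MapUpper might look like, assuming it takes a single map<string,string> and uppercases the values. The body is our own illustration of the pattern, not the original program (a production version would also verify that the argument really is a map before casting):

import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector.{MapObjectInspector, ObjectInspector, ObjectInspectorFactory}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

class MapUpper extends GenericUDF {

  // Saved in initialize() so evaluate() can read the incoming map.
  private var mapOI: MapObjectInspector = _

  override def initialize(args: Array[ObjectInspector]): ObjectInspector = {
    if (args.length != 1)
      throw new UDFArgumentException("MapUpper takes exactly one map argument")
    mapOI = args(0).asInstanceOf[MapObjectInspector]
    // Tell Hive that evaluate() returns a map<string,string>.
    ObjectInspectorFactory.getStandardMapObjectInspector(
      PrimitiveObjectInspectorFactory.javaStringObjectInspector,
      PrimitiveObjectInspectorFactory.javaStringObjectInspector)
  }

  override def evaluate(args: Array[DeferredObject]): Object = {
    val raw = mapOI.getMap(args(0).get())
      .asInstanceOf[java.util.Map[AnyRef, AnyRef]]
    if (raw == null) return null
    val result = new java.util.HashMap[String, String]()
    val it = raw.entrySet().iterator()
    while (it.hasNext) {
      val e = it.next()
      result.put(String.valueOf(e.getKey), String.valueOf(e.getValue).toUpperCase)
    }
    result
  }

  override def getDisplayString(args: Array[String]): String = "silly me"
}

Note that initialize runs once, before any rows are processed, so the ObjectInspector it saves can safely be reused on every call to evaluate.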

Create Some Hive Map Data

We do not develop a complete, production-ready example here. Instead we explain how you would set up to write a GenericUDF with a Map data type, following the general outline and sketch above.

First, we create some data of Hive Map type. Run Hive and then execute:

create table students (student map<string,string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';

This creates a table with one column: a map.

That will let you parse a line of text like this:

name:Walker-class:algebra-grade:B-teacher:Newton

Then you can load that into Hive like this:

load data local inpath '/home/walker/Documents/hive/students.txt' into table students;

Querying the table then produces this output:

select * from students;
OK
{"name":"Walker","class":"algebra","grade":"B","teacher":"Newton"}

As you can see, the student column is a (key -> value) map.

Note that a file load like this is the usual way to get data into a Map column. The Hive documentation makes clear that you cannot write Map values with literal SQL:

“Hive does not support literals for complex types (array, map, struct, union), so it is not possible to use them in INSERT INTO…VALUES clauses. This means that the user cannot insert data into a complex datatype column using the INSERT INTO…VALUES clause.”
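
There is a workaround, though: the built-in map() function can build a map inside a SELECT, and INSERT ... SELECT is allowed. A sketch, assuming Hive 0.13 or later (where the FROM clause is optional) and using made-up row values:

INSERT INTO TABLE students
SELECT map('name', 'Jones', 'class', 'geometry', 'grade', 'A', 'teacher', 'Euclid');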

Run Program in Hive

To compile a program like this, you make the required Hive and Hadoop jar files available by setting the classpath:

export CLASSPATH=/usr/local/hive/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar:/usr/hadoop/hadoop-2.8.1/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.8.1.jar:/home/walker/Documents/bmc/hadoop-common-2.8.1.jar

All of those are found in the Hive and Hadoop lib folders, except for hadoop-common, which you can download from Maven Central.
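
The jar name in the next section (mapupper_2.12-1.0.jar) suggests an sbt build. If you go that route, a minimal build.sbt might look like the sketch below; the version numbers are our assumption, chosen to match the jars above:

// build.sbt -- a minimal sketch; versions assumed to match the jars above
name := "mapupper"
version := "1.0"
scalaVersion := "2.12.4"

libraryDependencies ++= Seq(
  "org.apache.hive" % "hive-exec" % "2.3.0" % "provided",
  "org.apache.hadoop" % "hadoop-common" % "2.8.1" % "provided"
)

Running sbt package then produces the jar under target/scala-2.12/.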

Add Jar to Hive

After you have written and compiled your program, package it in a jar file. Then make it available in Hive with the commands below, where MapUpper is the name of the example class we use here:

add jar /home/walker/Documents/bmc/udf/target/scala-2.12/mapupper_2.12-1.0.jar;
create temporary function MapUpper as 'MapUpper';

Then you can run this command to execute the MapUpper function against the student column in the students table:

select MapUpper(student) from students;

This will run whatever operation evaluate implements on the keys or values and return a new map. Or it could return a primitive type, if that is what you need.
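
With the illustrative implementation sketched earlier, which uppercases the map values, you would expect output along these lines:

OK
{"name":"WALKER","class":"ALGEBRA","grade":"B","teacher":"NEWTON"}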

Walker Rowe

Walker Rowe is an American freelance tech writer and programmer living in Chile. He specializes in big data, analytics, and cloud architecture. Find him on LinkedIn or at Southern Pacific Review, where he publishes short stories, poems, and news.