
What is Apache HCatalog? HCatalog Explained

Walker Rowe

Here we explain what HCatalog is and why it is useful to Hadoop programmers.

Basically, HCatalog provides a consistent interface between Apache Hive, Apache Pig, and MapReduce. Since it ships with Hive, you could consider it an extension of Hive.

(We have written tutorials here on Apache Pig, MapReduce, and Hive.)

Why This Matters

To understand why this is important, consider how each tool represents data. MapReduce programs run a map and an optional reduce operation over data values to produce key->value pairs. Apache Pig creates tuples of data. And data stored in Hive sits in table records, just like in a relational database. Because the three store data in different ways, they are not so easy to use together.

For example, an Apache Pig tuple can look like this:

(1, {(1,2,3,4,5), (1,2,3,4,5)})

MapReduce stores data in key->value pairs, like this:

{
name -> Walker Rowe,
Position -> freelance technical writer
}

An Apache Hive table looks just like an RDBMS table created with a SQL command:

col_name data_type
station string
station_name string
wdate date
prcp float
wind float
snow float
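
For reference, a table of that shape can be created with ordinary Hive DDL. This is a sketch rather than the exact statement behind the table above:

CREATE TABLE weather (
  station string,
  station_name string,
  wdate date,
  prcp float,
  wind float,
  snow float
);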

Using Hive Metadata

How is HCatalog useful? One example is clear right away.

Once you use HCatalog, you no longer have to supply file paths or even the schema, because Hive already knows all of that. For example, the Pig statement below loads a file from a Hadoop file system and spells out the schema by hand.

a = LOAD 'hdfs://localhost:9000/user/hadoop/sales.csv' USING PigStorage(',') AS (shop:chararray,employee:chararray,sales:int);

With HCatalog, that becomes the far simpler statement below. Note that HCatLoader takes the name of a Hive table rather than a file path:

a = LOAD 'sales' USING org.apache.hive.hcatalog.pig.HCatLoader();

So how does a MapReduce programmer use this? Without HCatalog, they would have to write a program to consume Apache Pig tuples stored on the Hadoop file system.

The goal of HCatalog is to let Pig and MapReduce use the same data structures as Hive, so there is no need to convert data between formats.

This concept is best visualized in two graphics from Hortonworks.

The first shows that all three products use Hadoop to store data. Hive stores its metadata (i.e., schema) in MySQL or Derby. The other two keep track of schema through code written into each program and its input and output operations.

The second graphic shows that HCatalog exposes Hive data and metadata to MapReduce and Pig directly. This is done using the interfaces shown in yellow. The end result is that the user can work with Hive tables as if they were MapReduce key->value pairs or Pig tuples.

All of this supports sharing data between programs, and between programmers, too. For example, HCatalog exposes Hive metadata as a REST web service (WebHCat), so other programs can read it. And when an HCatalog task finishes, it can post a JMS message to signal, say, an Apache Pig program to run.
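
As a sketch of what that REST service looks like, the call below asks WebHCat to describe the weather table. It assumes WebHCat is running locally on its default port 50111 and that hadoop is a valid user name on your cluster:

curl -s 'http://localhost:50111/templeton/v1/ddl/database/default/table/weather?user.name=hadoop'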

Interface Abstraction

In the words of Apache HCatalog, "HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written." That means that, in addition to formats supported out of the box, such as CSV, JSON, RCFile, SequenceFile, and ORC, you can write your own. For example, here is a discussion of how to do that with Apache Hive.
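
For instance, this sketch tells Hive to parse each row of a table as JSON using the JsonSerDe class that ships with HCatalog. The table name and columns here are invented for illustration:

CREATE TABLE sales_json (
  shop string,
  employee string,
  sales int
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;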

Using HCatalog and Apache Pig

You can run Pig like this to tell it to use HCatalog.

pig -useHCatalog

Pig then uses what is called HCatLoader to work with data managed by HCatalog. But for this to work, you need to set the PIG_CLASSPATH and PIG_OPTS environment variables so Pig can find the HCatalog jars and the Hive metastore, as shown below.
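
A typical setup looks something like this; the exact paths and the metastore host are assumptions that depend on your installation:

export HCAT_HOME=$HIVE_HOME/hcatalog
export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/*:$HIVE_HOME/lib/*:$HIVE_HOME/conf
export PIG_OPTS=-Dhive.metastore.uris=thrift://localhost:9083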

MapReduce

MapReduce code is usually Java code.

HCatalog's HCatInputFormat and HCatOutputFormat are implementations of the Hadoop InputFormat and OutputFormat interfaces (HCatLoader and HCatStorer are the equivalents on the Pig side). They let a MapReduce job read records from a Hive table through the standard org.apache.hadoop.mapreduce machinery, run map and reduce operations over them, and then save the results back into Hive.
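
Here is a minimal sketch of the read side, assuming the weather table from earlier exists in the default Hive database. The mapper simply emits one (station, 1) pair per record; the class name and output path are our choices for illustration, not part of HCatalog:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

// Reads the Hive table "weather" through HCatalog and emits one
// (station, 1) pair per record; a stock sum reducer could follow.
public class StationCount extends Configured implements Tool {

  public static class StationMapper
      extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(WritableComparable key, HCatRecord record, Context ctx)
        throws java.io.IOException, InterruptedException {
      // Field 0 is "station" in the schema shown earlier.
      ctx.write(new Text(record.get(0).toString()), ONE);
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "station-count");
    job.setJarByClass(StationCount.class);
    // Tell HCatalog which database and table to read.
    HCatInputFormat.setInput(job, "default", "weather");
    job.setInputFormatClass(HCatInputFormat.class);
    job.setMapperClass(StationMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(job, new Path(args[0]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new StationCount(), args));
  }
}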

HCatalog Partitions and the Hive Column-oriented Database

If you are familiar with Apache Cassandra, which we wrote about here, then you know that it is a column-oriented database. HCatalog offers a similar benefit by letting you create partitions. The whole point of column-oriented storage is to group commonly-read fields together on disk for fast retrieval. That makes a lot more sense when you want to:

select one_field from table;

Instead of retrieving the whole table, you retrieve only the columns you need, which boosts speed, since the data sits next to each other, and saves memory too. HCatalog does something similar by letting you divide tables into partitions, each stored in its own set of files. Pig and MapReduce can take advantage of this to read data much faster than they would with a plain Pig LOAD or a buffered stream from Hadoop.
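
As a sketch, a partitioned version of the hypothetical sales table could be declared like this. Hive then writes each shop's rows into a separate directory, so a reader that filters on shop touches only those files:

CREATE TABLE sales_by_shop (
  employee string,
  sales int
)
PARTITIONED BY (shop string);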

Installation and Using the HCatalog CLI

To use HCatalog, first install Hadoop and Hive. See the instructions on the Hadoop and Hive web sites for that. It will take a while. My advice is to use MySQL instead of Apache Derby for the Hive metastore, as many users have complained on StackOverflow about the difficulty of getting Derby to work. That has been my experience as well.

HCatalog is installed with Hive, so there is nothing to do to use it except set these variables:

export PATH=$PATH:$HIVE_HOME/hcatalog/bin
export HCAT_HOME=$HIVE_HOME/hcatalog

Now run hcat. It should echo some command line options and then return the command prompt.

Now you can run Hive DDL and SQL commands from the command prompt, like this command:

hcat -e 'show tables;'

In this case, it will show you the weather table that we created in the article Apache Hive Beeline Client, Import CSV File into Hive.

hcat -e 'describe weather;'

station string
station_name string
wdate string
prcp float
wind int
snow int
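
You can run other DDL the same way. For example, this sketch creates a throwaway table (the name is invented):

hcat -e 'create table hcat_test (id int, name string);'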

