Using Apache Hive with ElasticSearch


Here we explain how to use Apache Hive with ElasticSearch. We will copy an Apache webserver log into ElasticSearch then use Hive SQL to query it.

Why do this? Hive lets you write user defined functions and use SQL (actually HQL) which is easier to work with and provides more functions that ElasticSearch, whose query language is Lucene Query. For example you can join sets of data with Hive. And you can run advanced analytics against Hive using Spark ML (machine learning) or other tool.

As we pointed out before, some tutorials are written to show how to store Hive data in ElasticSearch. But that is not logical as the whole goal of ES is to gather logs from webservers, firewalls, etc. and put them in one place (ES) where they can be queried. This is for cybersecurity and operations monitoring.

ElasticSearch provides the elasticsearch-hadoop connector to let you read (and write) ES documents. What happens when you do that is creates data in Hive tables from ES. Hive does not store the data in ES.

Install Hive

First, install Hive using these directions. After this MySQL will be running. If it is not you will get error Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.

And install Hadoop.

Then download the Hive-Hadoop connector from here and copy it to $HIVE_HOME/lib

cp /home/walker/Documents/jars/elasticsearch-hadoop-5.5.2.jar $HIVE_HOME/lib

Start Hadoop using It is only necessary to start Yarn too if you are running on a cluster.

Start Hive

Start hive using:


Or you can run it in debug mode to see and then fix errors with your installation. For example if you forget to copy the connector jar to the lib folder it will throw a class not found error.

hive --hiveconf hive.root.logger=DEBUG,console

You can also see stdout here /tmp/(your userid)/hive.log

Install ElasticSearch and Load Some Data

Explaining how to install ES is beyond the scope of this data. You can follow their instructions here.

Download Sample Data and Upload

We need some data to analyze. You can download a sample Apache log from here.

You will need to install unrar to unzip it:

unrar x apache-access_log.rar

Then copy this Apache logstash config from here and name it apache.conf. Put it in the root folder of the logstash installation.

input {
file {
path => "/tmp/access_log"
start_position => "beginning"
filter {
if [path] =~ "access" {
mutate { replace => { "type" => "apache_access" } }
grok {
match => { "message" => "%{COMBINEDAPACHELOG}" }
date {
match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
output {
elasticsearch {
hosts => ["localhost:9200"]
stdout { codec => rubydebug }

Then start logstash with that config file.

bin/logstash -f apache.conf

Then copy the data to the /tmp/access_log file path indicated in the apache.conf file.

cp access_log /tmp/

The screen will start to echo the output like this as logstash goes to work:

"path" => "/tmp/access_log",
"@timestamp" => 2017-09-21T12:16:33.346Z,
"@version" => "1",
"host" => "eurovps",
"message" => " - - [12/Mar/2004:13:25:45 -0800] \"GET /SpamAssassin.html HTTP/1.1\" 200 7368\r",
"type" => "apache_access",
"tags" => [
[0] "_grokparsefailure"

Now if you query ES from the command line you should see the new index and document count:

curl http://localhost:9200/_cat/indices?vhealth status index      uuid      pri rep docs.count docs.deleted store.size open logstash-2017.09.21 DBRrjIlvQtG67ddb7qUnxw 5 1 1550

Then open Kibana and see the data there as well.

Now tell Hive to read this data by creating a table like this:

path string,
timex timestamp,
Version int,
Host string,
Message string,
Index string
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES( 'es.nodes.wan.only' = 'true', 'es.resource' = 'logstash-2017.09.21', 'es.query' = '?q=*');

es.resource is the ES index.
Es.nodes.wan.only will solve network connectivity problems (i.e., if Hive complains about that.)

Host will default to localhost.

ES will create this:

describe apachelog;
path          string          from deserializer
timex         timestamp       from deserializer
version       int             from deserializer
host          string          from deserializer
message       string          from deserializer
index         string          from deserializer

Now, we did not set up grok correctly or it would have parsed the message field in Apache. But we cal pull out the IP address ourselves like this:

select regexp_extract(message, '^([0-9]{1,3}\\.){3}[0-9]{1,3}',0) as ip from apachelog;

Now we can run a query to show which IP addresses have access the web server the most number of times. We do this in two steps:

  1. Create intermediate table iptab.
  2. Query table iptab.

create table iptab as select regexp_extract(message, '^([0-9]{1,3}\\.){3}[0-9]{1,3}',0) as ip from apachelog;

Here is the count:

select count(ip) as c, ip from iptab group by ip order by c desc;452

Related posts:

Wikibon: Automate your Big Data pipeline

Learn how data management experts throughout the industry are transforming their Big Data infrastructure for maximum business impact.

Download Now ›

These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

Share This Post

Walker Rowe

Walker Rowe

Walker Rowe is an American freelance tech writer and programmer living in Chile. He specializes in big data, analytics, and cloud architecture. Find him on LinkedIn or at Southern Pacific Review, where he publishes short stories, poems, and news.