How To Use Jupyter Notebooks with Apache Spark
https://s7280.pcdn.co/jupyter-notebooks-apache-spark/

Apache Spark is an open-source, fast unified analytics engine developed at UC Berkeley for big data and machine learning. Spark utilizes in-memory caching and optimized query execution to provide a fast and efficient big data processing solution.

Moreover, Spark can easily support multiple workloads ranging from batch processing, interactive querying, real-time analytics to machine learning and graph processing. All these capabilities have led to Spark becoming a leading data analytics tool.

From a developer perspective, one of the best attributes of Spark is its support for multiple languages. Unlike many other platforms with limited options or requiring users to learn a platform-specific language, Spark supports all leading data analytics languages such as R, SQL, Python, Scala, and Java. Spark offers developers the freedom to select a language they are familiar with and easily utilize any tools and services supported for that language when developing.

When considering Python, Jupyter Notebooks is one of the most popular tools available for a developer. Yet, how can we make a Jupyter Notebook work with Apache Spark? In this post, we will see how to incorporate Jupyter Notebooks with an Apache Spark installation to carry out data analytics through your familiar notebook interface.

(This tutorial is part of our Apache Spark Guide. Use the right-hand menu to navigate.)

How Jupyter Notebooks work

Jupyter is an interactive computational environment managed by Project Jupyter and distributed under the modified BSD license. A notebook is a shareable document that combines inputs and outputs in a single file. These notebooks can consist of:

  • Code
  • Mathematical equations
  • Narrative text
  • Visualizations
  • Statistical modeling
  • Other rich media

The beauty of a notebook is that it allows developers to develop, visualize, analyze, and add any kind of information to create an easily understandable and shareable single file. This approach is highly useful in data analytics as it allows users to include all the information related to the data within a specific notebook.

Jupyter supports over 40 programming languages and comes in two formats: the classic Jupyter Notebook and JupyterLab.

JupyterLab is the next-generation notebook interface, which extends Jupyter into a more flexible tool that can support any workflow from data science to machine learning. Jupyter also supports big data tools such as Apache Spark for data analytics needs.

(Read our comprehensive intro to Jupyter Notebooks.)

How to connect Jupyter with Apache Spark

Scala is the ideal language for interacting with Apache Spark, since Spark itself is written in Scala.

However, most developers prefer to use a language they are familiar with, such as Python. Jupyter supports both Scala and Python, but Python is the more flexible choice in most cases due to its robustness, ease of use, and the availability of libraries like pandas, scikit-learn, and TensorFlow. While projects like almond allow users to add Scala to Jupyter, we will focus on Python in this post.

(See why Python is the language of choice for machine learning.)

PySpark for Apache Spark & Python

Python connects with Apache Spark through PySpark. PySpark allows users to write Spark applications using the Python API and to interface with Spark's Resilient Distributed Datasets (RDDs). Under the hood, it lets Python interact with JVM objects using the Py4J library. PySpark supports most Apache Spark features, including Spark SQL, DataFrames, MLlib, Spark Core, and Streaming.
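
To make this concrete, here is a minimal, self-contained PySpark sketch that touches both the DataFrame API and the underlying RDD interface. It is illustrative only; the application name and sample values are not taken from the example later in this post.

# A minimal PySpark sketch (illustrative values; assumes PySpark is installed)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-sketch").getOrCreate()

# DataFrame API: build a tiny dataframe and display it
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

# RDD interface: the same data is reachable as an RDD of Row objects
print(df.rdd.map(lambda row: row.id * 10).collect())

spark.stop()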

Configuring PySpark with Jupyter and Apache Spark

Before configuring PySpark, we need to have Jupyter and Apache Spark installed. In this section, we will cover the simple installation procedures for Spark and Jupyter. You can follow the Jupyter Notebooks for Data Analytics guide for detailed instructions on installing Jupyter, and the official Spark documentation to set Spark up in your local environment.

Installing Spark

You will need Java, Scala, and Git as prerequisites for installing Spark. We can install them using the following command:

sudo apt install default-jdk scala git -y

Then, get the latest Apache Spark version, extract the content, and move it to a separate directory using the following commands.

wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
tar xf spark-*
sudo mv spark-3.2.0-bin-hadoop3.2 /opt/spark


Then we can set up the environment variables by adding them to the shell configuration file (e.g., .bashrc or .zshrc) as shown below.

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3

The SPARK_HOME variable points to the Apache Spark installation, and PATH adds the Spark binary directories (under SPARK_HOME) to the system path. The PYSPARK_PYTHON variable points to the Python installation that Spark should use.

Finally, run the start-master.sh command to start Apache Spark. You can confirm the installation succeeded by visiting http://localhost:8080/.


Installing Jupyter

Installing Jupyter is straightforward. It can be installed directly via the Python package manager using the following command:

pip install notebook

Installing PySpark

There’s no need to install PySpark separately, as it comes bundled with Spark. However, you also have the option of installing PySpark, along with extra dependencies such as Spark SQL or pandas support, as a separate package via the Python package manager.

You can directly launch PySpark by running the following command in the terminal.

pyspark


Integrating PySpark with Jupyter Notebook

The only requirement for getting Jupyter Notebook to work with PySpark is to add the following environment variables to your .bashrc or .zshrc file, which point PySpark to Jupyter.

export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=8889'

PYSPARK_DRIVER_PYTHON points to Jupyter, while PYSPARK_DRIVER_PYTHON_OPTS defines the options used when starting the notebook. In this case, it sets the no-browser option and port 8889 for the web interface.

With the above variables, your shell file should now include five environment variables required to power this solution.


Now, we can directly launch a Jupyter Notebook instance by running the pyspark command in the terminal.


Important note: Always make sure to refresh the terminal environment; otherwise, the newly added environment variables will not be recognized.

Now visit the provided URL, and you are ready to interact with Spark via the Jupyter Notebook.

Testing the Jupyter Notebook

Now that the integration is configured, the only thing left is to test that everything works. Let’s run a simple Python script that uses the PySpark libraries and creates a DataFrame with a test data set.

Create the data frame:

# Import Libraries
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import pyspark
from pyspark import SQLContext
from pyspark.sql import SparkSession

# Setup the Configuration
conf = pyspark.SparkConf()

# Create the Spark session, then an SQLContext for building the data frame
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sqlcontext = SQLContext(spark.sparkContext)

# Setup the Schema
schema = StructType([
StructField("User ID", IntegerType(),True),
StructField("Username", StringType(),True),
StructField("Browser", StringType(),True),
StructField("OS", StringType(),True),
])

# Add Data
data = ([(1580, "Barry", "FireFox", "Windows" ),
(5820, "Sam", "MS Edge", "Linux"),
(2340, "Harry", "Vivaldi", "Windows"),
(7860, "Albert", "Chrome", "Windows"),
(1123, "May", "Safari", "macOS")
])

# Setup the Data Frame
user_data_df = sqlcontext.createDataFrame(data,schema=schema)

Print the data frame:

user_data_df.show()


If we look at the PySpark Web UI, which is accessible via port 4040, we can see the script execution job details as shown below.


The power of Spark + Jupyter

Apache Spark is a powerful data analytics and big data tool. PySpark allows users to interact with Apache Spark without having to learn a different language like Scala. The combination of Jupyter Notebooks with Spark provides developers with a powerful and familiar development environment while harnessing the power of Apache Spark.


How to Apply Machine Learning to Cybersecurity
https://www.bmc.com/blogs/machine-learning-cybersecurity/

In this article, we’ll show how to apply machine learning to cybersecurity. There are several use cases, but this article will focus on analyzing router logs.

(This tutorial is part of our Apache Spark Guide. Use the right-hand menu to navigate.)

Why use machine learning with cybersecurity

It’s almost impossible for an analyst looking at a time series chart of network traffic to draw any conclusion from what they are looking at. Why? People can’t see more than three dimensions.  And, too many false alerts cause analysts to simply ignore some of what they’re seeing—too much noise.

But machine learning makes it possible to flush out, for example, criminal hackers who are stealing data from your system and transmitting it to their command and control center. This is what intrusion detection systems are supposed to catch, but hackers use all kinds of techniques to avoid detection by traditional cybersecurity systems. For example, they could break stolen data into small pieces and send each piece to a different IP address, such as hijacked home computers, then use those hijacked machines to forward the pieces to the hackers’ command and control center.

Machine learning helps us distill dozens or hundreds of data points into one or two metrics. Then, we can build our charts and alerts around those. Now, those alerts are significantly more valuable.

In this example, we’ll illustrate one approach to looking at network traffic. We use router logs provided by Verizon, captured in the Bro format. We’ll group each record into one of seven clusters, then look at the clusters with the smallest numbers of entries. Those, by definition, are our outliers.

K-means clustering

We use the k-means clustering algorithm, which separates data along any number of axes. (For more, see k-means clustering with Apache Spark and Python Spark ML K-Means Examples or browse our Apache Spark guide using the right-hand menu.)

A data scientist would say that we are threading a hyperplane into n-dimensional space between the data points. Because we can’t visualize this, think of a 3D space, then thread a piece of paper between each set of data points such that points in one group are on one side of the paper and points in the other group are on the other.

This is an unsupervised model because there are no labels, only features. So, we don’t need to train the model, as there’s nothing to predict. Instead we are observing.

The code, explained

The code is available here, and the data here. This is data from a network analysis tool called Zeek, formerly called Bro.

The University of Cincinnati provides this description of the columns in this data:

  • ts—time; timestamp
  • uid—string; unique ID of connection
  • orig_h—addr; originating endpoint’s IP address (aka ORIG)
  • orig_p—port; originating endpoint’s TCP/UDP port or ICMP code
  • resp_h—addr; responding endpoint’s IP address (aka RESP)
  • resp_p—port; responding endpoint’s TCP/UDP port or ICMP code
  • proto—transport_proto; transport layer protocol of connection
  • service—string; dynamically detected application protocol, if any
  • duration—interval; time of last packet seen to time of first packet seen
  • orig_bytes—count; originator payload bytes, from sequence numbers if TCP
  • resp_bytes—count; responder payload bytes, from sequence numbers if TCP
  • conn_state—string; connection state (see conn.log:conn_state table)
  • local_orig—bool; if conn originated locally T; if remotely F. If Site::local_nets empty, always unset
  • missed_bytes—count; number of missing bytes in content gaps
  • history—string; connection state history (see conn.log:history table)
  • orig_pkts—count; number of ORIG packets
  • orig_ip_bytes—count; number of ORIG IP bytes (via IP total_length header field)
  • resp_pkts—count; number of RESP packets
  • resp_ip_bytes—count; number of RESP IP bytes (via IP total_length header field)
  • tunnel_parents—set; If tunneled, connection UID of encapsulating parent (s)
  • orig_cc—string; ORIG GeoIP country code
  • resp_cc—string; RESP GeoIP country code

First, we load the csv file into a Spark dataframe.

from pyspark.sql.types import StructType, StructField, FloatType, BooleanType
from pyspark.sql.types import DoubleType, IntegerType, StringType
import pyspark
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import lit
from pyspark.sql.functions import udf, concat
 
from pyspark import SQLContext
 
conf = pyspark.SparkConf() 
 
sc = pyspark.SparkContext.getOrCreate(conf=conf)
sqlcontext = SQLContext(sc)

schema = StructType([
    StructField("ts", StringType(),True),    
    StructField("uid", StringType(),True),     
    StructField("origh", StringType(),True),         
    StructField("origp", StringType(),True),     
    StructField("resph", StringType(),True),      
    StructField("respp", StringType(),True),   
    StructField("proto", StringType(),True),     
    StructField("service" , StringType(),True),        
    StructField("duration", FloatType(),True),     
    StructField("origbytes", StringType(),True),     
    StructField("respbytes", StringType(),True),       
    StructField("connstate", StringType(),True),      
    StructField("localorig", StringType(),True),   
    StructField("missedbytes", StringType(),True),      
    StructField("history", StringType(),True),     
    StructField("origpkts", IntegerType(),True),     
    StructField("origipbytes", IntegerType(),True),       
    StructField("resppkts", IntegerType(),True),      
    StructField("respipbytes", IntegerType(),True),     
    StructField("tunnelparents", StringType(),True)    
              ])
        

df = sqlcontext.read.csv(path="/home/ubuntu/Documents/forensics/bigger.log", sep="\t", schema=schema) 
df2 = df.fillna(0)

Next, we register a UDF (user defined function).  We will use this to turn all the fields sent to this function into integers because machine learning, for the most part, only works with numbers.

colsInt = udf(lambda z: toInt(z), IntegerType())

sqlcontext.udf.register("colsInt", colsInt)

def toInt(s):
    if not s:
        return 0
    if isinstance(s, str) == True:
        st = [str(ord(i)) for i in s]
        return(int(''.join(st)))
    else:
        return s

Now, we create some additional columns which are the columns we have selected to feed into our model. For each of these, we will call the colsInt() UDF to convert those to numbers.

You could vary the choice of columns according to what hypotheses you want to follow. For example, below we look at the ports and traffic as well as the protocol.

  • There might be other metrics in that log that we could add or remove.
  • We should probably leave the destination IP address out of the model because of the hacker’s ability to hide their true destination.
  • We might drop UDP records, since sftp (which runs over TCP) is the protocol a hacker would most likely use to transmit stolen files.
  • Or, we could include the time of day in the local time zone to isolate after-hours events (see the sketch below).

It all depends on what kind of activity you want to focus on.
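
As a sketch of that last idea (this is not part of the original code), you could derive an hour-of-day column from the Zeek ts field, which the schema above loads as a string:

# Sketch: add an hour-of-day column (session/local time zone) to help isolate after-hours events.
# Assumes the df2 dataframe and the "ts" column defined in the schema above.
from pyspark.sql.functions import from_unixtime, hour

df_hours = df2.withColumn("hour", hour(from_unixtime(df2["ts"].cast("double"))))
df_hours.select("ts", "hour").show(5)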

Note that each of the .withColumn() statements creates a new dataframe. This is because Spark dataframes are immutable.

a = df2.withColumn( 'iorigp',colsInt('origp'))
c = a.withColumn( 'irespp',colsInt('respp'))
d = c.withColumn( 'iproto',colsInt('proto'))
e = d.withColumn('iorigh',colsInt('origh'))
f = e.withColumn( 'iorigbytes',colsInt( 'origbytes'))
g = f.withColumn( 'irespbytes',colsInt('respbytes'))
h = g.withColumn(  'iorigpkts',colsInt( 'origpkts'))
i = h.withColumn( 'iorigipbytes',colsInt('origipbytes'))

columns =  ['iorigp','irespp','iproto', 'iorigbytes','irespbytes','iorigpkts','iorigipbytes']

The next step adds a column called features to our dataframe. This is a vector assembled from the columns we have selected. The K-means algorithm expects a features column to be present.

vecAssembler = VectorAssembler(inputCols=columns, outputCol="features")
router = vecAssembler.transform(i)

Here, we use the K-means algorithm. One nice thing about Apache Spark is that its machine learning algorithms are easy to use. They don’t require the preprocessing and reshaping that other frameworks do, and they work with Spark dataframes, so we can work with much larger sets of data. (Pandas does not scale the way Spark dataframes do.)

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

 
kmeans = KMeans().setK(7).setSeed(1)
model = kmeans.fit(router)

predictions = model.transform(router)

p = predictions.groupby('prediction').count()
q = p.toPandas() 

We have grouped the observations into 7 clusters.  Cluster 0 has 40,303 router records, but cluster 2 has only 171. Clearly, those are outliers, so this is where we focus our cybersecurity analysis.

We can plot that as a bar chart to further show how the data is clustered.

from plotly.offline import plot

import pandas as pd


import plotly.graph_objects as go
fig = go.Figure(
    data=[go.Bar(x=q['prediction'],y=q['count'])],
    layout_title_text="K Means Count"
)
fig.show()

Plotly uses JavaScript to create popups to give you more information where you place the cursor. We’ve placed it at the point (prediction=2,count=121).

So, let’s make a new dataframe of just those records in cluster 2. (It’s actually row index 5 in the dataframe, so don’t confuse those two concepts.)

suspect = predictions.filter("prediction == 2")

Here we convert the output to Pandas, simply because the Jupyter notebook displays Pandas data more clearly than it displays Spark dataframes, which tend to chop off wide columns and make them hard to read.

x = suspect.select('ts','uid','origh','resph').toPandas()

You can see the same IP address shown more than a few times, which is probably a good place for further analysis. Look and see which machine it is and to whom it connects.

So, your analysts can look through logs in your applications, firewall, etc. to see what’s going on with those IP addresses. (Note that some of them are in IPv6 format.)
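
A quick way to surface those repeated addresses is to count occurrences per originating IP within the suspect cluster. This is a minimal sketch using the column names defined earlier, not code from the original article:

# Sketch: count how often each originating IP appears in the suspect cluster,
# then look at the most frequent ones first.
ip_counts = suspect.groupBy("origh").count().orderBy("count", ascending=False)
ip_counts.show(10)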

How to Write Spark UDFs (User Defined Functions) in Python
https://www.bmc.com/blogs/how-to-write-spark-udf-python/

In this article, I’ll explain how to write user defined functions (UDF) in Python for Apache Spark. The code for this example is here.

(This tutorial is part of our Apache Spark Guide. Use the right-hand menu to navigate.)

Why do you need UDFs?

Spark stores data in dataframes or RDDs—resilient distributed datasets. Think of these like database tables. As with a traditional SQL database, e.g. MySQL, you cannot create your own custom function and run it against the database directly; you have to register the function first. That is, you save it to the database as if it were one of the built-in database functions, like sum(), average(), or count().

That’s the case with Spark dataframes. With Spark RDDs you can run functions directly against the rows of an RDD.

Three approaches to UDFs

There are three ways to create UDFs:

  • df = df.withColumn
  • df = sqlContext.sql(“sql statement from <df>”)
  • rdd.map(customFunction())

We show the three approaches below, starting with the first.

Approach 1: withColumn()

Below, we create a simple dataframe and RDD. We write a function to convert the only text field in the data structure to an integer. That is something you might do if, for example, you are working with machine learning where all the data must be converted to numbers before you plug that into an algorithm.

Notice the imports below. Refer to those in each example, so you know what object to import for each of the three approaches.

Below is the complete code for Approach 1. First, we look at key sections. Create a dataframe using the usual approach:

df = spark.createDataFrame(data,schema=schema)

Now we do two things. First, we create a function colsInt and register it. That registered function calls another function toInt(), which we don’t need to register. The first argument in udf.register(“colsInt”, colsInt) is the name we’ll use to refer to the function. The second is the function we want to register.

colsInt = udf(lambda z: toInt(z), IntegerType())
spark.udf.register("colsInt", colsInt)

def toInt(s):
    if isinstance(s, str) == True:
        st = [str(ord(i)) for i in s]
        return(int(''.join(st)))
    else:
        return None

Then we call the function colsInt, like this. The first argument is the name of the new column we want to create. The second is the column in the dataframe to plug into the function.

df2 = df.withColumn( 'semployee',colsInt('employee'))

Remember that df['employee'] is a column object, not a single employee. That means we have to loop over all rows in that column—so we use this lambda (in-line) function.

colsInt = udf(lambda z: toInt(z), IntegerType())

Here is Approach 1 all together:

import pyspark
from pyspark import SQLContext
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType
from pyspark.sql.functions import udf
from pyspark.sql import Row


conf = pyspark.SparkConf() 

 
sc = pyspark.SparkContext.getOrCreate(conf=conf)
spark = SQLContext(sc)

schema = StructType([
    StructField("sales", FloatType(),True),    
    StructField("employee", StringType(),True),
    StructField("ID", IntegerType(),True)
])

data = [[ 10.2, "Fred",123]]

df = spark.createDataFrame(data,schema=schema)

colsInt = udf(lambda z: toInt(z), IntegerType())
spark.udf.register("colsInt", colsInt)

def toInt(s):
    if isinstance(s, str) == True:
        st = [str(ord(i)) for i in s]
        return(int(''.join(st)))
    else:
        return None


df2 = df.withColumn( 'semployee',colsInt('employee'))

Now we show the results. Notice that the new column semployee has been added. withColumn() creates a new dataframe so we created df2.

df2.show()

+-----+--------+---+----------+
|sales|employee| ID| semployee|
+-----+--------+---+----------+
| 10.2|    Fred|123|1394624364|
+-----+--------+---+----------+

Approach 2: Using SQL

The first step here is to register the dataframe as a table, so we can run SQL statements against it. df is the dataframe and dftab is the temporary table we create.

spark.registerDataFrameAsTable(df, "dftab")

Now we create a new dataframe df3 from the existing one, df, and apply the colsInt function to the employee column.

df3 = spark.sql("select sales, employee, ID, colsInt(employee) as iemployee from dftab")

Here are the results:

df3.show()

+-----+--------+---+----------+
|sales|employee| ID| iemployee|
+-----+--------+---+----------+
| 10.2|    Fred|123|1394624364|
+-----+--------+---+----------+

Approach 3: RDD Map

A dataframe does not have a map() function. If we want to use that function, we must convert the dataframe to an RDD using df.rdd.

Apply the function like this:

rdd = df.rdd.map(toIntEmployee)

This passes a row object to the function toIntEmployee. So, we have to return a row object. The RDD is immutable, so we must create a new row.

Below, we refer to the employee element in the row by name and then convert each letter in that field to an integer and concatenate those.

def toIntEmployee(rdd):
    s = rdd["employee"]
    if isinstance(s, str) == True:
        st = [str(ord(i)) for i in s]
        e = int(''.join(st)) 
    else:
        e = s
    
    return Row(rdd["sales"],rdd["employee"],rdd["ID"],e)

Now we print the results:

for x in rdd.collect():
    print(x)

<row (10.199999809265137, 'Fred', 123, 70114101100)>

Python Spark ML K-Means Example
https://www.bmc.com/blogs/python-spark-k-means-example/

In this article, we’ll show how to divide data into distinct groups, called ‘clusters’, using Apache Spark and the Spark ML K-Means algorithm. This approach works with any kind of data that you want to divide according to some common characteristics.

This data shows medical patients, some with heart disease and some without it. We have already looked at this data using logistic regression. The classification will not tell us which have heart disease (that’s what logistic regression did in the previous post), but you can logically deduce that one set of patients is sick and the other is not, since the indicators of health are the input data.

(This tutorial is part of our Apache Spark Guide. Use the right-hand menu to navigate.)

Understanding the Spark ML K-Means algorithm

Classification works by finding coordinates in n-dimensional space that most nearly separates this data. Think of this as a plane in 3D space: on one side are data points belonging to one cluster, and the others are on the other side.

In this example, we have 12 data features (dimensions). You will see that the plane has the coordinates shown below. It’s not a plane per se, because there are more than three dimensions, but you could call it a hyperplane. (Still, the mental image of a plane is close enough to explain the example.)

Below, we divide the data into two clusters, which explains why we have two sets of coordinates. Those are the centroids, or the central point of each cluster.

If we try to divide it into a higher number of clusters, that becomes less accurate, as the data is not spaced sufficiently apart.

[5.68235294e+01 5.68627451e-01 3.25490196e+00 1.35803922e+02
 3.00078431e+02 1.47058824e-01 1.18627451e+00 1.45225490e+02
 4.01960784e-01 1.09705882e+00 1.59803922e+00 8.23529412e-01]
[5.29821429e+01 7.44047619e-01 3.12500000e+00 1.28636905e+02
 2.19047619e+02 1.48809524e-01 9.22619048e-01 1.52380952e+02
 2.85714286e-01 1.02142857e+00 1.57738095e+00 5.77380952e-01]

Here is the code stored as a Zeppelin notebook. (The related post, in this Guide, explains the data field and how to load the Spark dataframe.) The data is the same, except we add a patient ID column.

%spark.pyspark

def isSick(x):
    if x in (3,7):
        return 0
    else:
        return 1
        
        
import pandas as pd
import numpy as np
from pyspark.sql.types import StructType, StructField, NumericType

 
cols = ('age',       
      'sex',         
       'chest pain',           
       'resting blood pressure',    
       'serum cholestoral',       
       'fasting blood sugar',         
       'resting electrocardiographic results', 
       'maximum heart rate achieved',  
       'exercise induced angina',     
       'ST depression induced by exercise relative to rest',  
      'the slope of the peak exercise ST segment',     
      'number of major vessels ',       
       'thal',  
       'last')
      

      
data = pd.read_csv('/home/ubuntu/Downloads/heart.csv', delimiter=' ', names=cols)

data['isSick'] = data['thal'].apply(isSick)

rowCount = data['age'].count()
ids = np.arange(1,rowCount+1,1)
data['id'] = ids 


df = spark.createDataFrame(data)


from pyspark.ml.feature import VectorAssembler

features =   ('age',       
      'sex',         
       'chest pain',           
       'resting blood pressure',    
       'serum cholestoral',       
       'fasting blood sugar',         
       'resting electrocardiographic results', 
       'maximum heart rate achieved',  
       'exercise induced angina',     
       'ST depression induced by exercise relative to rest',  
      'the slope of the peak exercise ST segment',     
      'number of major vessels ') 
      
 

assembler = VectorAssembler(inputCols=features,outputCol="features")

dataset=assembler.transform(df)
dataset.select("features").show(truncate=False)

This code below is taken directly from the Spark ML documentation with some modifications, because there’s only one way to use the algorithm. A couple items to note:

  • KMeans().setK(2).setSeed(1)—The number 2 is the number of clusters to divide the data into. We see that any number larger than 2 causes the ClusteringEvaluator() value to fall below 0.5, meaning it’s not a clear division. Another way to check the optimal number of clusters would be to plot an elbow curve (a sketch appears after this list).
  • predictions = model.transform(dataset)—This will add the prediction column to the dataframe, so we can show which patients qualify for which category.
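
Here is a minimal sketch of that elbow-curve check, assuming the dataset dataframe assembled above. It simply recomputes the within-set sum of squared errors for several values of k so you can see where the improvement levels off:

# Sketch: compute the within-set sum of squared errors for several k values.
# The "elbow" is where adding more clusters stops reducing the cost much.
# computeCost() is available in the Spark version used in this article;
# newer releases replace it with ClusteringEvaluator.
from pyspark.ml.clustering import KMeans

costs = []
for k in range(2, 9):
    km = KMeans().setK(k).setSeed(1)
    km_model = km.fit(dataset)
    costs.append((k, km_model.computeCost(dataset)))

for k, cost in costs:
    print(k, cost)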

The rest of the code is self-evident. The code prints the cluster centers for each division as well as the sum of squared errors. That’s a clue to how it works: it computes the distance of each data point from its guess as to the center of the cluster, adjusts the guesses, then repeats until the number reaches its minimum. The distance of each point from this central point is squared so that distance is always positive. The goal is to have the smallest number possible—the shortest distance between all the data points.

%spark.pyspark

from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.clustering import KMeans

# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Make predictions
predictions = model.transform(dataset)

# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))


# Evaluate clustering.
cost = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(cost))

# Shows the result.
print("Cluster Centers: ")
ctr=[]
centers = model.clusterCenters()
for center in centers:
    ctr.append(center)
    print(center)

Below, the evaluator reports the silhouette score computed with the squared Euclidean distance.

  • If this number is negative, the data cannot be separated at all.
  • Values closer to 1 indicate maximum separation.
  • Values close to zero mean the data could barely be separated.

In this example, 0.57 is not bad.

Silhouette with squared euclidean distance = 0.5702065126840549
Within Set Sum of Squared Errors = 548301.9823949672
Cluster Centers: 
[5.68235294e+01 5.68627451e-01 3.25490196e+00 1.35803922e+02
 3.00078431e+02 1.47058824e-01 1.18627451e+00 1.45225490e+02
 4.01960784e-01 1.09705882e+00 1.59803922e+00 8.23529412e-01]
[5.29821429e+01 7.44047619e-01 3.12500000e+00 1.28636905e+02
 2.19047619e+02 1.48809524e-01 9.22619048e-01 1.52380952e+02
 2.85714286e-01 1.02142857e+00 1.57738095e+00 5.77380952e-01]

In order to visualize the results easily, we convert the Spark dataframe to a Pandas dataframe so we can use the Zeppelin function z.show() to list the table.

%spark.pyspark
pandasDF=predictions.toPandas()
centers = pd.DataFrame(ctr,columns=features)

You cannot graph this data directly, because a 3D graph allows you to plot only three variables. So, we can’t show how the heart patients are separated, but we can put them in a tabular report using z.show() and observe the prediction column, which puts each patient in one category or the other:

%spark.pyspark
z.show(pandasDF)

As we can see, patients 90 and 91 are in cluster 1. Patient 92 is in cluster 0.

Using Python and Spark Machine Learning to Do Classification
https://www.bmc.com/blogs/python-spark-machine-learning-classification/

We’ve been writing about how to use Spark ML with the Scala programming language. But not many programmers know Scala. Python has moved ahead of Java in terms of number of users, largely based on the strength of machine learning. So, let’s turn our attention to using Spark ML with Python.

You could say that Spark is Scala-centric. Spark has both Python and Scala interfaces and command line interpreters, and Scala is the default one. The Python interpreter is called pyspark. Most of the examples given by Spark are in Scala, and in some cases no examples are given in Python.

(This tutorial is part of our Apache Spark Guide. Use the right-hand menu to navigate.)

Apache Arrow

Python is the preferred language for data science because of NumPy, Pandas, and matplotlib, which are tools that make working with arrays and drawing charts easier and can work with large arrays of data efficiently. But Spark is designed to work with enormous amounts of data, spread across a cluster. It’s good practice to use both tools, switching back and forth, perhaps, as the demand warrants it.

But as we will see, because a Spark dataframe is not the same as a Pandas dataframe, there is not 100% compatibility among all of these objects. You must convert Spark dataframes to lists, arrays, and other structures in order to plot them with matplotlib. And because you can’t slice a Spark dataframe using the familiar [:,4] notation, it takes more code to do the same operation.
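
For instance, to plot a single column with matplotlib you typically collect it out of the Spark dataframe first. Here is a minimal sketch, assuming a Spark dataframe df with an age column like the heart data used below:

# Sketch: collect one Spark dataframe column into a plain Python list for matplotlib.
# Assumes a Spark dataframe df with an "age" column.
import matplotlib.pyplot as plt

ages = [row["age"] for row in df.select("age").collect()]
plt.hist(ages)
plt.show()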

But the other issue is performance. Apache Arrow exists to efficiently convert objects in Java processes to Python processes and vice versa. Spark is written in Java and Scala; Scala rides atop Java. Python, of course, runs in a Python process.

Arrow speeds up operations such as the conversion of Spark dataframes to Pandas dataframes and column-wise operations such as .withColumn().

Spark discusses some of the issues around this, and the configuration change you need to make in Spark to take advantage of this boost in performance, in its Apache Arrow documentation.
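
As a sketch of that configuration change: the property below is the Spark 2.x name documented for PySpark, and Spark 3 renames it to spark.sql.execution.arrow.pyspark.enabled.

# Sketch: enable Arrow-based conversion between Spark and Pandas dataframes.
# Assumes an existing SparkSession named spark and a Spark dataframe df.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = df.toPandas()   # this conversion can now use Arrow under the hood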

Heart patient data

Download the data from the University of São Paulo data set, available here. If you are curious, see this discussion.

The columns are:

  1. Age
  2. Sex
  3. Chest pain type (4 values)
  4. Resting blood pressure
  5. Serum cholesterol in mg/dl
  6. Fasting blood sugar > 120 mg/dl
  7. Resting electrocardiographic results (values 0,1,2)
  8. Maximum heart rate achieved
  9. Exercise induced angina
  10. Oldpeak = ST depression induced by exercise relative to rest
  11. Slope of the peak exercise ST segment
  12. Number of major vessels (0-3) colored by fluoroscopy
  13. Thal: 3 = normal; 6 = fixed defect; 7 = reversable defect

The code, explained

The goal is to build a predictive binary logistic regression model using Spark ML and Python that predicts whether someone has a heart defect. The code below is available in a Zeppelin notebook here.

First, we read the data in and assign column names. Since the data is small, and because Pandas is easier, we read it into a Pandas dataframe. Then we convert it to a Spark dataframe with spark.createDataFrame().

You might see what I mean about the Spark dataframe lacking some of the features of Pandas. In particular we use Pandas so we can use .iloc() to take the first 13 columns and drop the last one, which seems to be noise not intended for the data.

%spark.pyspark

import pandas as pd
from pyspark.sql.types import StructType, StructField, NumericType

 
cols = ('age',       
      'sex',         
       'chest pain',           
       'resting blood pressure',    
       'serum cholesterol',       
       'fasting blood sugar',         
       'resting electrocardiographic results', 
       'maximum heart rate achieved',  
       'exercise induced angina',     
       'ST depression induced by exercise relative to rest',  
      'the slope of the peak exercise ST segment',     
      'number of major vessels ',       
       'thal',  
       'last')
      

      
data = pd.read_csv('/home/ubuntu/Downloads/heart.csv', delimiter=' ', names=cols)

data = data.iloc[:,0:13]

data['isSick'] = data['thal'].apply(isSick)

df = spark.createDataFrame(data)

The field thal indicates whether the patient has a heart problem. The numbers are as follows:

  • A value of 3 means the patient is healthy (normal).
  • A value of 6 means the patient’s health problem has been fixed.
  • A value of 7 means the patient’s health problem can be fixed.

So, write this function isSick() to flag 0 as negative and 1 as positive, because binary logistic regression requires one of two outcomes.

def isSick(x):
    if x in (3,7):
        return 0
    else:
        return 1

With machine learning and classification or regression problems we have:

  • A matrix of features, including the patient’s age, blood sugar, etc.
  • A vector of labels, which indicates whether the patient has a heart problem.

Because we are using a Zeppelin notebook, and PySpark is the Python command shell for Spark, we write %spark.pyspark at the top of each Zeppelin cell to indicate the language and interpreter we want to use.

Next, we indicate which columns in the df dataframe we want to use as features. Then we use the VectorAssembler to put all twelve of those fields into a new column called features that contains all of them as an array.

Now we create the Spark dataframe raw_data using the transform() operation and selecting only the features column.

%spark.pyspark
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import VectorAssembler

features =   ('age',       
      'sex',         
       'chest pain',           
       'resting blood pressure',    
       'serum cholesterol',
       'fasting blood sugar',         
       'resting electrocardiographic results', 
       'maximum heart rate achieved',  
       'exercise induced angina',     
       'ST depression induced by exercise relative to rest',  
      'the slope of the peak exercise ST segment',     
      'number of major vessels ') 

assembler = VectorAssembler(inputCols=features,outputCol="features")
 
raw_data=assembler.transform(df)
raw_data.select("features").show(truncate=False)

We use the Standard Scaler to put all the numbers on the same scale, which is standard practice for machine learning. This takes the observation and subtracts the mean, and then divides that by the standard deviation.

%spark.pyspark
from pyspark.ml.feature import StandardScaler

standardscaler=StandardScaler().setInputCol("features").setOutputCol("Scaled_features")
raw_data=standardscaler.fit(raw_data).transform(raw_data)
raw_data.select("features","Scaled_features").show(5)

Here is what the features data looks like now:

+--------------------------------------------------------+
|features                                                |
+--------------------------------------------------------+
|[70.0,1.0,4.0,130.0,322.0,0.0,2.0,109.0,0.0,2.4,2.0,3.0]|
|[67.0,0.0,3.0,115.0,564.0,0.0,2.0,160.0,0.0,1.6,2.0,0.0]|
|[57.0,1.0,2.0,124.0,261.0,0.0,0.0,141.0,0.0,0.3,1.0,0.0]|
|[64.0,1.0,4.0,128.0,263.0,0.0,0.0,105.0,1.0,0.2,2.0,1.0]|
|[74.0,0.0,2.0,120.0,269.0,0.0,2.0,121.0,1.0,0.2,1.0,1.0]|
|[65.0,1.0,4.0,120.0,177.0,0.0,0.0,140.0,0.0,0.4,1.0,0.0]|
|[56.0,1.0,3.0,130.0,256.0,1.0,2.0,142.0,1.0,0.6,2.0,1.0]|
|[59.0,1.0,4.0,110.0,239.0,0.0,2.0,142.0,1.0,1.2,2.0,1.0]|
|[60.0,1.0,4.0,140.0,293.0,0.0,2.0,170.0,0.0,1.2,2.0,2.0]|
|[63.0,0.0,4.0,150.0,407.0,0.0,2.0,154.0,0.0,4.0,2.0,3.0]|
|[59.0,1.0,4.0,135.0,234.0,0.0,0.0,161.0,0.0,0.5,2.0,0.0]|
|[53.0,1.0,4.0,142.0,226.0,0.0,2.0,111.0,1.0,0.0,1.0,0.0]|

As usual, we split the data into training and test datasets. We don’t have much data so we will use a 50/50 split.

%spark.pyspark
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

training, test = raw_data.randomSplit([0.5, 0.5], seed=12345) 

Now we create the logistic regression model and train it, meaning we have the model calculate the coefficients and intercept that most nearly match the results we have in the label column, isSick.

%spark.pyspark
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="isSick", featuresCol="Scaled_features",maxIter=10)
model=lr.fit(training)
predict_train=model.transform(training)
predict_test=model.transform(test)
predict_test.select("isSick","prediction").show(10)

Here we show the first few rows in side by side comparison. These are, for the most part, correct.

+------+----------+
|isSick|prediction|
+------+----------+
|     0|       0.0|
|     1|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       1.0|
|     0|       1.0|
|     0|       0.0|
+------+----------+

This shows the coefficients and intercept.

%spark.pyspark
print("Multinomial coefficients: " + str(model.coefficientMatrix))
print("Multinomial intercepts: " + str(model.interceptVector))

Here they are:

Multinomial coefficients: DenseMatrix([[-0.41550466,  1.21573123,  0.16600826,  0.36478609,  0.33716549,
              -0.020686  , -0.2092542 , -0.86514924,  0.1427418 , -0.3610241 ,
               0.57324392,  0.42563706]])
Multinomial intercepts: [-0.2767309778166021]

Now we use the Spark SQL functions module F to create a new column, correct, which is 1 when isSick equals prediction, meaning the predicted result matched the actual result.

%spark.pyspark
import pyspark.sql.functions as F
check = predict_test.withColumn('correct', F.when(F.col('isSick') == F.col('prediction'), 1).otherwise(0))
check.groupby("correct").count().show()

Here are the results:

+-------+-----+
|correct|count|
+-------+-----+
|      1|  137|
|      0|   10|
+-------+-----+

So, the accuracy is 137 / (137 + 10) = 93%.

There are other ways to show the accuracy of the model, like area under the curve. But this is the simplest to understand, unless you are an experienced data scientist and statistician. We will explain more complex ways of checking the accuracy in future articles.
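
As a preview of one of those measures, here is a minimal sketch that computes the area under the ROC curve with Spark ML’s built-in evaluator, using the predict_test dataframe from above:

%spark.pyspark
# Sketch: area under the ROC curve for the fitted logistic regression model.
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="isSick",
                                          rawPredictionCol="rawPrediction",
                                          metricName="areaUnderROC")
print("Area under ROC = " + str(evaluator.evaluate(predict_test)))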

Spark’s Machine Learning Pipeline: An Introduction
https://www.bmc.com/blogs/introduction-to-sparks-machine-learning-pipeline/

Here we explain what a Spark machine learning pipeline is. We will do this by converting existing code that we wrote, which is done in stages, to the pipeline format. This will run all the data transformation and model fit operations under the pipeline mechanism.

The existing Apache Spark ML code is explained in two blog posts: part one and part two. You are encouraged to read those first, as you will need to run that code to generate the data to feed into this program. Plus, you will understand what we have changed and thus learn the pipeline concept. (Or, if you want to take a shortcut and skip that reading, you could just use the maintenance_data.csv file as both the test and training data.)

The Spark pipeline object is org.apache.spark.ml.{Pipeline, PipelineModel}.

(This tutorial is part of our Apache Spark Guide. Use the right-hand menu to navigate.)

In general, a machine learning pipeline describes the process of writing code, releasing it to production, doing data extractions, creating training models, and tuning the algorithm. It should be a continuous process as a team works on their ML platform. For Apache Spark, however, a pipeline is an object that puts transform, evaluate, and fit steps into one object, org.apache.spark.ml.Pipeline. These steps are called a workflow. (Presumably there are some performance, distribution, or other benefits to doing this, but the Spark documentation does not spell that out.) At the very least, it mimics a conventional ML pipeline with regard to the data transformation operations.

To start, we look at the graphic supplied by Apache Spark. Basically you start with creating a dataframe then you put any transformation steps into the pipeline object plus the ML algorithm you will use.

In other words, in the graphic above, the dataframe is created by reading data from Hadoop or elsewhere, and then transform() and fit() operations are performed on it to add the feature and label columns, which is the format required by the logistic regression ML algorithm. The several discrete steps are fed into the pipeline object. Transform means to modify a dataframe, such as adding feature and label columns. Fit means to feed the dataframe into the ML algorithm and then calculate the answer, i.e., create the model. You can also run transform directly on a dataframe; in a sense, this is what the pipeline does for us.

With regards to the graphic above, the code shown below shows how that is implemented. Each of these three steps will be handled by the pipeline object in: val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

To illustrate using our own code, we rewrite the code from the blog posts mentioned above, which was two separate programs (create model and make predictions) into one program shown here.

First we have the usual imports.

import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import com.databricks.spark.csv
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{VectorAssembler, StringIndexer}
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.Row

Next, we read the dataframe in from a text file as usual, but instead of performing transform() operations on the dataframe one by one, we feed the VectorAssembler(), StringIndexer(), and LogisticRegression() into new Pipeline().setStages(Array(assembler, labelIndexer, lr)). Then we run pipeline.fit() on the original dataframe. The pipeline knows what transformations to run and in which order because we specified them in .setStages(Array(assembler, labelIndexer, lr)). At the end we have our trained model.

Again, here is the old code. After that is the new.

val df2 = assembler.transform(df)

val df3 = labelIndexer.fit(df2).transform(df2)

val model = new LogisticRegression().fit(df3)

New code:

var file = "hdfs://localhost:9000/maintenance/maintenance_data.csv";

val sqlContext = new SQLContext(sc)

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter",";").load(file)

val featureCols =  Array("lifetime", "pressureInd", "moistureInd", "temperatureInd")

val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")

val labelIndexer = new StringIndexer().setInputCol("broken").setOutputCol("label")

val lr = new LogisticRegression()

val pipeline = new Pipeline().setStages(Array(assembler, labelIndexer, lr))

val model = pipeline.fit(df)

Now the predictions are easy: the model already knows what transformations to run. So we just read in the test data (you created that in blog post part one) and run transform() on it. Then we filter for rows whose logistic regression prediction is > 0, i.e., 1, and print out those machines that require maintenance.

var predictFile = "hdfs://localhost:9000/maintenance/2018.05.30.09.46.55.csv"

val testdf = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter",";").load(predictFile)

val predictions =  model.transform(testdf)

predictions.select("team", "provider", "prediction").filter($"prediction" > 0).collect().foreach { case Row(team: String,provider: String, prediction: Double) =>
    println(s"($team, $provider, $prediction) --> team=$team, provider=$provider, prediction=$prediction")
  }

The results look like this:

(TeamA, Provider4, 1.0) --> team=TeamA, provider=Provider4, prediction=1.0
(TeamC, Provider2, 1.0) --> team=TeamC, provider=Provider2, prediction=1.0
(TeamC, Provider4, 1.0) --> team=TeamC, provider=Provider4, prediction=1.0
(TeamB, Provider1, 1.0) --> team=TeamB, provider=Provider1, prediction=1.0
(TeamC, Provider2, 1.0) --> team=TeamC, provider=Provider2, prediction=1.0
(TeamB, Provider2, 1.0) --> team=TeamB, provider=Provider2, prediction=1.0
(TeamA, Provider2, 1.0) --> team=TeamA, provider=Provider2, prediction=1.0

How to use Apache Spark to make predictions for preventive maintenance
https://www.bmc.com/blogs/how-to-use-apache-spark-to-make-predictions-for-preventive-maintenance/

In part one we explained how to create a training model. In this part we show how to make predictions to show which machines in our dataset should be taken out of service for maintenance.

First, here is how to submit the job to Spark with spark-submit:

  • the jar file that contains com.bmc.lr.makePrediction
  • which file to read, i.e., the one you generated in Part I
  • which model to use, i.e., the one you generated in Part I

spark-submit \
--class com.bmc.lr.makePrediction \
--master local[*] \
hdfs://localhost:9000/maintenance/lr-assembly-1.0.jar \
hdfs://localhost:9000/maintenance/2018.04.20.15.48.54.csv \
hdfs://localhost:9000/maintenance/maintenance_model

(This tutorial is part of our Apache Spark Guide. Use the right-hand menu to navigate.)

Code

Here are the usual imports and package name.

package com.bmc.lr

import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{VectorAssembler, StringIndexer}
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkContext

Read the command line arguments and set up the Spark and SQL context. Create the object makePrediction so that we can instantiate class com.bmc.lr.makePrediction from spark-submit.

object makePrediction {

def main(args: Array[String]): Unit = {
 val conf = new SparkConf().setAppName("lr")
 val sc = new SparkContext(conf)
var modelFile = args(1);
var file = args(0);

val sqlContext = new SQLContext(sc)

We use the Databricks CSV reader to read the input file and create an org.apache.spark.sql.DataFrame. Later, the Databricks package will also be useful for saving the output to Hadoop, as that requires only one line of code.

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter",";").load(file)

At this point df looks like this:

lifetime|broken| pressureInd|       moistureInd|    temperatureInd|team|  provider|
+------------------+------+------------------+------------------+--------------
|17.792044609594086| 0|104.25543343273085| 88.67373165099184|122.75111030066334|  47|Ford F-750|
| 69.39805687060903| 0| 93.34561942201603| 86.62996015487022|  84.9796059428202|  34|Ford F-750|
| 84.53664924532875| 0|110.64579687466193|125.89351825805036| 58.34915688191312|   8|Ford F-750|

org.apache.spark.ml.feature.VectorAssembler transforms the features in featureCols into a vector column. We need it in this format to plug into LogisticRegressionModel.transform(), which takes the features (lifetime, pressureInd, moistureInd, temperatureInd) and labels (broken, which we rename to label for clarity) as a step to predict which machines will fail.

Then we take the broken (1 or 0) value and put it through org.apache.spark.ml.feature.StringIndexer to produce a single label column. org.apache.spark.ml.feature.StringIndexer.fit() fits the indexer to the data, i.e., builds the mapping from the raw values to label indices.

You can see what the features vectors look like after we create df3 (see below).

val featureCols =  Array("lifetime", "pressureInd", "moistureInd", "temperatureInd")

val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")

val labelIndexer = new StringIndexer().setInputCol("broken").setOutputCol("label")

val df2 = assembler.transform(df)

val df3 = labelIndexer.fit(df2).transform(df2)

Here we show the features column (albeit it is chopped off in the display). If you recall this is like the dense vectors we have been using to plug into to training models.

df3.select("features").show()
+--------------------+
|            features|
+--------------------+
|[17.7920446095940...|
|[69.3980568706090...|
|[84.5366492453287...|
|[71.6002965259175...|
|[46.8176995523900...|

Load saved org.apache.spark.ml.classification.LogisticRegressionModel model from Hadoop file system. We trained this model in the first program.

val model = LogisticRegressionModel.load(modelFile)

Here we select just the columns that we want (getting rid of features, since it is hard to print and view and redundant too). Then we show only those machines that require maintenance, i.e., those whose predicted value of broken is 1 (true).

val predictions = model.transform(df3)

var df4 = predictions.select ("team", "provider", "pressureInd", "moistureInd", "temperatureInd", "label", "prediction")

val df5 = df4.filter("prediction=1")

df5.show()

With df5.show() we see that 4 machines need maintenance:

+----+----------+------------------+------------------+-----------------+-----+----------+
|team|  provider|       pressureInd|       moistureInd|   temperatureInd|label|prediction|
+----+----------+------------------+------------------+-----------------+-----+----------+
|  34|Ford F-750| 93.34561942201603| 86.62996015487022| 84.9796059428202|  0.0|       1.0|
|   8|Ford F-750|110.64579687466193|125.89351825805036|58.34915688191312|  0.0|       1.0|
|  83|Ford F-750| 55.77009802003853|   66.832777175712|125.2982705340028|  0.0|       1.0|
|   2|Ford F-750|  84.1763960666348|  82.1342684415311|57.73202884026434|  0.0|       1.0|
+----+----------+------------------+------------------+-----------------+-----+----------+
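
As a side note, if you run the model over data that has real labels, you can also score it with Spark's BinaryClassificationEvaluator; a minimal sketch, assuming the predictions DataFrame from above (LogisticRegressionModel.transform() adds a rawPrediction column):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
// areaUnderROC is the default metric
println("Area under ROC = " + evaluator.evaluate(predictions))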

Now we see another example of how useful the Databricks package is: it lets us write the predictions to a .csv file in Hadoop. We save the output in a file whose name is the current date and time. Note also that we import java.util.Date and java.text.SimpleDateFormat, the easiest way to produce the date format we want.

import java.util.Date
import java.text.SimpleDateFormat
val date = new Date()
var dformat:SimpleDateFormat = new SimpleDateFormat("yyyy.MM.dd.HH.mm.ss");
val csvFile = "hdfs://localhost:9000/maintenance/" + dformat.format(date) + ".csv"
df5.write.format("com.databricks.spark.csv").option("header", "true").save(csvFile)

}
}

Now we show the data:

hadoop fs -cat hdfs://localhost:9000/maintenance/2018.05.16.09.07.33.csv/part-00000-7a9a4de6-ab90-4cd4-b9f5-d0d30c881b2c-c000.csv
team,provider,pressureInd,moistureInd,temperatureInd,label,prediction
34,Ford F-750,93.34561942201603,86.62996015487022,84.9796059428202,0.0,1.0
8,Ford F-750,110.64579687466193,125.89351825805036,58.34915688191312,0.0,1.0
83,Ford F-750,55.77009802003853,66.832777175712,125.2982705340028,0.0,1.0
2,Ford F-750,84.1763960666348,82.1342684415311,57.73202884026434,0.0,1.0
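
If you later want to pull one of these prediction files back into Spark, for reporting say, the same Databricks reader works; a small sketch (the path is the example file above):

val saved = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://localhost:9000/maintenance/2018.05.16.09.07.33.csv")
saved.show()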

]]>
Predictive and Preventive Maintenance using IoT, Machine Learning & Apache Spark https://www.bmc.com/blogs/predictive-and-preventive-maintenance-using-iot-machine-learning-apache-spark/ Wed, 23 May 2018 00:00:04 +0000 http://www.bmc.com/blogs/?p=12280 Here we explain a use case of how to use Apache Spark and machine learning. This is the classic preventive maintenance problem, one of the most common business use cases of machine learning and IoT too. We take the data for this analysis from the Kaggle website, a site dedicated to data science. This is […]]]>

Here we explain a use case of how to use Apache Spark and machine learning. This is the classic preventive maintenance problem, one of the most common business use cases of machine learning and IoT too. We take the data for this analysis from the Kaggle website, a site dedicated to data science. This is sensor data from machines, specifically moisture, temperature, and pressure. The goal is to predict which machines need to be taken out of service for maintenance. The code we have written is available here.

(This tutorial is part of our Apache Spark Guide. Use the right-hand menu to navigate.)

The Data Explained

The raw data is here.

There is a downside to the data that we have here: it is run-to-failure data. The goal of preventive maintenance is not to run a machine until it breaks down. Rather, it is to keep the machine in working order.

Architecture

  • Apache Spark
  • Apache Hadoop
  • Scala
  • Spark Machine Learning API

We write three programs:

  1. create a logistic regression training model
  2. create some sample data by taking the actual data and adding noise based upon its standard deviation
  3. feed data into model to show which vehicles need maintenance

We explain the first two steps here. In a second blog post we will explain item #3.

Create a Training Model

This program reads data and saves a logistic regression model. The second program then creates data given the mean, stddev, max, and min of the variables in that training set. Then the last program runs predictions and prints out those records that are flagged with 1. With logistic regression, 1 means true, which in this example means the machine requires maintenance based upon our prediction.

build.sbt

In order to compile the Scala code below you need sbt (the Scala Build Tool) and this build.sbt file. It tells sbt which dependencies to include when it builds the jar file that we will submit to Apache Spark.

name         := "lr"
version      := "1.0"
organization := "com.bmc"
assemblyJarName in assembly := "bmclr.jar"
scalaVersion := "2.11.8"
mainClass := Some("com.bmc.lr")
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.3.0" % "provided",
  "com.databricks"   %% "spark-csv"   % "1.5.0",
  "org.apache.spark" %% "spark-sql"   % "2.3.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.3.0" % "provided"
)
resolvers += Resolver.mavenLocal

Note: if you get the error not found: value assembly on the assemblyJarName in assembly line, then you need to add the sbt-assembly plugin. In the file project/assembly.sbt add:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.8")

Run Training Model

In order to run the code below you need to have Hadoop started and then submit the job to Apache Spark like this.

The parameters are:

  • jar file containing the class com.bmc.lr.readCSV
  • location of the maintenance data .csv file
  • where to store the saved model (the file must not exist). Run hadoop fs -mkdir /maintenance to create this folder first.
spark-submit
--verbose
--class com.bmc.lr.readCSV
--master local[*]
hdfs://localhost:9000/maintenance/lr-assembly-1.0.jar
hdfs://localhost:9000/maintenance/maintenance_data.csv
hdfs://localhost:9000/maintenance/maintenance_model

Training Model Code

Now we explain the code.

First we import the Apache Spark linear algebra, machine learning, Databricks CSV, and other APIs we will need. We have to give this program a package name since Scala is compiled to Java byte code and we will make a jar file from it.

package com.bmc.lr
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import com.databricks.spark.csv
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{VectorAssembler, StringIndexer}

We make an object with a main method that we can pass arguments to. We have to create the SparkContext and SQLContext ourselves since we are not running in the command-line interpreter, where those are created for us already. For appName we can set any value we like.

object readCSV {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("lr")
val sc = new SparkContext(conf)
var file = args(0);
val sqlContext = new SQLContext(sc)

We use the Databricks package to read the .csv file into a DataFrame.

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter",";").load(file)
df.show()

Now we build the features (input variables) and the label (output variable). Since we are doing logistic regression there is only one label: broken (1 or 0).
We create a LogisticRegression estimator, fit it to the data, and finally save the resulting model to the Hadoop file system.

val featureCols =  Array("lifetime", "pressureInd", "moistureInd", "temperatureInd")
val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val labelIndexer = new StringIndexer().setInputCol("broken").setOutputCol("label")
val df2 = assembler.transform(df)
val df3 = labelIndexer.fit(df2).transform(df2)
val model = new LogisticRegression().fit(df3)
model.save(args(1))
println("Training model saved as $args(1)")
}
}
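
Before saving, you might also want a rough idea of how well the model fits. One option (not part of the original program) is to hold out part of the data and check the area under the ROC curve; a sketch you could drop in before model.save(), assuming the df3 built above (BinaryClassificationEvaluator is already imported):

val Array(train, test) = df3.randomSplit(Array(0.8, 0.2), seed = 42)
val heldOutModel = new LogisticRegression().fit(train)
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
println("Test area under ROC = " + evaluator.evaluate(heldOutModel.transform(test)))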

Generate Data

We use the next program to create sample data, drawing on the original data file and generating a range of values based on the max, min, and standard deviation of each column.

The arguments to this program are:

  • how many records to create
  • the location of the Hadoop core-site.xml file
  • input data file

We do not pass an output file on the command line; instead the program builds a date-stamped output path under /maintenance itself, as shown below. After the code we explain how to view the data.

spark-submit
--class com.bmc.lr.generateData
--master local[*]
hdfs://localhost:9000/maintenance/lr-assembly-1.0.jar
1000
/usr/local/sbin/hadoop-3.1.0/etc/hadoop/core-site.xml
hdfs://localhost:9000/maintenance/maintenance_data.csv

When the program runs, stdout looks something like this. If you generate 1,000 records it will take a few minutes to run.

2018-05-15 09:25:46 INFO  MemoryStore:54 - Block broadcast_15_piece0 stored as bytes in memory (estimated size 6.6 KB, free 364.6 MB)
2018-05-15 09:25:46 INFO  BlockManagerInfo:54 - Added broadcast_15_piece0 in memory on ip-172-31-13-71.eu-west-1.compute.internal:35220 (size: 6.6 KB, free: 366.1 MB)
2018-05-15 09:25:46 INFO  SparkContext:54 - Created broadcast 15 from broadcast at DAGScheduler.scala:1039
2018-05-15 09:25:46 INFO  DAGScheduler:54 - Submitting 1 missing tasks from ResultStage 9 (MapPartitionsRDD[28] at describe at generateData.scala:76) (first 15 tasks are for partitions Vector(0))
2018-05-15 09:25:46 INFO  TaskSchedulerImpl:54 - Adding task set 9.0 with 1 tasks
2018-05-15 09:25:46 INFO  TaskSetManager:54 - Starting task 0.0 in stage 9.0 (TID 9, localhost, executor driver, partition 0, ANY, 7754 bytes)
2018-05-15 09:25:47 INFO  DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 16)
2018-05-15 09:25:47 INFO  DAGScheduler:54 - Missing parents: List(ShuffleMapStage 16)
2018-05-15 09:25:47 INFO  BlockManagerInfo:54 - Removed broadcast_21_piece0 on ip-172-31-13-71.eu-west-1.compute.internal:35220 in memory (size: 6.6 KB, free: 366.1 MB)
2018-05-15 09:25:47 INFO  DAGScheduler:54 - Submitting ShuffleMapStage 16 (MapPartitionsRDD[46] at describe at generateData.scala:76), which has no missing parents
2018-05-15 09:25:47 INFO  ContextCleaner:54 - Cleaned accumulator 189
2018-05-15 09:25:47 INFO  ContextCleaner:54 - Cleaned accumulator 258
2018-05-15 09:25:47 INFO  ContextCleaner:54 - Cleaned accumulator 399
2018-05-15 09:25:47 INFO  ContextCleaner:54 - Cleaned accumulator 75
2018-05-15 09:25:47 INFO  MemoryStore:54 - Block broadcast_26_piece0 stored as bytes in memory (estimated size 9.4 KB, free 364.4 MB)
2018-05-15 09:25:47 INFO  BlockManagerInfo:54 - Added broadcast_26_piece0 in memory on ip-172-31-13-71.eu-west-1.compute.internal:35220 (size: 9.4 KB, free: 366.1 MB)
2018-05-15 09:25:47 INFO  SparkContext:54 - Created broadcast 26 from broadcast at DAGScheduler.scala:1039
2018-05-15 09:25:47 INFO  BlockManagerInfo:54 - Removed broadcast_7_piece0 on ip-172-31-13-71.eu-west-1.compute.internal:35220 in memory (size: 23.4 KB, free: 366.1 MB)
2018-05-15 09:25:47 INFO  DAGScheduler:54 - Submitting 1 missing tasks from ShuffleMapStage 16 (MapPartitionsRDD[46] at describe at generateData.scala:76) (first 15 tasks are for partitions Vector(0))
2018-05-15 09:25:47 INFO  TaskSchedulerImpl:54 - Adding task set 16.0 with 1 tasks
2018-05-15 09:25:47 INFO  TaskSetManager:54 - Starting task 0.0 in stage 16.0 (TID 16, localhost, executor driver, partition 0, ANY, 8308 bytes)
2018-05-15 09:25:47 INFO  Executor:54 - Running task 0.0 in stage 16.0 (TID 16)
2018-05-15 09:25:47 INFO  ContextCleaner:54 - Cleaned accumulator 211
2018-05-15 09:25:47 INFO  BlockManagerInfo:54 - Removed broadcast_22_piece0 on ip-172-31-13-71.eu-west-1.compute.internal:35220 in memory (size: 23.4 KB, free: 366.2 MB)
2018-05-15 09:25:47 INFO  FileScanRDD:54 - Reading File path: hdfs://localhost:9000/maintenance/maintenance_data.csv, range: 0-72679, partition values: [empty row]
2018-05-15 09:25:47 INFO  MemoryStore:54 - Block broadcast_27_piece0 stored as bytes in memory (estimated size 6.6 KB, free 365.7 MB)
2018-05-15 09:25:47 INFO  BlockManagerInfo:54 - Added broadcast_27_piece0 in memory on ip-172-31-13-71.eu-west-1.compute.internal:35220 (size: 6.6 KB, free: 366.2 MB)
2018-05-15 09:25:47 INFO  SparkContext:54 - Created broadcast 27 from broadcast at DAGScheduler.scala:1039
2018-05-15 09:25:47 INFO  DAGScheduler:54 - Submitting 1 missing tasks from ResultStage 17 (MapPartitionsRDD[48] at describe at generateData.scala:76) (first 15 tasks are for partitions Vector(0))
2018-05-15 09:25:47 INFO  TaskSchedulerImpl:54 - Adding task set 17.0 with 1 tasks
2018-05-15 09:25:47 INFO  TaskSetManager:54 - Starting task 0.0 in stage 17.0 (TID 17, localhost, executor driver, partition 0, ANY, 7754 bytes)
2018-05-15 09:25:47 INFO  Executor:54 - Running task 0.0 in stage 17.0 (TID 17)
2018-05-15 09:25:47 INFO  ShuffleBlockFetcherIterator:54 - Gett

Code

We start with the imports.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkContext
import org.apache.commons.math3.distribution.NormalDistribution
import java.io.DataOutputStream
import java.io.BufferedWriter
import org.apache.hadoop.fs.FSDataOutputStream
import java.io.OutputStreamWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.FileSystem
import java.util.Date
import java.text.SimpleDateFormat

And this must be an object as we mentioned above.

object generateData {

We use org.apache.commons.math3.distribution.NormalDistribution to generate random numbers drawn from a normal distribution based upon the mean and standard deviation of each column, bounded by that column's min and max. In other words, we need to simulate engines operating at all levels: normal, broken, and soon to require maintenance.

def generateData (mean: Double, stddev: Double, max: Double, min:Double) : Double = {
// NormalDistribution takes (mean, standard deviation); resample until the value falls inside the observed range
var x:NormalDistribution = new NormalDistribution(mean, stddev)
var y:Double = x.sample()
while( (y >= max) || (y <= min) ) {
y = x.sample()
}
return y
}
def createData(x: org.apache.spark.sql.DataFrame) : Double = {
var y:Array[org.apache.spark.sql.Row] = x.collect();
var mean:Double = y(1)(1).toString.toDouble;
var stddev:Double = y(2)(1).toString.toDouble;
var min:Double = y(3)(1).toString.toDouble;
var max:Double = y(4)(1).toString.toDouble;
return generateData(mean,stddev,max,min);
}
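
The reason createData indexes rows 1 through 4 is the fixed order in which describe() returns its summary rows; you can see it by running, for example:

df.describe("pressureInd").show()
// +-------+-----------+
// |summary|pressureInd|
// +-------+-----------+
// |  count|        ...|
// |   mean|        ...|
// | stddev|        ...|
// |    min|        ...|
// |    max|        ...|
// +-------+-----------+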

The usual main() function.

def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("lr")
val sc = new SparkContext(conf)
var records:Int = args(0).toInt;
var hdfsCoreSite = args(1)
var file = args(2)

Create a SQLContext and read the data file into a DataFrame using the Databricks package. We also define the column headings to write at the top of the output file.

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter",";").load(file)
var header = "lifetime;broken;pressureInd;moistureInd;temperatureInd;team;provider"

The simulation here is that we receive IoT (Internet of Things) data at some frequency, so we save each batch to a file under /maintenance whose name is a yyyy.MM.dd.HH.mm.ss timestamp.

Below that we write the data to the Hadoop file system.

val date = new Date()
var dformat:SimpleDateFormat = new SimpleDateFormat("yyyy.MM.dd.HH.mm.ss");
val csvFile =  "/maintenance/" + dformat.format(date) + ".csv"
println("writing to " + csvFile)
val fs = {
val conf = new Configuration()
conf.addResource(new Path(hdfsCoreSite))
FileSystem.get(conf)
}
val dataOutputStream: FSDataOutputStream = fs.create(new Path(csvFile))
val bw: BufferedWriter = new BufferedWriter(new OutputStreamWriter(dataOutputStream, "UTF-8"))
println(header)
bw.write(header + "\n")

Generate random data as described above and save it.

val r = scala.util.Random
var i:Int = 0
while (i < records ) {
var pressureInd =  createData(df.describe("pressureInd"))
var moistureInd =  createData(df.describe("moistureInd"))
var temperatureInd =  createData(df.describe("temperatureInd"))
var lifetime =  createData(df.describe("lifetime"))
var str = lifetime + ";" + "0" + ";" + pressureInd + ";" + moistureInd + ";" + temperatureInd +";" + r.nextInt(100) + ";" + "Ford F-750"
println (str)
bw.write(str + "\n")
i = i + 1
}
bw.close
}
}
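
As an aside, an alternative to writing raw HDFS streams would be to build the rows into a DataFrame and let the Databricks CSV writer save them, the same way the prediction program does. This is only a rough sketch under that assumption, reusing the names defined above (records, df, csvFile, createData, r):

import sqlContext.implicits._
// build a local collection of generated rows, then turn it into a DataFrame
val rows = (1 to records).map { _ =>
  (createData(df.describe("lifetime")), 0,
   createData(df.describe("pressureInd")),
   createData(df.describe("moistureInd")),
   createData(df.describe("temperatureInd")),
   r.nextInt(100), "Ford F-750")
}
val generated = rows.toDF("lifetime", "broken", "pressureInd", "moistureInd", "temperatureInd", "team", "provider")
// spark-csv writes a directory of part files rather than a single file
generated.write.format("com.databricks.spark.csv")
  .option("header", "true").option("delimiter", ";")
  .save(csvFile)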

View the Output Data

The output file is stored in Hadoop, so you must use Hadoop commands to view it. Remember that Hadoop is a distributed file system, so it assembles files in parts. In the example run here there is only one part, which you can see with hadoop fs -ls.

hadoop fs -ls /maintenance/2018.04.20.16.26.53.csv
Found 2 items
-rw-r--r--   3 ubuntu supergroup          0 2018-04-20 16:26 /maintenance/2018.04.20.16.26.53.csv/_SUCCESS
-rw-r--r--   3 ubuntu supergroup      17416 2018-04-20 16:26 /maintenance/2018.04.20.16.26.53.csv/part-00000-b1d37f5f-5021-4368-86fd-d941497d8b52-c000.csv

To look at this file use hadoop fs -cat.

hadoop fs -cat
/maintenance/2018.04.20.16.26.53.csv/part-00000-b1d37f5f-5021-4368-86fd-d941497d8b52-c000.csv
team,provider,pressureInd,moistureInd,temperatureInd,label,prediction
63,Ford F-750,107.60039392741436,89.98427587791616,48.217222871678814,0.0,1.0
98,Ford F-750,43.28868205264517,127.8055095809048,96.48049423573129,0.0,1.0
23,Ford F-750,122.53982028285051,127.73394439569482,98.44610180531744,0.0,1.0
81,Ford F-750,147.2665064979327,108.80626610625283,101.79608087222353,0.0,1.0
58,Ford F-750,61.40860126097286,79.78449059708598,78.90711442801762,0.0,1.0
]]>
Using Spark with Hive https://www.bmc.com/blogs/using-spark-with-hive/ Fri, 15 Sep 2017 11:00:12 +0000 http://www.bmc.com/blogs/?p=11173 Here we explain how to use Apache Spark with Hive. That means instead of Hive storing data in Hadoop it stores it in Spark. The reason people use Spark instead of Hadoop is it is an all-memory database. So Hive jobs will run much faster there. Plus it moves programmers toward using a common database […]]]>

Here we explain how to use Apache Spark with Hive. That means using Spark, instead of Hadoop MapReduce, as the execution engine for Hive queries. Because Spark processes data in memory, Hive jobs will run much faster there. Plus it moves programmers toward using a common engine if your company runs predominately Spark.

It is also possible to write programs in Spark that connect to Hive data, i.e., go in the opposite direction. But that is a less likely use case: if you are using Spark you have already bought into the notion of using RDDs and DataFrames (Spark in-memory structures) instead of Hadoop.
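
For completeness, that opposite direction looks roughly like this from a standalone Spark program; a minimal sketch using SparkSession with Hive support (the table name is just an example, and hive-site.xml must be on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-reads-hive")
  .enableHiveSupport()
  .getOrCreate()

// query an existing Hive table from Spark
spark.sql("select student, age from students").show()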

Anyway, we discuss the first option here.

(This tutorial is part of our Apache Spark Guide. Use the right-hand menu to navigate.)

Prerequisites and Installation

  • You need to install Hive.
  • Install Apache Spark from source code (we explain how below) so that you have a build of Spark without the Hive jars already included.
  • Set HIVE_HOME and SPARK_HOME accordingly.
  • Install Hadoop. We do not use it directly, except that the YARN resource scheduler and some jar files come from it, and Hadoop does not need to be running to use Spark with Hive. However, if you are running a Hive or Spark cluster, you can use Hadoop to distribute jar files to the worker nodes by copying them to HDFS (the Hadoop Distributed File System).

The instructions here are for Spark 2.2.0 and Hive 2.3.0. Just swap the directory and jar file names below to match the versions you are using. Note that when you go looking for the jar files in Spark there will in several cases be more than one copy; use the ones in the dist folder, as shown below.

First you need to download the Spark source code. Then you compile it like this:

./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

Next update /usr/share/spark/spark-2.2.0/conf/spark-env.sh and add:

export SPARK_DIST_CLASSPATH=$(hadoop classpath)

Link Jar Files

Now we make soft links to certain Spark jar files so that Hive can find them:

ln -s /usr/share/spark/spark-2.2.0/dist/jars/spark-network-common_2.11-2.2.0.jar /usr/local/hive/apache-hive-2.3.0-bin/lib/spark-network-common_2.11-2.2.0.jar
ln -s /usr/share/spark/spark-2.2.0/dist/jars/spark-core_2.11-2.2.0.jar /usr/local/hive/apache-hive-2.3.0-bin/lib/spark-core_2.11-2.2.0.jar
ln -s /usr/share/spark/spark-2.2.0/dist/jars/scala-library-2.11.8.jar /usr/local/hive/apache-hive-2.3.0-bin/lib/scala-library-2.11.8.jar

Start Spark master and worker:

Now start Spark.

$SPARK_HOME/sbin/start-all.sh

Make a directory to contain log files:

mkdir /var/log/spark

Edit $HIVE_HOME/conf/hive-site.xml:

<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<property>
<name>spark.master</name>
<value>spark://(your IP address):7077</value>
</property>
<property>
<name>spark.eventLog.enabled</name>
<value>true</value>
</property>
<property>
<name>spark.eventLog.dir</name>
<value>/var/log/spark</value>
</property>
<property>
<name>spark.executor.memory</name>
<value>2048m</value>
</property>
<property>
<name>spark.serializer</name>
<value>org.apache.spark.serializer.KryoSerializer</value>
</property>

Run Hive

Now run Hive as shown below. We are running in local mode as opposed to using the cluster. Note that we tell Hive to log errors to the console so that we can see if anything goes wrong. Also note that we use hive and not beeline, the newer Hive CLI. Hive wants its users to use Beeline, but it is not necessary. (We wrote about how to use beeline here.)

hive --hiveconf hive.root.logger=INFO,console

Edit /usr/hadoop/hadoop-2.8.1/etc/hadoop/yarn-site.xml.

<configuration>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
</configuration>

Create some Data

Now create a table and insert some data. You have to wait a couple of seconds after you type a command for it to run, since it is using Spark and YARN. Remember this is designed to run across a cluster.

create table students (student string, age int);
insert into table students values('Walker', 33);


Since it takes some time to get the job started, you have time to open the Spark URL on port 8080 to see running programs. Spark removes those when the job completes unless you run the Spark Job History Server.

Now you are done.

]]>
Graphing Spark Data with HighCharts https://www.bmc.com/blogs/graphing-spark-data-with-highcharts/ Tue, 25 Jul 2017 07:05:49 +0000 http://www.bmc.com/blogs/?p=10886 Here we look at how to use HighCharts with Spark. HighCharts is a charting framework written in JavaScript. It works with both static and streaming data. So you can make live charts with it. And their collection of charts is a beautiful set of designs, made larger by the annual competition they hold. HighCharts is […]]]>

Here we look at how to use HighCharts with Spark. HighCharts is a charting framework written in JavaScript. It works with both static and streaming data. So you can make live charts with it. And their collection of charts is a beautiful set of designs, made larger by the annual competition they hold.

HighCharts is free for non-commercial use. It is difficult to master unless you are a JavaScript programmer, so these people have written a framework around it, called spark-highcharts.

One problem with that framework is that there is hardly any documentation. All they provide beyond a couple of examples are JavaDocs. So you could bounce back and forth between that and the HighCharts documentation. If you use it, send them an email and let them know that their community of users is growing. Certainly what they and HighCharts offer are far more options than are built into Zeppelin.

You will also want to study graph styles, as knowing about the different types of charts and the concepts behind them is probably more difficult than writing code to use them. For example, do you know what a funnel series is?

(This tutorial is part of our Apache Spark Guide. Use the right-hand menu to navigate.)

HighCharts and Zeppelin

You can use HighCharts in web pages, with spark-shell, and with Zeppelin. Here we use it with Zeppelin.

You can use this Docker command to download and run a Zeppelin bundle with HighCharts already installed:

docker run -p 8080:8080 -d knockdata/zeppelin-highcharts

And use this to stop all Docker containers when done. But export your work as a JSON file first as it will be lost when you do that:

docker stop $(docker ps -aq)

The other alternative is to add these artifacts to the Spark interpreter of the Zeppelin installation you already have. What you are telling Zeppelin here is to reach out to Maven Central and download the Java code needed to make HighCharts work (plus it requires lift-json).

Create a simple chart

I am not an expert on creating aesthetically pleasing charts, yet. So below we make a simple series, i.e., a chart with an x and y axis.

The code is clear enough and is the same as we used to explain how to use Zeppelin and Spark here. We convert all the doubles to integers and then group them so that we have only a few data points with simple numbers. Otherwise the chart is too crowded and difficult to read.

import com.knockdata.spark.highcharts._
import com.knockdata.spark.highcharts.model._
import org.apache.spark.sql.types._
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset
val data = sc.parallelize(
IOUtils.toString(
new URL("https://raw.githubusercontent.com/cloudera/spark/master/mllib/data/ridge-data/lpsa.data"),
Charset.forName("utf8")).split("\n"))
val schemaStr = "a b c d e"
val fields = schemaStr.split(" ").map(fieldName => StructField(fieldName, DoubleType, nullable = true))
val schema = StructType(fields)
val parsedData = data.map( l => l.replace(",", " ").split(" "))
def toIntSql(i: Double) : Int = { i.toInt }
case class DataClass(a: Double, b: Double, c: Double, d: Double, e: Double)
var x = parsedData.map(b => DataClass( toIntSql(b(0).toDouble),  toIntSql(b(1).toDouble),  toIntSql(b(2).toDouble),  toIntSql(b(3).toDouble),  toIntSql(b(4).toDouble)))
var df = x.toDF()
df.createOrReplaceTempView("df")
var g = spark.sql("select a, b from df group by a , b")

The charting section, shown below, is fairly self-explanatory. What it shows is that spark-highcharts works with DataFrames, not SQL temporary tables. The arguments you give depend on the chart type. Like we said, the documentation is sparse, but you can look at the spark-highcharts JavaDoc for the series chart here and the definition of a series chart by HighCharts here.

highcharts(g
.series("x" -> "a", "y" -> "b")
.orderBy(col("a"))).plot()

Here is our simple chart of x and y, where x is column a in the dataframe and y is column b. Since it is a series, the x axis should be sorted.

]]>