K-means Clustering with Apache Spark

Here we show a simple example of how to use k-means clustering. We will look at crime statistics from different states in the USA to show which are the most and least dangerous. We get our data from here. The data looks like this. The columns are state, cluster, murder rate, assault, population, and rape. It already includes a cluster column …

Apache Spark: Working with Streams

In the last two posts, we explained how to read data streaming from Twitter into Apache Spark by way of Kafka. Here we look at a simpler example: reading a text file into Spark as a stream. We make a simple stock ticker that looks like the screen below when we run the code in Zeppelin. Working with streaming data is quite …

Reading Streaming Twitter feeds into Apache Spark

In Part 1 of this blog post we explained how to read tweets streaming off Twitter into Apache Kafka. Here we explain how to read that data from Kafka into Apache Spark. We broke this document into two pieces because this second piece is considerably more complicated. Prerequisites: First install Kafka as shown in Part 1 to verify that you can …

Working with Streaming Twitter Data Using Kafka

Here we show how to read messages streaming from Twitter and store them in Kafka. In Part 2 we will show how to retrieve those messages from Kafka and read them into Spark Streaming. Overview: People use Twitter data for all kinds of business purposes, like monitoring brand awareness. Twitter, unlike Facebook, provides this data freely. So you …

Using Zeppelin with Big Data

Zeppelin is an interactive notebook. It lets you write code in a web page, execute it, and display the results in a table or graph. It also does much more, as it supports Markdown and JavaScript (Angular). So you can write code, hide it from your users, create beautiful reports, and share them. And you can also create real-time reports and …

Spark Decision Tree Classifier

Here we explain how to use the Decision Tree Classifier with Apache Spark ML (machine learning). We use data from the University of Pennsylvania here and here. We write the solution in Scala and walk the reader through each line of the code. Do not bother to read the mathematics part of the lecture notes from Penn, unless you know a lot of …

Using Logistic Regression, Scala, and Spark

Here we explain how to do logistic regression with Apache Spark. Logistic regression (LR) is closely related to linear regression. But instead of predicting a dependent value given some independent input values, it predicts a probability and a binary (yes or no) outcome. You use linear or logistic regression when you believe there is some …

SGD Linear Regression Example with Apache Spark

This article explains how to do linear regression with Apache Spark. It assumes you have some basic knowledge of linear regression. If you do not, then you need to learn about it, as it is one of the simplest ideas in statistics. Also, most machine learning models are an extension of this basic idea. It is so simple to understand and use that you …