K-means Clustering with Apache Spark


Here we show a simple example of how to use k-means clustering. We will look at crime statistics from different states in the USA to show which are the most and least dangerous. We get our data from here. The data looks like this. The columns are state, cluster, murder rate, assault, population, and rape. It already includes a cluster column … [Read more...]

Apache Spark: Working with Streams


In the last two posts, we explained how to read data streaming from Twitter into Apache Spark by way of Kafka. Here we look at a simpler example: reading a text file into Spark as a stream. We build a simple stock ticker that looks like the screen below when we run the code in Zeppelin. Working with streaming data is quite … [Read more...]

What Is “Jobs-as-Code”?


Is your organization looking to accelerate application delivery and improve application quality in order to stay competitive in today’s always-on economy? Adopting a Jobs-as-Code approach can make your application delivery and processes more agile by avoiding delivery-related rework and headaches. In addition, Jobs-as-Code … [Read more...]

Reading Streaming Twitter Feeds into Apache Spark


In Part 1 of this blog post we explained how to read tweets streaming off Twitter into Apache Kafka. Here we explain how to read that data from Kafka into Apache Spark. We split the material into two posts because this second part is considerably more complicated. Prerequisites: First install Kafka as shown in Part 1 to verify that you can … [Read more...]

Working with Streaming Twitter Data Using Kafka


Here we show how to read messages streaming from Twitter and store them in Kafka. In Part 2 we will show how to retrieve those messages from Kafka and read them into Spark Streaming. Overview: People use Twitter data for all kinds of business purposes, like monitoring brand awareness. Twitter, unlike Facebook, provides this data freely. So you … [Read more...]

Using Zeppelin with Big Data


Zeppelin is an interactive notebook. It lets you write code into a web page, execute it, and display the results in a table or graph. It does much more, too: it supports Markdown and JavaScript (Angular), so you can write code, hide it from your users, and create and share beautiful reports. And you can also create real-time reports and … [Read more...]

File Transfers: One Digital Business Challenge Solved


As your company’s business becomes more digital, you’re likely to encounter many issues. The amount and size of data will increase, and there are more applications and systems needing access to that data than ever before, all of which adds to complexity. This has the potential to make your journey to becoming a digital business more … [Read more...]

Spark Decision Tree Classifier


Here we explain how to use the Decision Tree Classifier with Apache Spark ML (machine learning). We use data from The University of Pennsylvania here and here. We write the solution in Scala code and walk the reader through each line of the code. Do not bother to read the mathematics part of the lecture notes from Penn, unless you know a lot of … [Read more...]