Machine Learning – BMC Software | Blogs

Big Data vs Data Analytics vs Data Science: What’s The Difference?
https://s7280.pcdn.co/big-data-vs-analytics/ (13 Oct 2021)

Data has become the most critical factor in business today. As a result, different technologies, methodologies, and systems have been invented to process, transform, analyze, and store data in this data-driven world.

However, there is still much confusion regarding the key areas of Big Data, Data Analytics, and Data Science. In this post, we will demystify these concepts to better understand each technology and how they relate to each other.

Data TL;DR

  • Big data refers to any large and complex collection of data.
  • Data analytics is the process of extracting meaningful information from data.
  • Data science is a multidisciplinary field that aims to produce broader insights.

Each of these technologies complements one another yet can be used as separate entities. For instance, big data can be used to store large sets of data, and data analytics techniques can extract information from simpler datasets.

Read on for more detail.

What is big data?

As the name suggests, big data simply refers to extremely large data sets. This size, combined with the complexity and evolving nature of these data sets, means they surpass the capabilities of traditional data management tools. As a result, data warehouses and data lakes have emerged as the go-to solutions for handling big data, far surpassing the power of traditional databases.

Some data sets that we can consider truly big data include:

  • Stock market data
  • Social media
  • Sporting events and games
  • Scientific and research data

(Read our full primer on big data.)

Characteristics of big data


  • Volume. Big data is enormous, far surpassing the capabilities of normal data storage and processing methods. The volume of data determines if it can be categorized as big data.
  • Variety. Large data sets are not limited to a single kind of data—they include everything from tabular databases to images and audio, regardless of structure.
  • Velocity. The speed at which data is generated. In big data, new data is constantly generated and added to the data sets. This is highly prevalent when dealing with continuously evolving data sources such as social media, IoT devices, and monitoring services.
  • Veracity or variability. There will inevitably be some inconsistencies in the data sets due to the enormity and complexity of big data. Therefore, you must account for variability to properly manage and process big data.
  • Value. The usefulness of Big Data assets. The worthiness of the output of big data analysis can be subjective and is evaluated based on unique business objectives.

Types of big data

  • Structured data. Any data set that adheres to a specific structure can be called structured data. These structured data sets can be processed relatively easily compared to other data types, as users can exactly identify the structure of the data. A good example of structured data is a distributed RDBMS, which contains data in organized table structures.
  • Semi-structured data. This type of data does not adhere to a specific structure yet retains some kind of observable structure, such as a grouping or an organized hierarchy. Some examples of semi-structured data are markup languages (XML), web pages, and emails.
  • Unstructured data. This type of data consists of data that does not adhere to a schema or a preset structure. It is the most common type of data when dealing with big data—things like text, pictures, video, and audio all fall under this type.
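As a small illustration of the three types, here is how each might be loaded in Python (the file names are hypothetical):

import json
import pandas as pd

orders = pd.read_csv("orders.csv")    # structured: fixed rows and columns
with open("event.json") as f:
    event = json.load(f)              # semi-structured: nested keys, no rigid schema
with open("review.txt") as f:
    review = f.read()                 # unstructured: free text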

(Get a deeper understanding of these data types.)

Structured and unstructured data

Big data systems & tools

When it comes to managing big data, many solutions are available to store and process the data sets. Cloud providers like AWS, Azure, and GCP offer their own data warehousing and data lake implementations, such as:

  • AWS Redshift
  • GCP BigQuery
  • Azure SQL Data Warehouse
  • Azure Synapse Analytics
  • Azure Data Lake

Apart from that, there are specialized providers such as Snowflake and Databricks, and even open-source solutions like Apache Hadoop, Apache Storm, and OpenRefine, that provide robust big data solutions on any kind of hardware, including commodity hardware.

What is data analytics?

Data analytics is the process of analyzing data in order to extract meaningful information from a given data set. These analytics techniques and methods are carried out on big data in most cases, though they certainly can be applied to any data set.

(Learn more about data analysis vs data analytics.)

The primary goal of data analytics is to help individuals or organizations to make informed decisions based on patterns, behaviors, trends, preferences, or any type of meaningful data extracted from a collection of data.

For example, businesses can use analytics to identify their customer preferences, purchase habits, and market trends and then create strategies to address them and handle evolving market conditions. In a scientific sense, a medical research organization can collect data from medical trials and evaluate the effectiveness of drugs or treatments accurately by analyzing those research data.

Combining these analytics with data visualization techniques will help you get a clearer picture of the underlying data and present them more flexibly and purposefully.

Types of analytics

While there are multiple analytics methods and techniques for data analytics, there are four types that apply to any data set.

  • Descriptive. This refers to understanding what has happened in the data set. As the starting point in any analytics process, the descriptive analysis will help users understand what has happened in the past.
  • Diagnostic. The next step after descriptive is diagnostic, which builds on top of the descriptive analysis to understand why something happened. It allows users to pinpoint the root causes of past events, patterns, etc.
  • Predictive. As the name suggests, predictive analytics will predict what will happen in the future. This will combine data from descriptive and diagnostic analytics and use ML and AI techniques to predict future trends, patterns, problems, etc.
  • Prescriptive. Prescriptive analytics takes the output of predictive analytics a step further by exploring what actions to take in response to the predictions. This can be considered the most important type of analytics, as it allows users to understand predicted future events and tailor strategies to handle them effectively.
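As a small illustration, descriptive analytics can be as simple as summarizing a dataset with pandas (the sales file and its columns here are hypothetical):

import pandas as pd

# Descriptive analytics: summarize what has already happened in the data
sales = pd.read_csv("sales.csv", parse_dates=["date"])   # hypothetical columns: date, region, units, revenue
print(sales.describe())                                  # summary statistics per numeric column
print(sales.groupby("region")["revenue"].sum())          # which regions drove revenue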

Accuracy of data analytics

The most important thing to remember is that the accuracy of the analytics is based on the underlying data set. If there are inconsistencies or errors in the dataset, it will result in inefficiencies or outright incorrect analytics.

Any good analytical process will account for factors like data quality, bias, and variance in the analytical methods. Normalizing, cleansing, and transforming raw data can significantly help in this aspect.

Data analytics tools & technologies

There are both open source and commercial products for data analytics. They will range from simple analytics tools such as Microsoft Excel’s Analysis ToolPak that comes with Microsoft Office to SAP BusinessObjects suite and open source tools such as Apache Spark.

When considering cloud providers, Azure offers one of the most complete toolsets for data analytics needs, with its Azure Synapse Analytics suite, Apache Spark-based Databricks, HDInsight, Machine Learning, etc.

AWS and GCP also provide tools such as Amazon QuickSight, Amazon Kinesis, GCP Stream Analytics to cater to analytics needs.

Additionally, specialized BI tools provide powerful analytics functionality with relatively simple configurations. Examples here include Microsoft Power BI, SAS Business Intelligence, and Periscope Data. Even programming languages like Python or R can be used to create custom analytics scripts and visualizations for more targeted and advanced analytics needs.

Finally, ML libraries like TensorFlow and scikit-learn can be considered part of the data analytics toolbox—they are popular tools in the analytics process.

What is data science?

Now we have a clear understanding of big data and data analytics. So—what exactly is data science?

Unlike the first two, data science cannot be limited to a single function or field. Data science is a multidisciplinary approach that extracts information from data by combining:

  • Scientific methods
  • Maths and statistics
  • Programming
  • Advanced analytics
  • ML and AI
  • Deep learning

In data analytics, the primary focus is to gain meaningful insights from the underlying data. The scope of data science far exceeds this purpose—it deals with everything from analyzing complex data and creating new analytics algorithms and tools for data processing and purification to building powerful, useful visualizations.

Data science tools & technologies

Data science tools include programming languages like R, Python, and Julia, which can be used to create new algorithms, ML models, and AI processes for big data platforms like Apache Spark and Apache Hadoop.

Data processing and purification tools such as WinPure and Data Ladder, and data visualization tools ranging from Microsoft Power Platform, Google Data Studio, and Tableau to visualization frameworks like matplotlib and Plotly, can also be considered data science tools.

As data science covers everything related to data, any tool or technology that is used in Big Data and Data Analytics can somehow be utilized in the Data Science process.

Data is the future

Ultimately, big data, data analytics, and data science all help individuals and organizations tackle enormous data sets and extract valuable information out of them. As the importance of data grows exponentially, they will become essential components in the technological landscape.

Related reading

Top Machine Learning Frameworks To Use
https://www.bmc.com/blogs/machine-learning-ai-frameworks/ (8 Sep 2020)

There are many machine learning frameworks. Given that each takes time to learn, and given that some have a wider user base than others, which one should you use?

In this article, we take a high-level look at the major ML frameworks—and some newer ones to the scene:

  • TensorFlow
  • PyTorch
  • scikit-learn
  • Spark ML
  • Torch

What’s an ML framework?

Machine learning relies on algorithms. Unless you’re a data scientist or ML expert, these algorithms are very complicated to understand and work with.

A machine learning framework, then, simplifies machine learning algorithms. An ML framework is any tool, interface, or library that lets you develop ML models easily, without understanding the underlying algorithms.

There are a variety of machine learning frameworks, geared at different purposes. Nearly all the ML frameworks—those we discuss here and those we don’t—are written in Python. Python is the predominant machine learning programming language.
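As a quick illustration of how much a framework hides, here is a minimal scikit-learn sketch: a classifier is trained in a few lines without the user touching the underlying optimization math.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load a small built-in dataset and fit a classifier in a few lines
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict(X[:5]))   # predictions for the first five rows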

Choosing your ML tool

In picking a tool, you need to ask what your goal is: machine learning or deep learning? Deep learning has come to mean using neural networks to perform many tasks on many kinds of data:

  • Image data
  • Language data
  • Large amounts of numerical and categorical data

Using the data, it is possible to:

  • Make face-detection models
  • Manipulate images, like with deep fakes
  • Generate full-length, almost coherent articles on a given subject
  • Predict routine behavioral actions, like when a person might cancel their gym membership
  • Offer recommendations—given that you like one restaurant/movie/product, here’s another you will likely enjoy

Machine learning, on the other hand, relies on algorithms based in mathematics and statistics—not neural networks—to find patterns. Most of the tutorials, use cases, and engineering in the newer ML frameworks are targeted at training on image databases, text generation, or classification in the fastest time, using the least amount of memory, and running on both GPUs and CPUs.

Why not provide just one overarching API for all ML tasks? Say, an image classification API, and let data scientists simply drop image databases into that? Or provide that as a web service, like Google’s natural language web service.

That’s because data scientists are interested in more than just handwriting recognition for the sake of handwriting recognition. Data scientists are interested in tools that solve problems applicable to business, like linear and logistic regression, k-means clustering, and, yes, neural networks. In 2020, you have many options for these tools.

Popular machine learning frameworks

Arguably, TensorFlow, PyTorch, and scikit-learn are the most popular ML frameworks. Still, choosing which framework to use will depend on the work you’re trying to perform. Scikit-learn and Spark ML are oriented towards mathematics and statistical modeling (machine learning), while TensorFlow and PyTorch also target neural network training (deep learning).

Here’s a quick breakdown of these popular ML frameworks:

  • TensorFlow and PyTorch are direct competitors because of their similarity. They both provide a rich set of linear algebra tools, and they can run regression analysis.
  • Scikit-learn has been around a long time and would be most familiar to R programmers, but it comes with a big caveat: it is not built to run across a cluster.
  • Spark ML is built for running on a cluster, since that is what Apache Spark is all about.

Now, let’s look at some specific frameworks.

TensorFlow

TensorFlow was developed at Google Brain and then made into an open source project. TensorFlow can:

  • Perform regression, classification, neural networks, etc.
  • Run on both CPUs and GPUs

TensorFlow is among the de facto standard machine learning frameworks used today, and it is free. (Google can afford to give the library away: ML models need significant resources to run in production, and Google capitalizes on selling the cloud resources to run them.)

TensorFlow is a full-blown, ML research and production tool. It can be very complex—but it doesn’t have to be. Like an Excel spreadsheet, TensorFlow can be used simply or more expertly:

  • TF is simple enough for the basic user who wants to return a prediction on a given set of data
  • TF can also work for the advanced user who wishes to set up multiple data pipelines, transform the data to fit their model, customize all layers and parameters of their model, and train on multiple machines while maintaining privacy of the user.

TF requires an intimate understanding of NumPy arrays. TensorFlow is built around tensors, and NumPy is the Python tool for working with n-dimensional arrays (a 1-dimensional array is a vector, a 2-dimensional array is a matrix, and so forth). Instead of doing things like automatically converting arrays to one-hot vectors (a true/false representation), TensorFlow expects the data scientist to handle that task.
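For instance, converting integer labels to one-hot vectors is something you do explicitly, e.g. with tf.one_hot or plain NumPy (the labels below are made up for illustration):

import numpy as np
import tensorflow as tf

labels = np.array([0, 2, 1, 2])           # hypothetical integer class labels
one_hot = tf.one_hot(labels, depth=3)     # each label becomes a length-3 true/false vector
print(one_hot.numpy())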

But TensorFlow has a rich set of tools. For example, the activation functions for neural networks can do all the hard work of statistics. If we define deep learning as the ability to do neural networks, then TensorFlow does that. But it can also handle more everyday problems, like regression.

A simple TF ML model looks like this:

import tensorflow as tf
mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

A simple model and a more advanced model can be seen here.

PyTorch

PyTorch was developed by FAIR, Facebook AI Research. In early 2018, the FAIR team merged Caffe2, another ML framework, into PyTorch. It is the leading competitor to TensorFlow. When engineers are deciding which ML platform to use, their choice generally comes down to, “Do we use TensorFlow or PyTorch?” They each serve their purposes but are pretty interchangeable.

Like TensorFlow, PyTorch:

  • Does regression, classification, neural networks, etc.
  • Runs on both CPUs and GPUs.

PyTorch is considered more pythonic. Where TensorFlow can get a model up and running faster and with some customization, PyTorch is considered more customizable, following a more traditional object-oriented programming approach through building classes.

PyTorch has been shown to have faster training times in some benchmarks. This speed advantage is marginal for many users but can make a difference on large projects. PyTorch and TensorFlow are both in active development, so the speed comparison is likely to waver back and forth between the two.

Relative to Torch, PyTorch uses Python and has no need for Lua or the Lua Package Manager.
From Asad Mahmood, a PyTorch model looks like this:

import torch
from torch.autograd import Variable

class linearRegression(torch.nn.Module):
    def __init__(self, inputSize, outputSize):
        super(linearRegression, self).__init__()
        self.linear = torch.nn.Linear(inputSize, outputSize)

    def forward(self, x):
        out = self.linear(x)
        return out

inputDim = 1        # takes variable 'x'
outputDim = 1       # takes variable 'y'
learningRate = 0.01
epochs = 100

model = linearRegression(inputDim, outputDim)
##### For GPU #######
if torch.cuda.is_available():
    model.cuda()

criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learningRate)

# x_train and y_train are assumed to be NumPy float32 arrays of shape (n, 1)
# holding the training features and labels.
for epoch in range(epochs):
    # Convert inputs and labels to Variable (and move to GPU if available)
    if torch.cuda.is_available():
        inputs = Variable(torch.from_numpy(x_train).cuda())
        labels = Variable(torch.from_numpy(y_train).cuda())
    else:
        inputs = Variable(torch.from_numpy(x_train))
        labels = Variable(torch.from_numpy(y_train))

    # Clear gradient buffers so gradients from the previous epoch don't accumulate
    optimizer.zero_grad()

    # Get output from the model, given the inputs
    outputs = model(inputs)

    # Get loss for the predicted output
    loss = criterion(outputs, labels)

    # Get gradients with respect to the parameters
    loss.backward()

    # Update parameters
    optimizer.step()

    print('epoch {}, loss {}'.format(epoch, loss.item()))

 

scikit-learn

Sometimes, only a quick test is needed to measure the likely success of a hypothesis. Scikit-learn is an old standard of the data science world, and it is good for running quick ML model sketches to see whether a model might have some interpretability.

Scikit is another Python package that can perform many useful machine learning tasks:

  • Linear regression
  • Decision tree regressions
  • Random Forest regressions
  • K-Nearest neighbor
  • SVMs
  • Stochastic Gradient Descent models
  • And more

Scikit provides model analysis tools like the confusion matrix for assessing how well a model performed. Many times, you can start an ML job in scikit-learn and then move to another framework. For example, scikit-learn has excellent data pre-processing tools for one-hot encoding categorical data. Once the data is pre-processed through Scikit, you can move it into TensorFlow or PyTorch.

from sklearn import linear_model
from sklearn.datasets import load_diabetes

# Load the example diabetes dataset that ships with scikit-learn
diabetes_X_train, diabetes_y_train = load_diabetes(return_X_y=True)

regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
print(regr.coef_)
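And here is a minimal sketch of the pre-processing step mentioned above: one-hot encoding a categorical column before handing the data to another framework (the color data is made up for illustration):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])  # hypothetical categorical column
encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()  # each category becomes its own 0/1 column
print(encoder.categories_)
print(encoded)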

 

Spark ML

We have written at length about how to use Spark ML. As we described earlier, Spark ML can run in clusters. In other words, it can handle really large matrix multiplication by taking slices of the matrix and running that calculation on different servers. (Matrix multiplication is among the most important ML operations.) That requires a distributed architecture, so your computer does not run out of memory or run too long when working with large amounts of data.

Spark ML is complicated, but instead of having to work with NumPy arrays, it lets you work with Spark RDD data structures, which anyone using Spark in its big data role will understand. And you can use Spark ML to work with Spark SQL dataframes, which most Python programmers know. So it creates dense and sparse feature-label vectors for you, taking away some of the complexity of preparing data to feed into the ML algorithms.
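For example, a rough PySpark sketch of that DataFrame-based workflow might look like the following (the Spark session setup and toy data are assumptions for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("spark-ml-sketch").getOrCreate()

# A toy DataFrame of (label, features) rows
training = spark.createDataFrame([
    (1.0, Vectors.dense(0.0, 1.1, 0.1)),
    (0.0, Vectors.dense(2.0, 1.0, -1.0)),
    (0.0, Vectors.dense(2.0, 1.3, 1.0)),
    (1.0, Vectors.dense(0.0, 1.2, -0.5)),
], ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)
print(model.coefficients)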

In 2017, Yahoo released TensorFlowOnSpark, a library that “combines salient features from the TensorFlow deep learning framework with Apache Spark and Apache Hadoop. TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers.” The package integrates big data and machine learning into an easy-to-use ML tool for large production use cases.

The Spark ML model, written in Scala or Java, looks similar to the TensorFlow code, in that it is more declarative:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// Prepare a small training DataFrame of (label, features) rows
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

val lr = new LogisticRegression()
// We may set parameters using setter methods.
lr.setMaxIter(10)
  .setRegParam(0.01)

// Train a LogisticRegression model. This uses the parameters stored in lr.
val model1 = lr.fit(training)

 

Torch

Torch claims to be the easiest ML framework. It is an old machine learning library, first released in 2002.

With PyTorch, covered above, Python became the way to access the fundamental structures in which Torch performs its calculations. Torch itself is used through Lua, with the LuaRocks Package Manager. Torch’s relative simplicity comes from its Lua programming language interface (there are other interfaces, like QT and iPython/Jupyter, and it has a C implementation). Lua is indeed simple. There are no floats or integers, just numbers. And all objects in Lua are tables. So, it’s easy to create data structures. And it provides a rich set of easy-to-understand features to slice tables and add to them.

Like TensorFlow, the basic data element in Torch is the tensor. You create one by writing torch.Tensor. The CLI (command line interface) provides inline help and it helps with indentation. People who have used Python will be relieved, as this means you can type functions in situ without having to start over at the beginning when you make a mistake. And for those who like complexity and sparse code, Torch supports functional programming.

New ML framework types

The machine learning world is rich with libraries. There are high-level libraries which use some of these previously mentioned libraries as their base in order to make machine learning easier for the data scientist.

huggingface.co

One of the top machine learning libraries is huggingface.co’s Transformers, which provides good base models for researchers, built on top of TensorFlow and PyTorch. It adapts complicated models, such as GPT-2, to work easily on your machine.

Keras

Keras is a neural network library built on top of TensorFlow to make ML modelling straightforward. It simplifies some of the coding steps, like offering all-in-one models, and the same Keras code can run on either a CPU or a GPU.

Keras isn’t limited to TensorFlow, though it’s most commonly used there. You can also use Keras with:

  • Microsoft Cognitive Toolkit (CNTK)
  • R
  • Theano
  • PlaidML

Additional resources

For more on machine learning, explore the BMC Machine Learning & Big Data Blog and these resources:

Tuning Machine Language Models for Accuracy
https://www.bmc.com/blogs/tuning-machine-language-models-for-accuracy/ (20 Jul 2018)

Continuing with our explanations of how to measure the accuracy of an ML model, here we discuss two metrics that you can use with classification models: accuracy and receiver operating characteristic area under curve. These are some of the metrics suitable for classification problems, such as logistic regression and neural networks. There are others that we will discuss in subsequent blog posts.

For data, we use this data set posted by an anonymous statistics professor. The Zeppelin notebook for the code shown below is stored here.

The Code

We use Pandas and scikit-learn to do the heavy lifting. We read the data into a dataframe, then take two slices: x is columns 2 through 16, and y is the column labeled ‘Buy’. Since this is a logistic regression problem, y is equal to either 1 or 0.

import pandas as pd

url = 'https://raw.githubusercontent.com/werowe/logisticRegressionBestModel/master/KidCreative.csv'

data = pd.read_csv(url, delimiter=',')

y=data['Buy']
x = data.iloc[:,2:16]

Next we use two of the classification metrics available to us: accuracy and roc_auc. We explain those below.

First, we can comment on cross-validation, used in the code below. We use model_selection.cross_val_score with cv=kfold. Basically, what this does is test predictions against observed values by looping over different divisions of the input data and taking the average of the scores. This is helpful mainly with small data sets, when you don’t have enough training data to split it into test, training, and validation sets.

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

for scoring in ["accuracy", "roc_auc"]:
    seed = 7
    # Newer scikit-learn versions require shuffle=True whenever random_state is set
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    model = LogisticRegression()
    results = model_selection.cross_val_score(model, x, y, cv=kfold, scoring=scoring)
    print("Model", scoring, " mean=", results.mean(), "stddev=", results.std())

Results in:

Model accuracy mean= 0.8886084284460052 stddev= 0.03322328503979156
Model roc_auc mean= 0.9185419071788103 stddev= 0.05710985874305497

And the individual scores:

print ("scores", results)
Model accuracy  mean= 0.8886084284460052 stddev= 0.03322328503979156
scores [0.88235294 0.83823529 0.91176471 0.89552239 0.92537313 0.91044776
 0.92537313 0.82089552 0.89552239 0.88059701]

Model roc_auc  mean= 0.9185419071788103 stddev= 0.05710985874305497
scores [0.93459119 0.92618224 0.95555556 0.94871795 0.94242424 0.89298246
 0.93874644 0.75471698 0.93993506 0.95156695]

According to Wikipedia, accuracy and precision are defined as follows: “In simplest terms, given a set of data points from repeated measurements of the same quantity, the set can be said to be precise if the values are close to each other, while the set can be said to be accurate if their average is close to the true value of the quantity being measured.”

Then the ROC: “A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings”

In other words, a threshold is set and then the true positive and false positive rates are calculated.

We calculate that as shown below. We run this calculation on the training data. In other words, we feed the actual y values and the predicted ones, model.predict(x), into roc_curve().

model.fit(x, y)
predict = model.predict(x)
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y, predict)
print ("fpr=", fpr)
print ("tpr=", tpr)
print ("thresholds=", thresholds)

results in:

fpr= [0.         0.05839416 1.        ]
tpr= [0.    0.664 1.   ]
thresholds= [2 1 0]

We can calculate the area under the receiver operating characteristic (ROC) curve:

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y, predict)
print(auc)

Results in:

0.8028029197080292

We can plot the ROC curve like this:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

fpr = dict()
tpr = dict()
roc_auc = dict()

n_classes = 2
for i in range(n_classes):
    # roc_curve expects the true labels first, then the predicted scores
    fpr[i], tpr[i], _ = roc_curve(y, predict)
    roc_auc[i] = auc(fpr[i], tpr[i])

plt.figure()
lw = 2
plt.plot(fpr[0], tpr[0], color='darkorange',lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[0])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

Results in this plot:

Introduction to Google Cloud Machine Learning Engine
https://www.bmc.com/blogs/google-cloud-machine-learning-engine/ (10 May 2018)

The Google Cloud Machine Learning Engine is almost exactly the same as Amazon SageMaker. It is not a SaaS program that you can just upload data to and start using like the Google Natural Language API. Instead, you have to program Google Cloud ML using any of the ML frameworks such as TensorFlow, scikit-learn, XGBoost, or Keras. Then Google spins up an environment to run the training models across its cloud. So in a word, GC ML is just a computing platform to run ML jobs. By itself it does not do ML. You have to code that.

For example, you use the command below to start or otherwise manage a job. This is the command line interface to the Google Cloud ML Engine SDK. You install that and other SDKs locally on a VM to work with the product.

gcloud ml-engine GROUP | COMMAND [GCLOUD_WIDE_FLAG …]

What these jobs do is train algorithms and then make predictions or do classification. That is what ML does. Doing this involves solving very large sets of equations at the same time, which requires lots of computing power. So what the GC ML platform does for you is provision the resources needed to do that. Amazon SageMaker uses containers. Google says they do not. So they use some other container-like approach. (They could also be providing you with access to Google TPUs, which are Google proprietary application-specific chips tuned for ML mathematics.)

In other words, most ML works on the same principle which is to iteratively look at a set of equations and then try different coefficients to minimize the error using some loss function. This requires multiplying n-dimensional matrices which is a problem that lends itself well to running across a cloud. This is because such a problem can be divided into smaller pieces.
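To make that principle concrete, here is a tiny NumPy sketch of the iterative idea: gradient descent on a made-up one-variable regression (not Google's implementation, just the general pattern).

import numpy as np

# Synthetic data: y = 3x + 2 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 100)

# Iteratively adjust the coefficients to minimize mean squared error
w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    error = (w * x + b) - y
    w -= lr * 2 * (error * x).mean()   # gradient of MSE with respect to w
    b -= lr * 2 * error.mean()         # gradient of MSE with respect to b

print(w, b)   # should end up close to 3.0 and 2.0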

GC ML creates this cloud on-the-fly for you. So you do not have to set up virtual machines and containers ahead of time by yourself. That is the value that it delivers.

It can take some time, sometimes hours or even days, to train a neural network or other ML model. But once saved, that trained model can be used to do classification or make predictions. So you do not have to run that large training job again.

Google even says it will let you import training models created on other platforms. I did not look at that part but presumably that must let you import, for example, training models saved as Hadoop Parquet files on another cloud.

Getting Started

To get started with the product you can walk through this tutorial by Google.

This example is Python code using TensorFlow. To use it you:

  • provision a Google Cloud virtual machine (so you can have some place to write the code).
  • install the Google Cloud SDKs.
  • create a model in the Google Cloud ML console.
  • run the gcloud ml-engine command (shown below) to kick off the job.
  • inspect the output.

For example, the console looks like this:

You submit a job like this:

gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $OUTPUT_PATH \
--runtime-version 1.4 \
--module-name trainer.task \
--package-path trainer/ \
--region $REGION \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
--train-steps 1000 \
--eval-steps 100 \
--verbosity DEBUG

This command-line approach is different from Amazon SageMaker which uses a graphical interface.

Pricing

Google Cloud ML is not free. However, when you first sign up you get $300 credit that is good for 1 year. Contrast that with Amazon, where you start incurring charges right away. Charges from both companies come from the provisioning of virtual machines and containers needed to run the ML jobs. So they do not charge you as you write code. They just charge you when you run it.

Required Skills

All of the ML frameworks on the Google ML cloud are written in Python. These are all open source projects.

They are all complicated and require an understanding of data science to use. That means you need to know linear algebra, advanced statistics, neural networks, gradient descent, regression, etc. So these are not tools for ordinary programmers; they are for data scientists.

What Can You Do with Google Cloud ML?

Google Cloud ML lets you solve business and scientific problems such as logistic and linear regression, classification, and neural networks. You can use these to make predictions and to classify data. For example, it could do handwriting recognition. But a more business-like problem would be something like vehicle preventive maintenance or calculating the efficacy of the advertising budget.

AWS Linear Learner: Using Amazon SageMaker for Logistic Regression
https://www.bmc.com/blogs/aws-linear-learner/ (16 Apr 2018)

In the last blog post we showed you how to use Amazon SageMaker. So read that one before you read this one, because there we show screen prints and explain how to use the graphical interface of the product, including its hosted Jupyter Notebooks feature. We also introduced the SageMaker API, which is a front end for Google TensorFlow and other open source machine learning APIs. Here we focus more on the code than on how to use the SageMaker interface.

In the last example we used k-means clustering. Here we will do logistic regression. Amazon calls their linear regression and logistic regression algorithms Linear Learner. The complete code for this blog post example is here.

We take the simplest possible example using data from Wikipedia. This is much easier than the examples provided by Amazon which use very large datasets and are geared toward handwriting recognition, etc. Most business problems are not handwriting recognition, but more everyday tasks, like preventive maintenance.

Here we only have 20 data records. That is too small to split the data into train, test, and validation data sets. So we will use the training data only and skip the validation step. Of course in a real world scenario you would want to validate how accurate your model is.

The data below shows what is the likelihood that a student will pass a certain test given how many hours they study.

Hours 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50
Pass 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1

Normally you would read the data from a .csv file. But there are so few records we can put this right into the code. So create a condaPython3 notebook in SageMaker and paste in the following code.

Below we take the grades and pass-fail and make a tuple:

study=((0.5,0),(0.75,0),(1.0,0),(1.25,0),(1.50,0),(1.75,0),(2.0,0),(2.25,1),(2.5,0),(2.75,1),(3.0,0),(3.25,1),(3.5,0),(4.0,1),(4.25,1),(4.5,1),(4.75,1),(5.0,1),(5.5,1))

Then we convert this to a numpy array. It has to be of type float32, as that is what the SageMaker Linear Learner algorithm expects.

We then take a slice and put the labels (i.e., pass-fail) into the field labels. The Linear Learner algorithm expects a features matrix and a labels vector.

import numpy as np
a = np.array(study).astype('float32')

labels = a[:,1]

In the last example we used the record_set() method to upload the data to S3. Here we use the utilities provided by Amazon to upload the training data to S3 and to set where the output data should go.

Create a bucket in S3 whose name begins with the letters sagemaker. Then Amazon will create the subfolders it needs, which in this case are sagemaker/grades and others. It is important that you create the S3 buckets in the same Amazon region as your notebook. Otherwise Amazon will throw an error saying it cannot find your data. See the note below on that.

Copy this text into a notebook cell and then run it.

import io
import os
import boto3
import sagemaker
import sagemaker.amazon.common as smac

sess = sagemaker.Session()
bucket = "sagemakerwalkerlr"
prefix = "sagemaker/grades"

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, a, labels)
buf.seek(0)

key = 'linearlearner'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))

output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('training artifacts will be uploaded to: {}'.format(output_location))

Amazon will respond:

uploaded training data location: s3://sagemakerwalkerml/sagemaker/grades/train/linearlearner
training artifacts will be uploaded to: s3://sagemakerwalkerml/sagemaker/grades/output

Below we copy the code from Amazon that tells it which Docker container to use and which version of the algorithm. Here we use version latest. Below I put Amazon region us-east-1 because this is where I created my notebook. You can look at other examples of Amazon code to get the container names for the other regions and which Docker containers to use.

 containers = {
              'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest'
              }

You can check from which region Amazon will pull the Docker image by putting this line into the notebook and looking at the output. Your Amazon S3 buckets should be in that region.

containers[boto3.Session().region_name]

Here is the output for my notebook.

'382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest'

Now we begin to set up the Estimator. Amazon will not let you use any of their smaller (i.e. less expensive) images, so here we use a virtual machine of size ml.p2.xlarge.

role = sagemaker.get_execution_role()  # the IAM role the notebook instance runs under

linear = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                       role=role,
                                       train_instance_count=1,
                                       train_instance_type='ml.p2.xlarge',
                                       output_path=output_location,
                                       sagemaker_session=sess)

Now we provide hyperparameters. There are many, like which loss function to use. Here we put only the most important ones:

  • feature_dim—the number of columns in our feature array. In this case it is 2: hours of study and pass-fail.
  • mini_batch_size—the number of records in each mini-batch. This number should be smaller than the number of records in our training set. We only have 20 records, so 4 will work.
  • predictor_type—we use binary_classifier, which means logistic regression.

When you run the fit() method Amazon will kick off this job. This will take several minutes to run.

%%time
linear.set_hyperparameters(feature_dim=2,
                           mini_batch_size=4,
                           predictor_type='binary_classifier')

linear.fit({'train': s3_train_data})

Amazon responds like this. Wait several minutes for the job to complete.

INFO:sagemaker:Creating training-job with name: linear-learner-2018-04-07-14-33-25-761

Docker entrypoint called with argument(s): train
…

===== Job Complete =====
Billable seconds: 173
CPU times: user 344 ms, sys: 32 ms, total: 376 ms
Wall time: 6min 8s

When the training model is done, deploy it to an endpoint. Remember that Amazon is charging you money now. So when you are done, delete your endpoints unless you want to keep being charged.

linear_predictor = linear.deploy(initial_instance_count=1,
                                 instance_type='ml.p2.xlarge')

Amazon responds:

INFO:sagemaker:Creating model with name: linear-learner-2018-04-07-14-40-41-204
INFO:sagemaker:Creating endpoint with name linear-learner-2018-04-07-14-33-25-761

Now copy this code. We will put just 1 record a[0] into the linear_predictor. The value is 0.5 hours, so obviously we expect this student to fail.

from sagemaker.predictor import csv_serializer, json_deserializer

linear_predictor.content_type = 'text/csv'
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer
a[0]
array([0.5, 0. ], dtype=float32)

Now we run the prediction.

result = linear_predictor.predict(a[0])
print(result)

Amazon shows us what we would expect, which is that the student is most likely to fail having studied only ½ an hour.

{'predictions': [{'score': 0.004625876434147358, 'predicted_label': 0.0}]}
Amazon SageMaker: A Hands-On Introduction
https://www.bmc.com/blogs/amazon-sagemaker/ (29 Mar 2018)

Amazon SageMaker is a managed machine learning service (MLaaS). SageMaker lets you quickly build and train machine learning models and deploy them directly into a hosted environment. In this blog post, we’ll cover how to get started and run SageMaker with examples.

One thing you will find with most of the examples written by Amazon for SageMaker is they are too complicated. Most are geared toward working with handwriting analysis etc. So here we make a new example, based upon something simpler: a spreadsheet with just 50 rows and 6 columns.

Still, SageMaker is far more complicated than Amazon Machine Learning, which we wrote about here and here. This is because SageMaker is not a plug-n-play SaaS product. You do not simply upload data and then run an algorithm and wait for the results.

Instead SageMaker is a hosted Jupyter Notebook (aka iPython) product. Plus they have taken parts of Google TensorFlow and scikit-learn ML frameworks and written the SageMaker API on top of that. This greatly simplifies TensorFlow programming.

SageMaker provides a cloud where you can run training jobs, large or small. As we show below, it automatically spins up Docker containers and runs your training model across as many CPUs and GPUs, and as much memory, as you need. So it lets you write and run ML models without having to provision EC2 virtual machines yourself to do that. It does the container orchestration for you.

What you Need to Know

In order to follow this code example, you need to understand Jupyter Notebooks and Python.

Jupyter is like a web page Python interpreter. It lets you write code and execute it in place. And it lets you draw tables and graphs. With it you can write programs, hide the code, and then let other users see the results.

Pricing

SageMaker is not free. Amazon charges you by the second. In writing this paper Amazon billed me $19.45. If I had used it within the first two months of signing up with Amazon it would have been free.

SageMaker Notebook

To get started, navigate to the Amazon AWS Console and then SageMaker from the menu below.


Then create a Notebook Instance. It will look like this:

Then you wait while it creates a Notebook. (The instance can have more than 1 notebook.)

Create a notebook. Use the Conda_Python3 Jupyter Kernel.

 

KMeans Clustering

In this example, we do KMeans clustering. That takes an unlabeled dataset and groups the records into clusters. In this example, we take crime data from the 50 American states and group those states. So we can then show which states have the worst crime. (We did this previously using Apache Spark here.) We will focus on getting the code working and not interpreting the results.

Download the data from here. And change these column headings:

,crime$cluster,Murder,Assault,UrbanPop,Rape

To something easier to read:

State,crimeCluster,Murder,Assault,UrbanPop,Rape
Alabama,4,13.2,236,58,21.2
Alaska,4,10,263,48,44.5
Arizona,

Amazon S3

You need to upload the data to S3. Set the permissions so that you can read it from SageMaker. In this example, I stored the data in the bucket crimedatawalker. Amazon S3 may then supply a URL.

Amazon will store your model and output data in S3. You need to create an S3 bucket whose name begins with sagemaker for that.
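If you prefer to script the upload instead of using the console, a minimal boto3 sketch might look like this (the bucket and file names are hypothetical):

import boto3

s3 = boto3.client("s3")

# Bucket names must be globally unique; outside us-east-1 you must also pass
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"}
s3.create_bucket(Bucket="sagemaker-crimedata-example")
s3.upload_file("crime_data.csv", "sagemaker-crimedata-example", "crime_data.csv")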

Basic Approach

Our basic approach will be to read the comma-delimited data into a Pandas dataframe. Then we create a numpy array and pass that to the SageMaker KMeans algorithm. If you have worked with TensorFlow, you will understand that SageMaker is far easier than using that directly.

We also use SageMaker APIs to create the training model and execute the jobs on the Amazon cloud.

The Code

You can download the full code from here. It is a SageMaker notebook.

Now we look at parts of the code.

The %sc line below is called Jupyter Magic. The %sc means run a shell command. In this case we download the data from S3 so that the file crime.csv can be read by the program.

%sc
!wget 'https://s3-eu-west-1.amazonaws.com/crimedatawalker/crime_data.csv'

Next we read the csv file crime_data.csv into a Pandas Dataframe. We convert the state values to numbers since numpy arrays must contain only numeric values. We will also make a cross reference so that later we can print the state name in text given the numeric code.

At the end we convert the dataframe to a numpy array of type float32. The KMeans algorithm expects the float32 format (they call it dtype).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
crime = pd.read_csv('crime_data.csv', header=0)
print(crime.head())

This subroutine converts every letter of the state name to its ASCII integer representation then adds them together.

def stateToNumber(s):
    l = 0
    for x in s:
        l = l + int(hex(ord(x)), 16)
    return l

Here we change the State column in the dataframe to its numeric representation.

xref = pd.DataFrame(crime['State'])
crime['State']=crime['State'].apply(lambda x: stateToNumber(x))
crime.head()

Now we convert the dataframe to a Numpy array:

crimeArray = crime.to_numpy().astype(np.float32)  # as_matrix() was removed in newer pandas versions

Here we give SageMaker the name of the S3 bucket where we will keep the output. The code below that is standard for any of the algorithms. It sets up a machine learning task.

Note that we used machine size ml.c4.8xlarge. Anyone familiar with Amazon virtual machine subscription fees will be alarmed as a machine of that size costs a lot to use. But Amazon will not let you use the tiny or small templates.

from sagemaker import KMeans
from sagemaker import get_execution_role
role = get_execution_role()
print(role)
bucket = "sagemakerwalkerml"
data_location = "sagemakerwalkerml"
data_location = 's3://{}/kmeans_highlevel_example/data'.format(bucket)
output_location = 's3://{}/kmeans_example/output'.format(bucket)
print('training data will be uploaded to: {}'.format(data_location))
print('training artifacts will be uploaded to: {}'.format(output_location))
kmeans = KMeans(role=role,
train_instance_count=1,
train_instance_type='ml.c4.8xlarge',
output_path=output_location,
k=10,
data_location=data_location)

Now we run this code and the Jupyter Notebook cells above it. Then we can stop there since we will now have to wait for Amazon to complete the batch job it creates for us.

Amazon creates a job in SageMaker which we can then see in the Amazon SageMaker dashboard (See below.). Also you will see that the cell in the notebook will have an asterisk (*) next to it, meaning it is busy. Just wait until the asterisk goes away. Then Amazon will update the display and you can move to the next step.

Here is what it looks like when it is done.

arn:aws:iam::782976337272:role/service-role/AmazonSageMaker-ExecutionRole-20180320T064166
training data will be uploaded to: s3://sagemakerwalkerml/kmeans_highlevel_example/data
training artifacts will be uploaded to: s3://sagemakerwalkerml/kmeans_example/output

Next we drop the State name (which has already been turned into a number) from the numpy array. The name by itself does not mean anything, so we do not want to feed it into the KMeans algorithm.

slice=crimeArray[:,1:5]

Below, the %%time magic tells Jupyter to report how long the cell takes, since this step will take some time. kmeans.fit() trains the model on the record set created from the NumPy array by kmeans.record_set().

%%time
kmeans.fit(kmeans.record_set(slice))

Amazon will respond saying it has kicked off this job. Then we wait 10 minutes or so for it to complete.

INFO:sagemaker:Creating training-job with name: kmeans-2018-03-27-08-32-53-716

The SageMaker Dashboard

Now go look at the SageMaker Dashboard and you can see the status of jobs you have kicked off. They can take some minutes to run.

When the job is done it writes this information to the SageMaker notebook. You can see it created a Docker container to run this algorithm.

Docker entrypoint called with argument(s): train
[03/27/2018 14:11:37 INFO 139623664772928] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'_num_gpus': u'auto', u'local_lloyd_num_trials': u'auto', u'_log_level': u'info', u'_kvstore': u'auto', u'local_lloyd_init_method': u'kmeans++', u'force_dense': u'true', u'epochs': u'1', u'init_method': u'random', u'local_lloyd_tol': u'0.0001', u'local_lloyd_max_iter': u'300',
u'_disable_wait_to_read': u'false', u'extra_center_factor': u'auto', u'eval_metrics': u'["msd"]', u'_num_kv_servers': u'1', u'mini_batch_size': u'5000', u'half_life_time_size': u'0', u'_num_slices': u'1'}
…
[03/27/2018 14:11:38 INFO 139623664772928] Test data was not provided.
#metrics {"Metrics": {"totaltime": {"count": 1, "max": 323.91810417175293, "sum": 323.91810417175293, "min": 323.91810417175293}, "setuptime": {"count": 1, "max": 14.310121536254883, "sum": 14.310121536254883, "min": 14.310121536254883}}, "EndTime": 1522159898.226135, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "AWS/KMeansWebscale"}, "StartTime": 1522159898.224455}
===== Job Complete =====
CPU times: user 484 ms, sys: 40 ms, total: 524 ms
Wall time: 7min 38s

Deploy the Model to Amazon SageMaker Hosting Services
Now we deploy the model to SageMaker using kmeans.deploy().

%%time
kmeans_predictor = kmeans.deploy(initial_instance_count=1,
instance_type='ml.m4.xlarge')

Amazon responds:

INFO:sagemaker:Creating model with name: kmeans-2018-03-27-09-07-32-599
INFO:sagemaker:Creating endpoint with name kmeans-2018-03-27-08-49-03-990

And we can see that the Notebook is busy because there is an asterisk next to the item in Jupyter.

Validate the model

The next step is to use the model and see how well it works. We will feed 1 record into it before we run the whole test data set against it. Here we use the same crime_data.csv data for the train and test data set. The normal approach is to split those into 70%/30%. Then you get brand new data and plug that in when you make predictions.
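For reference, a minimal sketch of that kind of split with scikit-learn, reusing the crimeArray from earlier (the 70/30 ratio is just the convention mentioned above):

from sklearn.model_selection import train_test_split

# Hold out 30% of the rows as a test set instead of reusing the training data
train_rows, test_rows = train_test_split(crimeArray, test_size=0.3, random_state=42)
print(train_rows.shape, test_rows.shape)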

First take all the rows but drop the first column.

slice=crimeArray[:,1:5]
slice.shape
slice

Now grab just one row for our initial test.

s=slice[1:2]

Now run the predict() method. Take the results and turn it into a dictionary. Then print the results.

%%time
result = kmeans_predictor.predict(s)
clusters = [r.label['closest_cluster'].float32_tensor.values[0] for r in result]
i = 0
for r in result:
    out = {
        "State": crime['State'].iloc[i],
        "StateCode": xref['State'].iloc[i],
        "closest_cluster": r.label['closest_cluster'].float32_tensor.values[0],
        "crimeCluster": crime['crimeCluster'].iloc[i],
        "Murder": crime['Murder'].iloc[i],
        "Assault": crime['Assault'].iloc[i],
        "UrbanPop": crime['UrbanPop'].iloc[i],
        "Rape": crime['Rape'].iloc[i]
    }
    print(out)
    i = i + 1

Here are the results. For that first record, it has placed it in cluster 7. It also calculated the mean squared distance, which we did not print out.

{'State': 671, 'StateCode': 'Alabama', 'closest_cluster': 7.0, 'crimeCluster': 4, 'Murder': 13.199999999999999, 'Assault': 236, 'UrbanPop': 58, 'Rape': 21.199999999999999}

And now all 50 states.

%%time
result = kmeans_predictor.predict(slice)
clusters = [r.label['closest_cluster'].float32_tensor.values[0] for r in result]
i = 0
for r in result:
    out = {
        "State": crime['State'].iloc[i],
        "StateCode": xref['State'].iloc[i],
        "closest_cluster": r.label['closest_cluster'].float32_tensor.values[0],
        "crimeCluster": crime['crimeCluster'].iloc[i],
        "Murder": crime['Murder'].iloc[i],
        "Assault": crime['Assault'].iloc[i],
        "UrbanPop": crime['UrbanPop'].iloc[i],
        "Rape": crime['Rape'].iloc[i]
    }
    print(out)
    i = i + 1
{'State': 671, 'StateCode': 'Alabama', 'closest_cluster': 1.0, 'crimeCluster': 4, 'Murder': 13.199999999999999, 'Assault': 236, 'UrbanPop': 58, 'Rape': 21.199999999999999}
{'State': 589, 'StateCode': 'Alaska', 'closest_cluster': 7.0, 'crimeCluster': 4, 'Murder': 10.0, 'Assault': 263, 'UrbanPop': 48, 'Rape': 44.5}
{'State': 724, 'StateCode': 'Arizona', 'closest_cluster': 3.0, 'crimeCluster': 4, 'Murder': 8.0999999999999996, 'Assault': 294, 'UrbanPop': 80, 'Rape': 31.0}
{'State': 820, 'StateCode': 'Arkansas', 'closest_cluster': 6.0, 'crimeCluster': 3, 'Murder': 8.8000000000000007, 'Assault': 190, 'UrbanPop': 50, 'Rape': 19.5}
{'State': 1016, 'StateCode': 'California', 'closest_cluster': 3.0, 'crimeCluster': 4, 'Murder': 9.0, 'Assault': 276, 'UrbanPop': 91, 'Rape': 40.600000000000001}
{'State': 819, 'StateCode': 'Colorado', 'closest_cluster': 6.0, 'crimeCluster': 3, 'Murder': 7.9000000000000004, 'Assault': 204, 'UrbanPop': 78, 'Rape': 38.700000000000003}

As an exercise you could run this again and drop the crimeCluster column. The data we have already includes clusters that someone else calculated. So we should get rid of that.

Note also that you cannot rerun the steps where it creates the model unless you change the data in some way, because it will say the model already exists. But you can run any of the other cells over and over, as it persists the data. For example, you could experiment with adding graphs or changing the output to a dataframe to make it easier to read.

Additional Resources


Linear Regression with Amazon AWS Machine Learning
https://www.bmc.com/blogs/linear-regression-with-amazon-aws-machine-learning/ (22 Mar 2018)

 

Here we show how to use Amazon AWS Machine Learning to do linear regression. In a previous post, we explored using Amazon AWS Machine Learning for Logistic Regression.

To review, linear regression is used to predict some value y given the values x1, x2, x3, …, xn. In other words, it finds the coefficients b1, b2, b3, …, bn plus an offset c to yield this formula:

y = b1x1 + b2x2 + b3x3 + … + bnxn + c

It uses the least squares approach to find this formula. In other words, think of all the observations x1, x2, … as points in an N-dimensional space. The fitted line y is the one that minimizes the sum of the squared differences between the observed and predicted values, so it is the line that runs most nearly down the middle of the training data. Once we know what that line looks like, we can plug any new data into the formula and make a prediction.

As always models are built like this:

  • Take an input set of data that you think is correlated, such as hours of exercise and weight reduction.
  • Split that data into a training set and a testing set. Amazon does that splitting for you.
  • Run the linear regression algorithm to find the formula for y. Amazon picks linear regression based upon the characteristics of the data; it would pick another type of regression or classification model if we supplied a data set for which that was a better fit.
  • Check how accurate the model is by taking the square root of the mean of the squared differences between the observed and predicted values (the RMSE).
  • Then take new data and apply the formula y to make a prediction (see the sketch after this list).
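To make those steps concrete, here is a small local sketch of the same workflow using scikit-learn and a made-up exercise/weight-loss data set; Amazon does the equivalent of this for you behind the scenes:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data: hours of exercise per week vs. weight reduction in kg.
data = pd.DataFrame({
    "hours":       [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "weight_loss": [0.4, 0.9, 1.6, 1.9, 2.6, 3.1, 3.4, 4.1, 4.4, 5.0],
})

# Split into training and test sets (Amazon does this split for you).
X_train, X_test, y_train, y_test = train_test_split(
    data[["hours"]], data["weight_loss"], test_size=0.3, random_state=0)

# Fit y = b1*x1 + c by least squares.
model = LinearRegression().fit(X_train, y_train)
print(model.coef_, model.intercept_)

# Apply the formula to new data to make a prediction.
print(model.predict(pd.DataFrame({"hours": [12]})))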

Get Some Data

We will use this data of student test scores from the UCI Machine Learning repository.

I copied this data into Google Sheets here so that you can more easily read it. Plus I show the training data set and the one used for prediction.

You download this data in raw format and upload it to Amazon S3. But first, we have to delete the column headings and change the semicolon (;) separators to commas (,) as shown below. We take the first 400 rows as our model training data and the last 249 for prediction. Use vi to delete the first line (the header row) from the data, since Amazon will not read the schema automatically (too bad it does not).

vi student-por.csv                           # delete the header row by hand
sed -i 's/;/,/g' student-por.csv             # change semicolons to commas
head -400 student-por.csv > grades400.csv    # training data
tail -249 student-por.csv > grades249.csv    # data to run predictions on

Now create a bucket in S3. I called it gradesml. Call yours something different, since bucket names must be unique across all of S3.

Then upload all 3 files.

Note the https link and make sure the permissions are set to read.

Give read permissions:
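If you prefer the command line to the S3 console, the bucket creation, uploads, and read permissions can be done with the AWS CLI, roughly like this (assuming the CLI is installed and configured, and remembering to pick your own unique bucket name):

aws s3 mb s3://gradesml                                      # create the bucket
aws s3 cp student-por.csv s3://gradesml/ --acl public-read   # upload with read access
aws s3 cp grades400.csv s3://gradesml/ --acl public-read
aws s3 cp grades249.csv s3://gradesml/ --acl public-read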

Click on Amazon Machine Learning and then Create New Data Source/ML Model. If you have not used ML before it will ask you to sign up. Creating and evaluating models is free; Amazon charges you for using them to make predictions, on a per-1,000-record basis.

Click create new Datasource and ML model.

Fill in the S3 location below. Notice that you do not use the URL. Instead, put the bucket name and file name:

Click verify and Grant Permissions on the screen that pops up next.

Give the data source a name, then click through the screens. It will make up field names (we don't actually care what names it uses, since we know what each column means from the original data set). It will also determine whether each value is categorical (drawn from a finite set) or just a number. What is important for you to do is to pick the target. That is the dependent value you want it to predict, i.e., y. From the input data student-por.csv pick G3, as that is the student's final grade. These grades are from the Portuguese grammar school system and 13 is the highest value.

Below, don't use student-por.csv as the input data. Instead use grades400.csv.

Now Amazon builds the model. This will take a few minutes.

While waiting, create another data source. This is not a model, so it will not ask you for a target. Use the grades249.csv file in S3, which we will use in the batch prediction step.

Now the evaluation is done. We can spot it in the list above because it is labeled as an evaluation. Click on it. We explain what it means below.

Amazon shows the RMSE, the root mean squared error. This is the square root of the mean of the squared differences between the observed and predicted values. We square the differences so that they are all positive and do not cancel each other out, multiply the sum by 1/n (where n is the sample size) to get the mean, and then take the square root to bring the error back to the original units.
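In code, that calculation is only a few lines; here is a sketch using numpy, where observed and predicted stand for the actual and predicted grades:

import numpy as np

def rmse(observed, predicted):
    # Square the differences so they cannot cancel out, average them
    # (multiply by 1/n), then take the square root.
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((observed - predicted) ** 2))

print(rmse([11, 10, 13, 15], [12, 9, 13, 14]))   # small made-up example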

If the predicted and observed values were identical, this number would be 0, so the closer to zero we get, the more accurate our model is. If the number is large, the problem is not the algorithm, it is the data, so we could not pick another algorithm to make it much better. There is really only one algorithm commonly used for linear regression, minimizing the least squares error (there are more esoteric ones). If the RMSE is large, then either the data is not correlated or, more likely, most of the data is correlated but some of it is not and is thus messing up our model. What we would do is drop some columns and rebuild our model to get a more accurate one.

So what value means the model is good? One sign of a good model is that the errors follow a normal distribution, i.e., the bell curve, centered on zero.

Put another way, click Explore Model Performance.

Look at the histogram on that screen. Numbers to the left of the dotted line are where the predicted values were less than the observed ones; numbers to the right are where they were higher. If this distribution were centered on 0, our errors would be distributed randomly, which is the ideal situation. Since it is shifted, there is something in our data that we should leave out. For example, family size might not be correlated to grades.

Above, Amazon also showed the RMSE baseline. This is what the RMSE would be for an input data set with that ideal distribution of errors.

Here we also see the limitations of doing this kind of analysis in the cloud. If we had written our own program, we could have calculated other statistics to show exactly which column was messing up our model. We could also try different algorithms to reduce the bias caused by outliers, meaning values far from the mean that distort the final results.

Run the Prediction

Now that the model is saved, we can use it to make predictions. In other words, given these students' characteristics, what are their final grades likely to be?

Select the prediction datasource you created above then select Generate Batch Predictions. Then click through the following screens.

Click review then create ML model.

Here we tell it where to save the results in S3. It will save several files there; the one we are interested in is the one with the calculated scores. Ideally it would tack the predictions onto the input data to make them easier to read, but it does not. So I have pasted the results into this spreadsheet for you, on the sheet called prediction, and added back the column headings. I also added a column to show how the MSE (mean squared error) is calculated.

 

As you can see, it saves the data in S3 in a folder called predictions.csv. In this case it put the prediction values in a file with the long name bp-ebhjggKYchO-grades249.csv.gz. You cannot view that online in S3, so download it using the URL shown below and look at it with another tool. In this case I pasted the data into Google Sheets.

Download the data like this:

wget https://s3-eu-west-1.amazonaws.com/gradesml/predictions.csv/batch-prediction/result/bp-ebhjggKYchO-grades249.csv.gz

Here is what the data looks like with the prediction added to the right to make it easy to see. Column AG is the student's actual grade, AH is the predicted value, and AI is the square of the difference. At the bottom is the MSE.
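If you would rather do that check in code than in a spreadsheet, a rough pandas sketch looks like this; the file name and column names are hypothetical stand-ins for the actual and predicted grade columns shown in the sheet:

import pandas as pd

# Hypothetical file with the actual grade (G3) and the predicted grade
# pasted side by side, as in the Google Sheet described above.
df = pd.read_csv("grades_with_predictions.csv")

df["squared_error"] = (df["G3"] - df["predicted"]) ** 2
mse = df["squared_error"].mean()
print(mse, mse ** 0.5)   # MSE and its square root, the RMSE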

]]>
Intro to Amazon Machine Learning with Logistic Regression https://www.bmc.com/blogs/intro-to-amazon-machine-learning-with-logistic-regression/ Wed, 07 Mar 2018 10:49:31 +0000 http://www.bmc.com/blogs/?p=11987 Here we look at Amazon’s Machine Learning cloud service. In this first article we will look at logistic regression. In future blog posts we will see what other algorithms it offers. Remember that logistic regression is similar to linear regression. It looks at a series of independent variables and calculates one dependant variable. If the […]]]>

Here we look at Amazon’s Machine Learning cloud service. In this first article we will look at logistic regression. In future blog posts we will see what other algorithms it offers.

Remember that logistic regression is similar to linear regression. It looks at a series of independent variables and calculates one dependent variable. If the probability of the outcome is greater than 50%, it is classified as a 1 (true); otherwise it is a 0 (false). (Amazon lets you change that threshold, which is a little strange, as 50% is the standard value used by statisticians. But you could fiddle with it nevertheless, for example when 30% should count as true in your situation.)
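As a rough illustration of what that threshold does (this is not Amazon's code, just a sketch), a logistic model squashes a linear score into a probability and then applies the cutoff:

import math

def predict_class(linear_score, threshold=0.5):
    # The sigmoid turns any score into a probability between 0 and 1;
    # the threshold then decides whether to call it true (1) or false (0).
    probability = 1.0 / (1.0 + math.exp(-linear_score))
    return 1 if probability > threshold else 0

print(predict_class(2.0))                    # probability ~0.88 -> 1
print(predict_class(-1.0))                   # probability ~0.27 -> 0
print(predict_class(-1.0, threshold=0.25))   # a lower threshold flips it to 1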


Explanation of the Process

The idea behind Amazon ML is that you can run predictive models without any programming. That is true for logistic regression, but you still need to put your data into .csv format and then upload it to Amazon S3, which is Amazon's file storage system.

Here we run logistic regression using the sample banking.csv data set provided by Amazon. The goal is to predict whether a customer is likely to buy the banking service given the attributes shown below:

{
  "version" : "1.0",
  "rowId" : null,
  "rowWeight" : null,
  "targetAttributeName" : "y",
  "dataFormat" : "CSV",
  "dataFileContainsHeader" : true,
  "attributes" : [ {
    "attributeName" : "age",
    "attributeType" : "NUMERIC"
  }, {
    "attributeName" : "job",
    "attributeType" : "CATEGORICAL"
  }, {
    "attributeName" : "marital",
    "attributeType" : "CATEGORICAL"
  }, {
    "attributeName" : "education",
    "attributeType" : "CATEGORICAL"
  }, {
    "attributeName" : "default",
    "attributeType" : "CATEGORICAL"
  }, {
    "attributeName" : "housing",
    "attributeType" : "CATEGORICAL"
  }, {
    "attributeName" : "loan",
    "attributeType" : "CATEGORICAL"
  }, {
    "attributeName" : "contact",
    "attributeType" : "CATEGORICAL"
  }, {
    "attributeName" : "month",
    "attributeType" : "CATEGORICAL"
  }, {
    "attributeName" : "day_of_week",
    "attributeType" : "CATEGORICAL"
  }, {
    "attributeName" : "duration",
    "attributeType" : "NUMERIC"
  }, {
    "attributeName" : "campaign",
    "attributeType" : "NUMERIC"
  }, {
    "attributeName" : "pdays",
    "attributeType" : "NUMERIC"
  }, {
    "attributeName" : "previous",
    "attributeType" : "NUMERIC"
  }, {
    "attributeName" : "poutcome",
    "attributeType" : "CATEGORICAL"
  }, {
    "attributeName" : "emp_var_rate",
    "attributeType" : "NUMERIC"
  }, {
    "attributeName" : "cons_price_idx",
    "attributeType" : "NUMERIC"
  }, {
    "attributeName" : "cons_conf_idx",
    "attributeType" : "NUMERIC"
  }, {
    "attributeName" : "euribor3m",
    "attributeType" : "NUMERIC"
  }, {
    "attributeName" : "nr_employed",
    "attributeType" : "NUMERIC"
  }, {
    "attributeName" : "y",
    "attributeType" : "BINARY"
  } ],
  "excludedAttributeNames" : [ ]
}

When you load this data set into ML, Amazon walks you through each field. It looks at each one and determines whether it is numeric (could be any number), categorical (a specific set of numbers or text values), or binary (y or n, or 1 or 0). The binary field answers the question of whether this customer purchased the banking product. That is the value we want to predict.

To use this, all you need to do is put your data into a spreadsheet format with the first row as column headers. Unlike writing the code yourself, where you have to convert all values to numbers, the algorithm here lets you use text or numeric values. Amazon will then take a guess as to which is the dependent variable and ask you to confirm it.

Then Amazon does what any ML programmer would do: it splits the input data set into a training data set and a test data set, using a 70/30 split, meaning 70% for training and 30% for testing. Then it evaluates the model, meaning it shows how accurately the independent variables predict the dependent one.
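If you ever want to reproduce that split yourself, it is a couple of lines with scikit-learn; this sketch assumes the banking.csv file with its y column as described above:

import pandas as pd
from sklearn.model_selection import train_test_split

banking = pd.read_csv("banking.csv")          # the sample file described above
train, test = train_test_split(banking, test_size=0.3, random_state=42)

X_train, y_train = train.drop(columns=["y"]), train["y"]
X_test, y_test = test.drop(columns=["y"]), test["y"]
print(len(train), len(test))                  # roughly a 70/30 split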

It could be that there is not much relationship at all between these variables. That would mean your assumption that this data is correlated is wrong. Of course, Amazon picked this banking data because it is correlated.

Having done the model training and evaluation, you can now use the trained model to run a prediction. In other words, you get some new data and predict whether that batch of people might buy your banking product. Here Amazon charges you; they charged me $2.90 to do this.

Getting Started

Now we show how to use the service.

First you sign in by going to the Amazon AWS Console and clicking on Machine Learning to add that service to your account. Note that this service is not free, so set up a billing alert so that you do not get charged more than you have budgeted for.

Building the Model

Here are the steps to build and use the model. They are not presented in any particular order, but do not worry, as Amazon has wizards to guide you through the process.

You can see how accurate the model is from the AUC (area under the curve). Don't worry about the exact definition; just understand that it measures how well the model's predicted values separate the observed classes. If the value were 1, the model would be perfect, and 0.936 is a very high level of correlation. A value around 0.5 means the model does no better than random guessing. In other words, that would mean whether a customer buys this banking product has nothing to do with those input values.
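If you want to compute the same number outside of Amazon, it is a single call in scikit-learn; in this sketch y_true holds the observed 0/1 labels and y_score the model's predicted probabilities:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 0, 1, 0, 1, 1]                     # observed outcomes
y_score = [0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.9, 0.7]    # predicted probabilities
print(roc_auc_score(y_true, y_score))                  # 1.0 here: every buyer ranks above every non-buyer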

The ML Dashboard

Below is my dashboard showing what I have run. It is all the same model, but each entry uses a different dataset: one is the prediction data source, and the others were generated automatically by Amazon during the training and evaluation steps.

Here is the screen to kick off the prediction step. Most people would do Generate Batch Predictions. That runs the model against data you have loaded into S3. Real-Time Predictions lets you type one record into a screen and it will run a prediction against that.

Here are the prediction results. As you can see it charged me $2.90, which is $0.10 per 1,000 predictions. It saves the results in S3, which we show below.

Load Data in S3

Amazon's banking data is already at a URL where you can use it. In order to run a prediction against Amazon's data (in real life you would gather more data about your own customers), you need to create a bucket in S3, which is like a folder. Below I create the bucket walkerbank.

Wait and Wait some More

It will take some time for your model to run as it gets in a queue behind other customers. Below you can see that this one is in a pending state.

Get the Results

Amazon saves the results in S3. You cannot really browse the results online. Instead you can download the file, unzip it, and then look at it. That is what I have done here.

Here is what Amazon has calculated. Too bad it put the results in a new file instead of appending the prediction as a new column to the input file. Below, we see the actual value (trueLabel) from the input data and the predicted value (bestAnswer) based upon the model that Amazon built.

trueLabel,bestAnswer,score
0,0,1.437033E-2
0,0,1.139906E-2
1,1,8.305257E-1
0,0,8.966137E-2
1,0,4.096018E-1
0,0,3.634616E-3
0,0,2.641097E-2
0,0,3.487612E-2
1,1,5.777377E-1
0,0,4.469287E-2
0,0,2.456573E-3
0,0,4.300581E-1
1,0,8.399929E-2
0,0,1.024602E-2
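Once you have downloaded and unzipped that file, a couple of lines of pandas will score it; the file name here is a hypothetical stand-in for whatever name Amazon generated for your batch prediction:

import pandas as pd

results = pd.read_csv("batch-prediction-results.csv")   # hypothetical name for the unzipped file
accuracy = (results["trueLabel"] == results["bestAnswer"]).mean()
print(accuracy)   # fraction of records where the prediction matched the actual value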

Next Steps

In the next blog post we will see whether Amazon can do k-means clustering, linear regression, or other types of analysis.

]]>
Amazon Machine Learning and Analytics Tools https://www.bmc.com/blogs/amazon-machine-learning-and-analytics-tools/ Fri, 02 Mar 2018 08:23:02 +0000 http://www.bmc.com/blogs/?p=11934 Here we begin our survey of Amazon AWS cloud analytics and big data tools. First we will give an overview of some of what is available. Then we will look at some of them in more detail in subsequent blog posts and provide examples of how to use them. Amazon’s approach to selling these cloud […]]]>

Here we begin our survey of Amazon AWS cloud analytics and big data tools. First we will give an overview of some of what is available. Then we will look at some of them in more detail in subsequent blog posts and provide examples of how to use them.

Amazon's approach to selling these cloud services is that these tools take some of the complexity out of developing ML predictive and classification models and neural networks. That is true, but it could also be limiting.

In other words, linear and logistic regression, and especially neural networks (used for deep learning), are not for the faint of heart. Amazon ML picks the algorithm for the user. Data scientists are used to picking that themselves and to modifying the parameters of the model. Amazon would say users do not need to do that, as its algorithms do what the data scientist does: adjust the parameters automatically to reduce the error rate (i.e., the difference between predicted and observed values) to its lowest value.

The Amazon Machine Learning and Analytics Console

When you log into the Amazon AWS console it presents this list of items. You can pick from these and add them to your account. But be careful as the meter starts running when you do that.

The Machine Learning tools at the top are mainly geared toward voice, image, and text analysis, except for the Machine Learning model. (We already wrote about how to use Google Natural Language API here.)

Voice and image recognition are not really business applications. They are services you could program yourself using neural networks, trained with publicly downloadable free datasets, or you can pay Amazon to use their cloud. But most people doing analytics in their daily jobs are not going to be interested in building their own Apple Siri or other voice recognition system.

Instead, they are more likely to benefit from less esoteric tools like Athena or QuickSight. QuickSight, in fact, is so easy that end users can use it themselves. So you could set it up and let your end users play with it, freeing up your data science resources somewhat.

Athena lets you wrap a schema around any data in Amazon S3 (i.e., one of their cloud storage products) and then run queries against that.

So let’s look at a couple of these products briefly to see which might be of use to you.

First, a word. Anyone who uses Amazon EC2 knows that subscription fees can mount quickly. Amazon says that for most of the products listed on the AWS management console there are no up-front fees and you pay as you go. But note that the tutorials are not included in the free tier pricing, so watch your subscription fees. (You can create billing alerts on your Amazon account, so here would be a good place to stop and do that.)

Amazon Product Overview

Machine Learning. ML does what a Python or Scala programmer would do with Spark ML, TensorFlow, or a similar API. To use it, you upload data to Amazon S3 or Redshift. Amazon then splits that into training (70%) and testing (30%) data sets, builds predictive models, and shows the results without requiring coding. But you do need to write a program to put your data into a format that ML can understand.
To use the Amazon ML models you create what data science programmers would call a feature-label matrix. You do this by putting the label you want to predict (i.e., the dependent variable) and the features (i.e., the independent variables) into a file. In other words, you might assume that sales are a function of price, time of year, advertising budget, and so on.
Amazon ML will then pick the best model to run predictions on your sales given the price, time of year, advertising budget, etc. Best means the model that produces the most accurate results. In data science terms, it will run logistic or linear regression against the data, but it does this automatically without you having to write any code. (See the small example below.)

Athena. This lets you wrap a schema around data loaded into Amazon S3 and run queries against it.

Elasticsearch Service. This is really just a cloud way of running Elasticsearch, so it is infrastructure rather than a product. You would probably spend less money installing your own system, as all you need are virtual machines. Besides, you will have to configure Logstash, Filebeat, and the other connectors yourself and connect them to your applications and infrastructure systems, such as web servers, application servers, firewall logs, and security detection tools.
Elasticsearch is usually called ELK, for Elasticsearch, Logstash, and Kibana, as these three products are designed to work together. They are the most popular tools for gathering log data across an enterprise.

SageMaker. SageMaker uses notebooks. iPython (now called Jupyter) and Zeppelin are notebooks that data scientists have long used. They are interactive web pages where you can write Scala, Python, or other code to query Spark and other data stores. You can then hide that code and publish the pages as live web pages that your users can view.

DeepLens. An actual physical video camera and cloud service for image recognition. It connects to SageMaker and other Amazon tools like Amazon Kinesis Video Streams.
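To make the feature-label idea in the Machine Learning entry concrete, the input file is just a flat table; in this invented example, sales is the label to predict and the other columns are the features:

price,month,advertising_budget,sales
19.99,1,5000,1200
24.99,2,7500,1350
19.99,3,4000,1100
24.99,4,9000,1500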

]]>
Google Natural Language API and Sentiment Analysis https://www.bmc.com/blogs/google-natural-language-api-and-sentiment-analysis/ Wed, 14 Feb 2018 00:00:48 +0000 http://www.bmc.com/blogs/?p=11865 Here we discuss Natural Language processing using the Google Natural Language API. Our goal is to do sentiment analysis. Definition Sentiment analysis means seeing whether what someone writes is positive or negative. Business can use this to look at Twitter, Yelp, or whoever offers a API and then change their marketing, practices, or even reach […]]]>

Here we discuss Natural Language processing using the Google Natural Language API. Our goal is to do sentiment analysis.

Definition

Sentiment analysis means determining whether what someone writes is positive or negative. Businesses can use this to look at Twitter, Yelp, or any other service that offers an API, and then adjust their marketing or practices, or even reach out to a person who has complained about their brand and offer to fix what has irked them.

To do sentiment analysis you could write your own code or use any of the many cloud APIs from different vendors and pay for the service. Writing it yourself would save you money. You just need to understand concepts like bag of words and master the NLP APIs in a deep learning ML library like Torch.

Sign up for a Google Cloud free trial here and enable the Google Natural Language Processing (NLP) API. This gives you $300 credit that you can use in 365 days. Don’t worry, they promise they will not bill your credit card without asking you first. So you should be able to use it for free for this tutorial and other tutorials that we will write on Google Cloud.

Setup

You need to install the Google cloud utilities on your system.

Download the Google cloud utilities and SDK following these instructions to set it up, but STOP at the gcloud init step since we are using the NLPTutorial, which is already partially set up.

Enable API for project

Go to the Google console and pick the NLPTutorial project, which should already be there. If it is not, you have not signed up and enabled the NLP trial.

Go to Enable APIs and Services.

Enable both the Natural Language Processing and Google Service Management APIs by typing the first few letters in this screen and picking each:

If you cannot find that screen, go to the URL directly: https://console.developers.google.com/apis/library/cloudresourcemanager.googleapis.com/?project=(the name shown as your project ID), like this:

If the API is setup correctly, the screen should look like this:

Download a service account key in JSON format following these directions.

Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to it:

export GOOGLE_APPLICATION_CREDENTIALS=/home/walker/Documents/nlp/google.json

Run Sentiment Analysis

Now we will test this by writing a complaint about a restaurant we visited. Type:

gcloud ml language analyze-sentiment --content="The service in your restaurant is terrible. My wife and I are never coming back."

As you can see the sentiment is negative (< 0). To understand how negative you need to study what magnitude and score mean.

{
  "documentSentiment": {
    "magnitude": 1.2,
    "score": -0.6
  },
  "language": "en",
  "sentences": [
    {
      "sentiment": {
        "magnitude": 0.9,
        "score": -0.9
      },
      "text": {
        "beginOffset": 0,
        "content": "The service in your restaurant is terrible."
      }
    },
    {
      "sentiment": {
        "magnitude": 0.3,
        "score": -0.3
      },
      "text": {
        "beginOffset": 45,
        "content": "My wife and I are never coming back."
      }
    }
  ]
}

Now write something positive:

gcloud ml language analyze-sentiment --content="We love your restaurant and will recommend it to all our friends."

Now the sentiment is positive (> 0).

{
  "documentSentiment": {
    "magnitude": 0.9,
    "score": 0.9
  },
  "language": "en",
  "sentences": [
    {
      "sentiment": {
        "magnitude": 0.9,
        "score": 0.9
      },
      "text": {
        "beginOffset": 0,
        "content": "We love your restaurant and will recommend it to all our friends."
      }
    }
  ]
}

Next, you could try the same thing using Python, following the instructions provided by Google.
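A minimal sketch of that Python approach, assuming you have installed the google-cloud-language client library and GOOGLE_APPLICATION_CREDENTIALS is still set, looks roughly like this:

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="The service in your restaurant is terrible.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_sentiment(request={"document": document})
# Score < 0 is negative sentiment, > 0 is positive; magnitude is its strength.
print(response.document_sentiment.score, response.document_sentiment.magnitude)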

]]>