scikit-learn Guide – BMC Software | Blogs

How to Create a Machine Learning Pipeline

In this article, I’ll show you how to create a machine learning pipeline. In this example, we’ll use the scikit-learn machine learning framework (see our scikit-learn guide or browse the topics in the right-hand menu). However, the concept of a pipeline exists for most machine learning frameworks.

To follow along, the data is available here, and the code here.

(This article is part of our scikit-learn Guide. Use the right-hand menu to navigate.)

What is a machine learning pipeline?

Generally, a machine learning pipeline describes or models your ML process: writing code, releasing it to production, extracting data, training models, and tuning the algorithm. An ML pipeline should be a continuous process as a team works on their ML platform.

Machine learning programs involve a series of steps to get the data ready before feeding it into the ML model. Those steps can include:

  • Reading the data and converting it to a Pandas dataframe
  • Dropping or adding some columns
  • Running some calculations over the columns
  • Normalizing the data

You can use the Pipeline object to do this one step after another. Let’s look at an example.

ML pipeline example using sample data

We have looked at this data from Trip Advisor before. As you can see, the data is a combination of text and numbers, but in a machine learning model all the inputs must be numbers (with some exceptions). So Step 1 of our pipeline is converting the text data to numbers.

We’ll also use the pipeline to perform Step 2: normalizing the data. That means for each data point x we calculate the new value z = (x – average) / (standard deviation). This makes all large numbers small, which is useful because ML models work best when the inputs are normalized.
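As a quick illustration of that formula (a minimal sketch on made-up numbers, separate from the Trip Advisor data):

import numpy as np

prices = np.array([120.0, 80.0, 200.0, 150.0, 90.0])   # made-up values on a large scale
z = (prices - prices.mean()) / prices.std()            # z = (x - average) / (standard deviation)
print(z)                                                # small values centered on zero

This is the same calculation that StandardScaler performs for every column later in the pipeline.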

We start by importing the data using Pandas. Then, make an array of the non-numeric columns that we will convert to numbers.

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('/home/ubuntu/Documents/TripAdvisor.csv',sep=',',header=0)

cols = ['User country',  'Period of stay', 'Traveler type', 'Pool', 'Gym',
       'Tennis court', 'Spa', 'Casino', 'Free internet', 'Hotel name',
       'User continent',
       'Review month', 'Review weekday','Hotel stars']

We do this by using the Pandas factorize() function. This takes text and converts it to categorical values, meaning it simply turns each distinct value into a unique number.
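For example, on a small made-up Series (not one of the Trip Advisor columns), factorize() maps each distinct string to an integer and also returns the list of categories:

import pandas as pd

traveler_type = pd.Series(['Couples', 'Business', 'Families', 'Business'])
encoded, categories = traveler_type.factorize()
print(encoded)     # [0 1 2 1]
print(categories)  # Index(['Couples', 'Business', 'Families'], dtype='object')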

The pipeline method works like this:

You stack up functions in the order that you want to run them. These are called transformers. Below we create a custom transformer then use one built into scikit-learn.

The format is:

You call the Pipeline constructor with tuples of (‘a descriptive name’, a function). You can pass arguments to the first function’s init() method where it says some args. Each step must implement the fit() and transform() functions; only the last step is allowed to implement just fit().

pipeline = Pipeline([ 
    ('someName', someFunction(some args)),
    ('nextName', nextFunction())
])

data = pipeline.fit_transform(dataframe)

Here is the first transformer that we call. It uses the BaseEstimator and TransformerMixin classes, which save us writing some code. For example, TransformerMixin creates a fit_transform() method for us, and BaseEstimator creates getters and setters that we can use to pass in other parameters.

We pass the columns we want to convert to numbers into the init() constructor. The X object passed to transform() contains the dataset that we feed to the pipeline: pipeline.fit_transform(dataframe).

Then we loop over each column in the cols array and change those using factorize().

Finally we return the Pandas dataframe as a NumPy array using X.values, since machine learning models work with NumPy arrays, not dataframes (in most cases).

from sklearn.base import BaseEstimator, TransformerMixin

class ToNumbers(BaseEstimator, TransformerMixin):
    def __init__(self, cols):
        self.cols = cols          # columns to convert from text to numbers
    def fit(self, X, y=None):
        return self               # nothing to learn, so just return self
    def transform(self, X):
        for c in self.cols:       # use the columns passed to the constructor
            encoded, categories = X[c].factorize()
            X[c] = encoded
        return X.values           # hand the next step a NumPy array

The second step calls StandardScaler() to normalize the values in the array. We don’t have to pass it any arguments; the pipeline feeds it the output of the previous step.

pipeline = Pipeline([ 
    ('toNumbers', ToNumbers(cols)),
    ('scaler', StandardScaler())
])

data = pipeline.fit_transform(df)

Here is the converted data as a NumPy array.

data

array([[-0.51293975, -0.49559486, -0.50236994, ...,  0.10211954,
        -1.59325501, -1.62074647],
       [-0.51293975,  0.94590453,  0.20791152, ...,  0.02768968,
        -1.59325501, -1.1089318 ],
       [-0.51293975, -0.16191445, -0.29346363, ...,  0.0152847 ,
        -1.30357228, -0.59711712],
       ...,
       [-0.51293975,  1.41305711,  0.29147405, ...,  0.04009466,
         1.30357228, -1.62074647],
       [-0.51293975, -0.5222893 , -0.41880742, ...,  0.10211954,
         1.59325501,  0.42651223],
       [-0.51293975, -0.37546991,  0.124349  , ...,  0.05249964,
         1.59325501, -0.08530245]])
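The same pattern extends naturally if you want the pipeline to end in a model rather than stop at preprocessing. The sketch below is not part of the original example: it assumes a label column, here hypothetically called 'Score', and reuses the ToNumbers transformer, the cols list, and the Pipeline and StandardScaler imports from above.

from sklearn.linear_model import LogisticRegression

# The last step may be an estimator; it only needs fit(), not transform().
model_pipeline = Pipeline([
    ('toNumbers', ToNumbers(cols)),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# 'Score' is a hypothetical label column; substitute whatever you want to predict.
model_pipeline.fit(df.drop('Score', axis=1), df['Score'])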
Outlier and Anomaly Detection with Machine Learning

In this article, we’ll explain how to do outlier detection. Outlier detection is important in a variety of scenarios: detecting credit card fraud, cyber security intrusions, or faulty equipment in systems, for example.

(This article is part of our scikit-learn Guide. Use the right-hand menu to navigate.)

The approach

The simplest approach for outlier detection is to assume a normal distribution and then set a threshold at some number of standard deviations from the mean; that number of standard deviations is called the z-score.
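Here is a minimal sketch of that simpler z-score approach on made-up one-dimensional data, flagging anything more than three standard deviations from the mean (the threshold of 3 is an arbitrary choice):

import numpy as np

rng = np.random.RandomState(0)
values = np.append(rng.normal(4, 1, 100), [12.0, -5.0])  # two obvious outliers appended

z_scores = (values - values.mean()) / values.std()
print(values[np.abs(z_scores) > 3])                      # should print the 12.0 and -5.0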

In this example, we’re using a different approach: an isolation forest. It builds decision trees that pick a feature at random and split it at a randomly chosen value, partitioning the data over ever-smaller ranges until each point is isolated. Points that end up on the shortest paths in those trees were easy to isolate, so they receive a high anomaly score and are flagged as outliers.

In the sample below we mock sample data to illustrate how to do anomaly detection using an isolation forest within the scikit-learn machine learning framework.

The code, explained

The code for this example is here.

First, we pull 100 two-dimensional samples from a normal distribution with mean 4 and standard deviation 1 to create a 100×2 matrix:

import numpy as np

rng = np.random.RandomState(20)
train = rng.normal(4,1,(100,2))

Now, train that model:

from sklearn.ensemble import IsolationForest

# behaviour='new' applies to older scikit-learn versions; newer releases drop this parameter
clf = IsolationForest(behaviour='new', max_samples=100,
                      random_state=rng, contamination='auto')

clf.fit(train)

Next, we create another sample.

typical =  rng.normal(4,1,(100,2))

If we print that, we see that the values range roughly between 2 and 6. So we create two outlier points with values of 8 and 1 in the first coordinate:

outliers = np.array([
                     [8,6],
                     [1,4]
])

Here is the complete code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(20)

train = rng.normal(4,1,(100,2))

 
typical =  rng.normal(4,1,(100,2))
 
outliers = np.array([
                     [8,6],
                     [1,4]
])
                       
 
clf = IsolationForest(behaviour='new', max_samples=100,
                      random_state=rng, contamination='auto')

clf.fit(train)

pred_train = clf.predict(train)
pred_test = clf.predict(typical)
pred_outliers = clf.predict(outliers)

Here is a sample of the typical data. The values range roughly between 1 and 6, so the mocked points that fall outside that range should be flagged as outliers by the algorithm.

typical

array([[2.49602218, 5.31396783],
       [3.96987389, 4.63501308],
       [4.50243523, 5.16196591],
       [3.69425171, 5.71228277],
       [5.169437  , 3.72418513],
       [1.48821929, 3.58272819],
       [4.54157245, 3.8548927 ],

Then we run three predictions and plot them, putting the second column of each matrix ([:,1]) on the x axis and the first column ([:,0]) on the y axis, just as an easy way to visualize the pairs. Since they are drawn from a normal distribution, they cluster around the middle.

pred_train = clf.predict(train)
pred_test = clf.predict(typical)
pred_outliers = clf.predict(outliers)
fig, axis = plt.subplots(2)

axis[0].scatter(train[:,1], train[:,0] )

axis[1].scatter(typical[:,1], typical[:,0]  , c='g'  )
axis[1].scatter(outliers[:,1], outliers[:,0] , c='r'  )

The points [8,6] and [1,4] are outliers, which is easy to see since we plot them in red.
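You can confirm that numerically as well: IsolationForest.predict() returns -1 for points the model considers outliers and 1 for inliers, so the two mocked points should come back as -1.

print(pred_outliers)             # expected: [-1 -1]
print((pred_test == 1).mean())   # fraction of the typical sample classified as inliers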

scikit-learn Classification Tutorial

Here we show how to use scikit-learn. The code for this example is here. Download the data from Kaggle here.

(This article is part of our scikit-learn Guide. Use the right-hand menu to navigate.)

Which machine learning framework should you use?

Before we show you how scikit-learn works, it’s worth discussing which ML framework to use. I put this up front because too many people starting out in data science think they must start with TensorFlow, but that is overkill for most problems. TensorFlow specializes in low-level approaches, which makes it more complicated than it needs to be; there are easier options.

  • scikit-learn. This general-purpose ML framework is both easy to use and can tackle most ML problems. It is very popular among data scientists; even those who use other frameworks often deploy scikit-learn utilities in parts of their code.
  • TensorFlow. TensorFlow is designed for one purpose: neural networks. It is very low level, which means you’ll need a lot of knowledge about NumPy arrays and neural network theory. Google sells special processors, called TPUs, which are designed to process tensors at very large scale. So you can see it’s designed for compute-intensive tasks, like facial recognition, but most business problems are less complicated than facial recognition. We’ve written tutorials on how to use ML with TensorFlow & Keras here.
  • Keras. If you want to use TensorFlow, then use Keras, as it acts as a front end and makes TensorFlow a lot easier to use. You’ll be less likely to make mistakes that produce wrong answers. Keras also works in front of other popular ML frameworks, making those easier to use as well. We explain how to use Keras here.
  • Spark ML. scikit-learn is designed to run on one server. If you have a large amount of data, you might want to use Spark ML, as it’s designed to run across a cluster. And Spark ML is easy to understand.

All the Python ML frameworks build on pretty much the same set of tools:

  • Pandas. This organizes CSV, JSON, Spark, and other types of data into rows and columns. Pandas greatly simplifies working with all types of data, but its advanced features can get complicated.
  • NumPy. This is tied closely to both Pandas and matplotlib. NumPy performs best at the most important machine learning task: the computationally expensive operation of multiplying matrices in multiple dimensions. As those grow, they can quickly run your computer out of memory.
  • Matplotlib. This draws charts, like histograms, line charts, etc. Charting is a good way to explore data while you work with it, and charts can illustrate your resulting conclusions at the end of your program.
  • Seaborn. This framework, built on top of matplotlib, is designed specifically for data science.

The data

The data we are looking at includes glucose, body mass index, and other measurements taken from two sets of people: those who are diabetic and those who are not. That classification, 1 (diabetic) or 0 (not), is in the Outcome column. The goal is to train a predictive model that, given certain health indicators, shows whether a person is likely to have or develop diabetes.

The algorithm

There are a lot of ways to approach a classification problem, like logistic regression or even neural networks. Here we use the Support Vector Machine (SVM).

The first step is to read the data into a dataframe.

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

data = pd.read_csv('/home/ubuntu/Downloads/diabetes.csv', delimiter=',')

Then, take a look at it.

data.head()

Machine learning requires that you split the data into features and labels.

  • Features are the characteristics of what you are looking at, also known as the independent variables.
  • Labels are what you are trying to predict, aka the dependent variables.

Classification means there are a finite set of outcomes. Here there are two, so you could call it a binary classification problem.

As you can see, the outcome, whether someone has diabetes or not, is the last column. So, the rest are features. It will be easy to split this data since the labels are on the end.

Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1 

The Pandas drop() command creates a new dataframe by taking an existing dataframe and dropping one or more columns. axis=1 means we are referring to the columns and not the rows (rows are what Pandas calls the index).

x = data.drop("Outcome", axis=1)

data[‘Outcome’] is a Pandas Series and not a Pandas dataframe. This means it has one column only, but it still has the index column. np.ravel() will flatten that to an array.

y=np.ravel(data['Outcome'])

The standard procedure is to take the input data and create training and test datasets by splitting them by some amount. Here we pick 50%.

  • Training datasets are used to train the model.
  • Test datasets are used to make predictions based on that trained model.

We use the names x and y because of the familiar equation for a line, y = mx + b. In machine learning, y is a vector of labels, x is a matrix of features, m is the matrix of weights (coefficients), and b is the bias term.

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.5, random_state=50)

Now we normalize the data. Basically, this calculates the value ((x – μ) / σ), where μ is the mean and σ is the standard deviation. This puts all the features on the same scale, which is standard machine learning practice. In other words, it makes large numbers small so that all the numbers are about the same size.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(x_train)

x_train = scaler.transform(x_train)

x_test = scaler.transform(x_test)

First, we declare the model. We are using a support vector machine.

from sklearn.svm import SVC
svc_model = SVC()

Then we train it: it’s that simple when you use scikit-learn. There’s no other data manipulation required.

svc_model.fit(x_train, y_train)

The fit() function responds with this information.

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Now we use the test data to create predictions.

y_predict = svc_model.predict(x_test)

Then we show how accurate those predictions are by creating what is called a confusion matrix. This is a visual way to see how many times the model was right versus how many times it was incorrect.

from sklearn.metrics import classification_report, confusion_matrix

cm = np.array(confusion_matrix(y_test, y_predict, labels=[0,1]))

confusion = pd.DataFrame(cm, index=['Not Diabetic', 'Diabetic'], columns=['Predicted Healthy', 'Predicted Diabetes'])

                        Predicted Healthy  Predicted Diabetes
Not Diabetic                  225                 23
Diabetic                       68                 68

It’s a little difficult to understand that display at first. With labels=[0,1], the first row and column refer to class 0 (not diabetic) and the second row and column to class 1 (diabetic), so read it like this:

  • Not Diabetic, Predicted Healthy (225): true negative. The patient is not diabetic and the model correctly predicted that.
  • Not Diabetic, Predicted Diabetes (23): false positive. The patient is not diabetic, but the model said the patient was diabetic.
  • Diabetic, Predicted Healthy (68): false negative. The patient is diabetic, but the model said the patient was healthy.
  • Diabetic, Predicted Diabetes (68): true positive. The patient is diabetic and the model correctly predicted that.

Here is a graphical way to show the same results using the powerful Seaborn extension to Matplotlib:

sns.heatmap(confusion,annot=True,fmt='g')

The classification report prints a summary of the model. Precision for class 0 (not diabetic) is 77%, meaning that when the model predicts a patient is healthy, it is right 77% of the time; precision for class 1 (diabetic) is 75%.

print(classification_report(y_test, y_predict))
          precision    recall  f1-score   support

           0       0.77      0.91      0.83       248
           1       0.75      0.50      0.60       136

   micro avg       0.76      0.76      0.76       384
   macro avg       0.76      0.70      0.72       384
weighted avg       0.76      0.76      0.75       384
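To see where those numbers come from, here is a short sketch that recomputes the per-class precision and recall by hand, using the counts from the confusion matrix above with class 1 (diabetic) treated as the positive class:

tn, fp, fn, tp = 225, 23, 68, 68     # counts from the confusion matrix above

precision_0 = tn / (tn + fn)         # 225 / 293 ≈ 0.77: patients predicted healthy who really are healthy
recall_0 = tn / (tn + fp)            # 225 / 248 ≈ 0.91: healthy patients the model catches
precision_1 = tp / (tp + fp)         # 68 / 91  ≈ 0.75: patients predicted diabetic who really are diabetic
recall_1 = tp / (tp + fn)            # 68 / 136 = 0.50: diabetic patients the model catches
print(precision_0, recall_0, precision_1, recall_1)

These match the 0.77/0.91 and 0.75/0.50 figures in the report.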

The results were nearly the same as when we used a Keras neural network.

scikit-learn Permutation Importance

In this post, I’ll show why people in the last U.S. election voted for Trump, which is the same as saying against Clinton because the fringe candidates hardly received any votes, relatively speaking.

The technique here handles one of the most vexing questions in black-box classifier and regression models: Which variables should you remove from a regression model to make it more accurate?

(This article is part of our scikit-learn Guide. Use the right-hand menu to navigate.)

How permutation importance works

We’ll use the permutation importance method from the ELI5 Python library, which works with many scikit-learn estimators. It shuffles the values of one input variable at a time and re-scores the trained model, measuring how much the outcome goes up or down; that change quantifies each variable’s impact on the results.

In other words, for linear regression it first fits the model, calculating the coefficients α, β, γ, …

y = αa + βb + γc + … + ωz + B

It then scores the model (using the r2 score, F1, accuracy, or another metric), shuffles the values of one of a, b, c, … and scores it again. The drop in score shows how much the model depends on that variable.
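If you want to see the mechanics without ELI5, here is a simplified sketch of the same idea using only NumPy and a fitted scikit-learn model (the function name and arguments are mine, not part of ELI5; it assumes a model with a score() method and NumPy arrays X and y):

import numpy as np

def manual_permutation_importance(model, X, y, n_repeats=5, seed=1):
    rng = np.random.RandomState(seed)
    baseline = model.score(X, y)                 # r2 score for regression models
    importances = []
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])  # permute one column only
            drops.append(baseline - model.score(X_shuffled, y))
        importances.append(np.mean(drops))       # average drop in score = importance
    return importances

Called with the fitted regression and the xx and yy arrays built later in this post, the columns whose shuffling hurts the score most are the most important.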

There are two downsides to this method:

  1. Other ML frameworks do not support it.
  2. Its output is an HTML object that can only be displayed using IPython (i.e., Jupyter).

Running Jupyter

Nothing could be easier than running Jupyter; it is even easier to set up than Zeppelin, which itself requires little setup. Simply install Anaconda and then, on a Mac, type jupyter notebook. It will open this URL in the browser: http://localhost:8889/tree.

Clinton v. Trump

First, get your U.S. election data here. These summaries are for every county in every state in the U.S.  The code we write is stored here.

We use the read_csv Pandas method to read the election data, taking only a few of the columns. You could add more columns to find what other variables correlate with the voter’s choice.

At the bottom is the complete code. Here we explain different sections. But first, here are the results in both HTML and text format.

You can see the columns in the dataframe by their index; here they are:

Index(['Trump', 'age65plus', 'SEX255214', 'White', 'Black', 'Hispanic',
       'Edu_highschool'],
      dtype='object')

Here are the relative weights.

column name Weight Feature
age65plus 0.5458 ± 0.0367 x0
Hispanic 0.3746 ± 0.0348 x4
White 0.2959 ± 0.0159 x2
SEX255214 0.0323 ± 0.0064 x1
Edu_highschool 0.0272 ± 0.0038 x5
Black 0.0207 ± 0.0023 x3

The graphic is shown in the IPython notebook as follows:

As you can see, the decision whether to vote for Trump is driven mainly by age, with the share of voters 65 and over most closely correlated with the outcome. Surprisingly, gender does not matter much.

The code explained

In another blog, we explain how to perform a linear regression. The technique is the same here, except we use more than one independent variable, i.e., x.

We take everything except Trump as the independent variables xx; Trump is the dependent variable yy. A vote for Trump is a vote not for Hillary, so you might expect yy to be the complement of the other votes, but in the data it is not exactly 1 – xx.

We use the values property of the dataframe to convert it to a NumPy array, as that is what the fit method of LinearRegression requires.

xx = train[['White','Black','Hispanic','age65plus','Edu_highschool','SEX255214']].values
yy = train[['Trump']].values

We do not need to reshape the arrays, as the dimensions fit the requirement that they can be paired up. xx has 3112 rows and 6 columns. yy is 3112 x 1.

xx.shape is (3112, 6)
yy.shape is (3112, 1)

Next we run the fit method of linear_model. Then we print the coefficients:

Coefficients: 
 [[ 0.51503519 -0.13783668 -0.44485146  0.00352575 -0.00985589 -0.00801731]]

Then comes the grand finale: running the fit method of PermutationImportance, followed by drawing the graph. This shuffles the input variables, scores the regression multiple times, and then calculates which independent variables have the greatest impact on the prediction of y.

perm = PermutationImportance(reg, random_state=1).fit(xx, yy)
eli5.show_weights(perm)
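If you are not working in Jupyter, ELI5 can render the same explanation as plain text instead of HTML. This is a sketch based on ELI5’s formatting helpers; the exact imports may vary between ELI5 versions, so treat it as an assumption to verify against your installed version:

from eli5 import explain_weights
from eli5.formatters import format_as_text

print(format_as_text(explain_weights(perm)))   # text version of the weights table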

Here is the complete code.

import matplotlib.pyplot as plt
from sklearn import linear_model
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import eli5
from eli5.sklearn import PermutationImportance

train = pd.read_csv('/Users/walkerrowe/Documents/keras/votes.csv',usecols=['Trump','White','Black','Hispanic','age65plus','Edu_highschool','SEX255214'])
 
xx = train[['White','Black','Hispanic','age65plus','Edu_highschool','SEX255214']].values
yy = train[['Trump']].values
reg = linear_model.LinearRegression()
model=reg.fit(xx,yy)
print('Coefficients: \n', reg.coef_)
perm = PermutationImportance(reg, random_state=1).fit(xx, yy)
eli5.show_weights(perm)
Mean Square Error & R2 Score Clearly Explained

Today we’re going to introduce some terms that are important to machine learning:

  • Variance
  • r2 score
  • Mean square error

We illustrate these concepts using scikit-learn.

(This article is part of our scikit-learn Guide. Use the right-hand menu to navigate.)

Why these terms are important

You need to understand these metrics in order to determine whether regression models are accurate or misleading. Following a flawed model is a bad idea, so it is important that you can quantify how accurate your model is. Understanding that is not so simple.

These first metrics are just a few of them. Other issues, like bias and overfitting (overtraining), also yield misleading results and incorrect predictions.

(Learn more in Bias and Variance in Machine Learning.)

To provide examples, let’s use the code from our last blog post, and add additional logic. We’ll also introduce some randomness in the dependent variable (y) so that there is some error in our predictions. (Recall that in the last blog post we made the independent variable x and the dependent variable y perfectly correlated, to illustrate the basics of how to do linear regression with scikit-learn.)



What is variance?

In terms of linear regression, variance is a measure of how far observed values differ from the average of predicted values, i.e., their difference from the predicted value mean. The goal is to have a value that is low. What low means is quantified by the r2 score (explained below).

In the code below, this is np.var(err), where err is an array of the differences between observed and predicted values and np.var() is the numpy array variance function.

What is r2 score?

The r2 score varies between 0 and 100%. It is closely related to the MSE (see below), but not the same. Wikipedia defines r2 as

” …the proportion of the variance in the dependent variable that is predictable from the independent variable(s).”

Another definition is “(total variance explained by model) / total variance.” So if it is 100%, the two variables are perfectly correlated, i.e., with no unexplained variance at all. A low value indicates a low level of correlation, which usually, but not always, means the regression model is not valid.

Reading the code below, we do this calculation in three steps to make it easier to understand. g is the sum of the squared differences between the observed values and the predicted ones, (ytest[i] – preds[i]) ** 2 (the residual sum of squares). y is the sum of the squared differences between each observed value ytest[i] and the average of the observed values, np.mean(ytest) (the total sum of squares). Then the results are printed:

print ("total sum of squares", y)
print ("ẗotal sum of residuals ", g)
print ("r2 calculated", 1 - (g / y))

Our goal here is to explain. We can of course let scikit-learn do this with the r2_score() method:

print("R2 score : %.2f" % r2_score(ytest,preds))

What is mean square error (MSE)?

Mean square error (MSE) is the average of the squares of the errors. The larger the number, the larger the error. Error here means the difference between the observed values y1, y2, y3, … and the predicted ones pred(y1), pred(y2), pred(y3), … We square each difference, (pred(yn) – yn) ** 2, so that negative and positive values do not cancel each other out.
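As a tiny standalone check (made-up numbers, separate from the complete code below), the manual calculation and scikit-learn’s helper give the same answer:

import numpy as np
from sklearn.metrics import mean_squared_error

observed = np.array([9.0, 8.5, 14.0])
predicted = np.array([8.0, 10.0, 12.0])

print(np.mean((observed - predicted) ** 2))     # (1 + 2.25 + 4) / 3 ≈ 2.42
print(mean_squared_error(observed, predicted))  # same value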

The complete code

So here is the complete code:

import matplotlib.pyplot as plt
from sklearn import linear_model
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

reg = linear_model.LinearRegression()
ar = np.array([[[1],[2],[3]], [[2.01],[4.03],[6.04]]])
y = ar[1,:]
x = ar[0,:]
reg.fit(x,y)
print('Coefficients: \n', reg.coef_)
xTest = np.array([[4],[5],[6]])
ytest =  np.array([[9],[8.5],[14]])

preds = reg.predict(xTest)
print("R2 score : %.2f" % r2_score(ytest,preds))
print("Mean squared error: %.2f" % mean_squared_error(ytest,preds))

er = []
g = 0
for i in range(len(ytest)):
    print( "actual=", ytest[i], " observed=", preds[i])
    x = (ytest[i] - preds[i]) **2
    er.append(x)
    g = g + x
    
x = 0
for i in range(len(er)):
   x = x + er[i]

print ("MSE", x / len(er))

v = np.var(er)
print ("variance", v)

print ("average of errors ", np.mean(er))

m = np.mean(ytest)
print ("average of observed values", m)

y = 0
for i in range(len(ytest)):
    y = y + ((ytest[i] - m) ** 2)

print ("total sum of squares", y)
print ("ẗotal sum of residuals ", g)
print ("r2 calculated", 1 - (g / y))

Results in:

Coefficients: 
 [[2.015]]
R2 score : 0.62
Mean squared error: 2.34
actual= [9.] observed= [8.05666667]
actual= [8.5] observed= [10.07166667]
actual= [14.] observed= [12.08666667]
MSE [2.34028611]
variance 1.2881398892129619
average of errors 2.3402861111111117
average of observed values 10.5
total sum of squares [18.5]
total sum of residuals [7.02085833]
r2 calculated [0.62049414]

You can see by looking at the data np.array([[[1],[2],[3]], [[2.01],[4.03],[6.04]]]) that every dependent variable is roughly twice the independent variable. That is confirmed as the calculated coefficient reg.coef_ is 2.015.

There is no correct value for MSE. Simply put, the lower the value the better and 0 means the model is perfect. Since there is no correct answer, the MSE’s basic value is in selecting one prediction model over another.

Similarly, there is also no correct answer as to what R2 should be. 100% means perfect correlation. Yet, there are models with a low R2 that are still good models.

Our takeaway message here is that you cannot look at these metrics in isolation when sizing up your model. You have to look at other metrics as well, plus understand the underlying math. We will get into all of this in subsequent blog posts.



Getting Started with scikit-learn

Here we explore another machine learning framework, scikit-learn, and show how to use matplotlib to draw graphs. Check out the official site for scikit-learn.

The scikit-learn Python ML API predates Apache Spark and TensorFlow, which is to say it has been around longer than big data. It has long been used by those who see themselves as pure data scientists, as opposed to data engineers. Still, you can connect scikit-learn to Spark so that transformations and calculations run across a cluster of machines. Without that, you can only work with datasets that fit into the memory, CPU, and disk space of a single machine.

Scikit-learn is very strong on statistical functions and packed with almost every algorithm you can think of, including some that only academics and mathematicians would use, plus neural networks. But you need not be a mathematician to get started with it. Here we show how to code the simplest possible example.

(This article is part of our scikit-learn Guide. Use the right-hand menu to navigate.)

Environment

You should use conda to set up Python and install packages. This will make it easy to integrate scikit-learn with Zeppelin Notebooks. It is difficult to add Python packages to Zeppelin otherwise.

You could run this example from the Python command line, but Zeppelin is graphical, so you can draw nice graphs with matplotlib.

When you install Zeppelin, install the package with ALL interpreters so that Python, Spark, matplotlib, etc. are there. Follow this example if you are new to Zeppelin.

Install Anaconda and then enter this command to install scikit-learn:

conda install scikit-learn

Then in Zeppelin point your Python interpreter to the Anaconda Python like this:

Example

We will do the simplest possible example: linear regression with one independent variable and one dependent variable. We will set up our data so that it is perfectly correlated, meaning the variance score (r2) is 1. In other words, with LR the goal is to solve this equation for a straight line by finding the coefficient m and the y intercept b.

y = mx + b

But our goal here is just to show how to use scikit-learn. So we will mock our data using y = 2x + 0 = 2x. (If you want a more realistic example, you could draw random noise from a normal distribution and add it to y, as sketched below.)
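For instance, a quick sketch of that suggestion (not part of the example that follows) might look like this, with the noise scale of 0.5 being an arbitrary choice:

import numpy as np

rng = np.random.RandomState(0)
x = np.arange(1, 11).reshape(-1, 1)       # independent variable as a column vector
noise = rng.normal(0, 0.5, size=x.shape)  # random noise drawn from a normal distribution
y = 2 * x + noise                         # y = 2x plus a small random error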

The code from this example is stored here as a Zeppelin notebook. Below we show the code in sections. You can paste the sections into Zeppelin paragraphs in a Zeppelin notebook. Be sure to select the python interpreter when you create the notebook.

import matplotlib.pyplot as plt
from sklearn import linear_model
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

First we instantiate the LinearRegression model.

reg = linear_model.LinearRegression()

Next we make an array. On the left are the independent variables 1, 2, 3. On the right are the dependent ones 2, 4, 6. As you can see, each dependent variable y is equal to the corresponding independent variable x times 2, i.e., y = 2x.

Scikit-learn expects you to use a numpy array.

ar = np.array([[[1],[2],[3]], [[2],[4],[6]]])
ar

Results. Printing out each result helps to visualize and understand it.

array([[[1],
        [2],
        [3]],
       [[2],
        [4],
        [6]]])

Here we slice the array, taking just the y variables. (Slicing and reshaping arrays is a complicated topic. You can read about that here.) This array’s shape, ar.shape, is given by the tuple (2, 3, 1): it is an array of two 3 x 1 arrays. So we take the 3 x 1 array at the second position for y:

y = ar[1,:]
y

Results:

array([[2],
       [4],
       [6]])

Then take the first one for x.

x = ar[0,:]
x

Results:

array([[1],
       [2],
       [3]])

Now we have the independent variables stored in a simple vector x and the dependent variables in y. In terms of machine learning, we say we have 1 feature and 1 label.

Now we run fit(), which uses the least squares method to find the line y = mx by finding the slope, i.e., the coefficient m.

reg.fit(x,y)

Now we can print the coefficient from the linear regression API.

print('Coefficients: \n', reg.coef_)

Result is 2, which we would expect since we know the line is y = 2x.

Coefficients: 
 [[2.]]

Now let’s make test data the same way, 4,5,6 and then 2 times that for y:

xTest = np.array([[4],[5],[6]])
xTest

Results:

array([[4],
       [5],
       [6]])

Now feed the x test values xTest into the predict() method. It will return an array of y predictions.

ytest =  np.array([[8],[10],[12]])

preds = reg.predict(xTest)
preds

Results:

array([[ 8.],
       [10.],
       [12.]])

Now look at the error, which is the difference between the observed values (ytest) and predicted (preds). Of course, it is zero.

print("Mean squared error: %.2f" % mean_squared_error(ytest,preds))

Results:

Mean squared error: 0.00

And the variance score (r2) is 1, since the data is perfectly correlated.

print("Variance score: %.2f" % r2_score(ytest,preds))

Results

Variance score: 1.00

Now we can plot it using matplotlib:

plt.scatter(xTest,preds, color='black')
plt.plot(xTest,preds,color='blue', linewidth=3)

plt.show()

That is the simplest possible example of how to use scikit-learn ML library to do linear regression.
