Introduction to TensorFlow and Logistic Regression

Here we introduce TensorFlow, an open-source machine learning library developed by Google. We explain what it does and show how to use it for logistic regression.

Background

TensorFlow has many applications in machine learning, including neural networks. One application of neural networks is handwriting analysis; another is facial recognition. TensorFlow is designed to let such problems scale, because the nodes in its computation graph can be run across a distributed network. Google uses TensorFlow in some of its production applications.

One interesting aspect of TensorFlow is that it can run its computations not only on a machine's CPU but also on its GPU, or graphics processing unit. That provides more computing power per machine, because GPUs are built to perform many arithmetic operations in parallel, a capability originally developed for rendering graphics quickly.

Install and Basic Concepts

To follow this tutorial, first install TensorFlow (TF) using the directions here. The code below uses the TensorFlow 1.x API (tf.Session, tf.placeholder, and so on).

The basic unit in TensorFlow is the tensor. A tensor is an array with any number of dimensions. For example:

[1] is a one-dimensional array
[[1,1]] is a two-dimensional array
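To make the idea of dimensions concrete, the following sketch computes the shape of a nested Python list. The helper shape_of is hypothetical (it is not part of TensorFlow) and assumes the nesting is regular, as it would be for a valid tensor.

```python
def shape_of(value):
    """Return the shape of a regularly nested list as a tuple."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:
            break
        value = value[0]  # descend into the first element
    return tuple(shape)

print(shape_of([1]))       # (1,)   one dimension of length 1
print(shape_of([[1, 1]]))  # (1, 2) two dimensions: 1 row, 2 columns
```

A plain number has the empty shape (), which corresponds to a scalar.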

To get started, first run Python and import TensorFlow:

import tensorflow as tf

You can assign values directly or create a placeholder whose value you supply later. For example, a single value can be written:

x = tf.constant(3.0, dtype=tf.float32)

Here x is an immutable constant (meaning you cannot change it).

But the tensor has no value until you create a Session and run it:

import tensorflow as tf
sess = tf.Session()
x = tf.constant(3.0, dtype=tf.float32)
print(sess.run([x]))

Outputs:

[3.0]

Or you can write:

import tensorflow as tf
sess = tf.Session()
y = tf.Variable([3.0], dtype=tf.float32)
init = tf.global_variables_initializer()
sess.run(init)
print(sess.run([y]))

Outputs:

[array([ 3.], dtype=float32)]

In the example above, the Variable has no value until you run the initializer returned by tf.global_variables_initializer() in the session.

You can add tensors and do other math, like this:

x = tf.constant([3,3], dtype=tf.float32)
y = tf.constant([4,4], dtype=tf.float32)
print(x + y)
print(sess.run([x+y]))

Outputs:

Tensor("add_4:0", shape=(2,), dtype=float32)
[array([ 7., 7.], dtype=float32)]

As you can see, x + y is not evaluated until you call run. Printing it directly shows only the Tensor object that describes the operation, not its result.
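The idea of building an expression first and computing it later can be sketched in plain Python. The classes Node, Const, and Add below are hypothetical and greatly simplified; they only illustrate deferred evaluation and are not how TensorFlow is implemented.

```python
class Node:
    """Base class: adding two nodes builds a new node, it does not compute."""
    def __add__(self, other):
        return Add(self, other)

class Const(Node):
    def __init__(self, values):
        self.values = list(values)

    def run(self):
        return self.values

class Add(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def run(self):
        # Element-wise sum, computed only when run() is called.
        return [a + b for a, b in zip(self.left.run(), self.right.run())]

x = Const([3.0, 3.0])
y = Const([4.0, 4.0])
total = x + y                # builds an Add node; nothing is computed yet
print(type(total).__name__)  # Add
print(total.run())           # [7.0, 7.0]
```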

Here is another example: the line f(x) = mx + b, where m is the slope and b is the y-intercept.

m = tf.Variable([2], dtype=tf.float32)
b = tf.Variable([3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = m * x + b

You can feed an array of n values to the placeholder, and the function is evaluated for each of them. Here we use [1, 2, 3, 4]:

init = tf.global_variables_initializer()
sess.run(init)
print(sess.run(y, {x: [1, 2, 3, 4]}))

Outputs:

[ 5. 7. 9. 11.]
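As a quick check of that arithmetic outside TensorFlow, the same line can be evaluated in plain Python. The function name line is just an illustrative choice.

```python
m, b = 2.0, 3.0  # slope and y-intercept from the example above

def line(x):
    """Evaluate f(x) = m * x + b for a single value."""
    return m * x + b

print([line(x) for x in [1, 2, 3, 4]])  # [5.0, 7.0, 9.0, 11.0]
```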

Logistic Regression with tf.estimator

For background on logistic regression and on interpreting the results, you can read this document from Wikipedia. We also get our test data from that document. The goal is to predict the likelihood that a student will pass a test given how many hours they have studied.
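As a reminder of the model itself: logistic regression passes a linear function of the input through the logistic (sigmoid) function to get a probability between 0 and 1. The sketch below shows this in plain Python. The function pass_probability and the coefficient values are illustrative assumptions only; they are not fitted to the data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def pass_probability(hours, intercept, slope):
    """Probability of passing under a logistic model with given coefficients."""
    return sigmoid(intercept + slope * hours)

# Illustrative coefficients only; they are NOT fitted to the data.
intercept, slope = -4.0, 1.5
for h in (1.0, 3.0, 5.0):
    print(h, round(pass_probability(h, intercept, slope), 3))
```

With a positive slope, more hours of study always gives a higher predicted probability of passing.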

Having installed TensorFlow, start Python. Copy and paste the code below into the Python interpreter as we explain it.

First we import pandas, as it is the easiest way to work with columnar data. The hours are floating-point numbers, like x.xx. We multiply them by 100 and convert them to integers, since the TensorFlow functions we use for logistic regression require either strings or integers.

import pandas
hours = [0.50,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50]
passx = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
df = pandas.DataFrame(passx)
df['hours'] = hours
df.columns = ['pass', 'hours']
h = df['hours'].apply(lambda x: x * 100).astype(int)
df['hours'] = h
print(df)

Outputs:

   pass  hours
0     0     50
1     0     75
2     0    100
3     0    125
...
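A caution about the integer conversion: astype(int) truncates toward zero rather than rounding. Every value in this data set is a multiple of 0.25, which is exactly representable in binary floating point, so nothing is lost here. For other values, rounding first is safer. The plain-Python sketch below (no pandas needed; to_hundredths is a hypothetical helper) shows the difference.

```python
def to_hundredths(hours):
    """Scale by 100 and round to the nearest integer before converting."""
    return int(round(hours * 100))

print(int(0.29 * 100))      # 28, because 0.29 * 100 is 28.999999999999996
print(to_hundredths(0.29))  # 29
print(to_hundredths(1.75))  # 175
```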

We create a function input_fn that we can pass to the LinearClassifier model below. It returns an input function, built with tf.estimator.inputs.pandas_input_fn, that feeds the data frame to the model in batches.

def input_fn(df):
    labels = df["pass"]
    return tf.estimator.inputs.pandas_input_fn(
        x=df,
        y=labels,
        batch_size=100,
        num_epochs=10,
        shuffle=False,
        num_threads=5)

TensorFlow writes checkpoints and other working data to disk, so we give it a directory for that. We also have to create a NumericColumn object (with tf.feature_column.numeric_column), since our independent variable is continuous and not categorical. Then we create the LinearClassifier model.

import tensorflow as tf
import tempfile
model_dir = tempfile.mkdtemp()
hours = tf.feature_column.numeric_column("hours")
base_columns = [hours]
m = tf.estimator.LinearClassifier(model_dir=model_dir, feature_columns=base_columns)
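For intuition about what a linear classifier learns from this data, here is a minimal logistic regression fitted with plain batch gradient descent in pure Python. This is a sketch of the idea, not how tf.estimator.LinearClassifier is implemented (it uses its own optimizer and training loop). The function fit, the learning rate, and the iteration count are assumptions, and the hours are left unscaled.

```python
import math

hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, learning_rate=0.1, iterations=5000):
    """Fit intercept b0 and slope b1 by gradient descent on the mean log loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # prediction minus label
            grad0 += error
            grad1 += error * x
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1

b0, b1 = fit(hours, passed)
for h in (1.0, 3.0, 5.0):
    print("hours=%.1f  P(pass)=%.2f" % (h, sigmoid(b0 + b1 * h)))
```

The fitted slope comes out positive, so the predicted probability of passing rises with the hours studied.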

Now we run the train method.

m.train(input_fn(df),steps=None)

Outputs:

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmpS8OD2H/model.ckpt.
INFO:tensorflow:loss = 69.3147, step = 1
INFO:tensorflow:Saving checkpoints for 10 into /tmp/tmpS8OD2H/model.ckpt.
INFO:tensorflow:Loss for final step: 54.1885.
<tensorflow.python.estimator.canned.linear.LinearClassifier object at 0x7f103b560390>
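The first reported loss has a simple explanation under two assumptions: that the model starts with all weights at zero (so it predicts a probability of 0.5 for every example), and that the reported loss is the log loss summed over a batch of 100 examples, matching batch_size in input_fn. Each example then contributes ln 2:

```python
import math

batch_size = 100                   # from input_fn above
per_example_loss = -math.log(0.5)  # log loss when predicting 0.5, equals ln 2

print(round(per_example_loss, 6))               # 0.693147
print(round(batch_size * per_example_loss, 4))  # 69.3147
```

That matches the loss logged at step 1, which suggests training started from an uninformed model.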

We use the same data for the test set as for the training set. In practice you would split the data in two, but we have very little data here.

results = m.evaluate(input_fn(df),steps=None)

Outputs:

INFO:tensorflow:Starting evaluation at 2017-11-02-14:20:16
INFO:tensorflow:Restoring parameters from /tmp/tmpS8OD2H/model.ckpt-10
INFO:tensorflow:Finished evaluation at 2017-11-02-14:20:16
INFO:tensorflow:Saving dict for global step 10: accuracy = 0.75, accuracy_baseline = 0.5, auc = 0.895, auc_precision_recall = 0.907308, average_loss = 0.535767, global_step = 10, label/mean = 0.5, loss = 53.5767, prediction/mean = 0.585759

Here we print the same results as above, but in an easier-to-read form.

print("model directory = %s" % model_dir)
for key in sorted(results):
    print("%s: %s" % (key, results[key]))

Outputs:

accuracy: 0.75
accuracy_baseline: 0.5
auc: 0.895
auc_precision_recall: 0.907308
average_loss: 0.535767
global_step: 10
label/mean: 0.5
loss: 53.5767
prediction/mean: 0.585759
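Several of these numbers can be cross-checked by hand. The sketch below recomputes label/mean from the labels, derives a majority-class baseline from it, and divides the reported loss by 100 to recover average_loss. Treating accuracy_baseline as the majority-class accuracy and average_loss as the loss divided by a batch size of 100 are assumptions about how the estimator defines them, though both agree with the output above.

```python
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

label_mean = sum(passed) / len(passed)      # fraction of students who passed
baseline = max(label_mean, 1 - label_mean)  # accuracy of always guessing the majority class

print(label_mean)               # 0.5
print(baseline)                 # 0.5
print(round(53.5767 / 100, 6))  # 0.535767, the reported average_loss
print(0.75 * len(passed))       # 15.0 correct predictions out of 20
```

An accuracy of 0.75 against a baseline of 0.5 means the model does better than guessing, but it still misclassifies 5 of the 20 students.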

The accuracy could be improved. You could create a larger data set and split the input data into a training and test data set. You could also adjust num_epochs and other values.
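A minimal sketch of such a split in plain Python follows. The helper train_test_split, the 25% test fraction, and the fixed seed are assumptions chosen for illustration; with only 20 rows the resulting test set is too small to give a reliable accuracy estimate.

```python
import random

hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows reproducibly and hold out a fraction for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

train, test = train_test_split(zip(hours, passed))
print(len(train), len(test))  # 15 5
```

Each part could then be loaded into its own DataFrame and passed to input_fn for training and evaluation respectively.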

Walker Rowe is an American freelance tech writer and programmer living in Chile. He specializes in big data, analytics, and cloud architecture. Find him on LinkedIn or at Southern Pacific Review, where he publishes short stories, poems, and news.