Machine Learning & Big Data Blog

Using Tensorflow to Create Neural Network with TripAdvisor Data: Part II

by Walker Rowe

In Part One we explained the problem we want to solve: predicting how someone might rate one of the Las Vegas hotels on TripAdvisor, given how other people have rated them. Here we write the code to build a neural network to do that. In this part we create and train the model. In the next blog post we will make predictions.

Prerequisites

  • Python 3
  • You need to install TensorFlow in Python 3, i.e., pip3 install --upgrade tensorflow
  • Download this data. It is the TripAdvisor data converted to integers using this program. It has the column headings below. Every column is a feature except Score, which is the label.

User country,Nr. reviews,Nr. hotel reviews,Helpful votes,Score,Period of stay,Traveler type,Pool,Gym,Tennis court,Spa,Casino,Free internet,Hotel name,Hotel stars,Nr. rooms,User continent,Member years,Review month,Review weekday

Below is the code, which you can copy from here. We explain each section.

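Below we put each column name into an array, dropping the spaces and periods from the headings. The feature_names list holds 20 names, one per column heading above, and FIELD_DEFAULTS holds 21 integer defaults. tf.decode_csv() expects exactly one default per column in the file, so trim or extend FIELD_DEFAULTS if your copy of the file has a different number of columns. We use integer defaults to tell TensorFlow that the values are integers and not floats.

import tensorflow as tf

# Column names in the same order as the columns in the CSV file.
feature_names = ['Usercountry', 'Nrreviews', 'Nrhotelreviews', 'Helpfulvotes', 'Score', 'Periodofstay',
                 'Travelertype', 'Pool', 'Gym', 'Tenniscourt', 'Spa', 'Casino', 'Freeinternet',
                 'Hotelname', 'Hotelstars', 'Nrrooms', 'Usercontinent', 'Memberyears',
                 'Reviewmonth', 'Reviewweekday']

# Integer defaults, so that every column is parsed as an integer.
FIELD_DEFAULTS = [[0], [0], [0], [0], [0],
                  [0], [0], [0], [0], [0],
                  [0], [0], [0], [0], [0],
                  [0], [0], [0], [0], [0], [0]]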

Next we want to read the data as a .csv file. TensorFlow provides the tf.decode_csv() method to parse one line at a time. (In TensorFlow 2 the same function is named tf.io.decode_csv().) We use the dataset map() method to call parse_line for each line in the dataset. The result is a TensorFlow dataset, which is not an ordinary Python collection. It is designed to work with Tensors. If you do not know what a Tensor is you can review this.

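This routine returns the features as a dictionary and the label as a separate tensor. The dict(zip()) call pairs each column name with its value. We pair the names with the values first and only then remove Score from the dictionary, since Score is not a feature. It is the label. Deleting parsed_line[4] before the zip() would shift every later name onto the wrong column.

def parse_line(line):
    # Parse one CSV line into a list of scalar tensors, one per column.
    parsed_line = tf.decode_csv(line, FIELD_DEFAULTS)
    # Pair names with values while Score is still in position 4,
    # so that every name stays aligned with its own column.
    features = dict(zip(feature_names, parsed_line))
    # Score is the label, not a feature, so take it out of the dictionary.
    label = features.pop('Score')
    print("dictionary", features, " label = ", label)
    return features, label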
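
The label-splitting step can be tried out in plain Python, without TensorFlow. The sketch below is only an illustration: the five-column list and the sample line are made up, and the csv module stands in for tf.decode_csv(). It shows the general pattern of pairing names with values and then pulling the label out by name.

```python
import csv
import io

# Hypothetical, shortened column list; the real file has more columns.
column_names = ["Usercountry", "Nrreviews", "Nrhotelreviews", "Helpfulvotes", "Score"]

def parse_row(line):
    """Split one CSV line into a feature dictionary and a label."""
    values = [int(v) for v in next(csv.reader(io.StringIO(line)))]
    row = dict(zip(column_names, values))
    label = row.pop("Score")  # remove the label so it is not used as a feature
    return row, label

# Made-up sample line, not taken from the data set.
features, label = parse_row("12,30,5,7,4")
print(features, label)
```

Removing the label by name rather than by position keeps the remaining names and values aligned.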

TensorFlow provides the tf.data.TextLineDataset() method to read a .csv file into a TensorFlow dataset. tf.estimator.DNNClassifier.train() requires an input function, in this case csv_input_fn(), which returns a dataset of features and labels. We call dataset.shuffle() so that rows are not fed to the network in file order, repeat() so that the dataset can be read for as many steps as training needs, and batch() to group the rows into batches.

def csv_input_fn(csv_path, batch_size):
    dataset = tf.data.TextLineDataset(csv_path)
    dataset = dataset.map(parse_line)
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)
    return dataset
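
To get a feel for what the shuffle, repeat, and batch chain produces, here is a rough plain-Python analogue. It is a sketch under two assumptions: the whole file fits inside the shuffle buffer (so each pass is fully shuffled), and the helper names shuffled_forever and batches are invented for this illustration. It is not how TensorFlow implements these operations.

```python
import itertools
import random

def shuffled_forever(rows, rng):
    """Yield the rows endlessly, reshuffling at the start of every pass."""
    while True:                 # plays the role of repeat()
        epoch = list(rows)
        rng.shuffle(epoch)      # plays the role of shuffle()
        yield from epoch

def batches(rows, batch_size, seed=0):
    """Group the endless shuffled stream into fixed-size batches."""
    stream = shuffled_forever(rows, random.Random(seed))
    while True:                 # plays the role of batch()
        yield list(itertools.islice(stream, batch_size))

# Take the first five batches of size 2 from a tiny five-row "file".
first_batches = list(itertools.islice(batches(range(5), 2), 5))
print(first_batches)
```

Because batching happens after repeating, a batch can contain rows from the end of one pass and the start of the next.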

We have to create a feature column for each column in the dataset. We have both categorical data (e.g., 0 and 1) and numeric data (e.g., number of reviews).

Categorical data we encode with, for example, the line below, where 47 means there are 47 categories. In other words, our sample data comes from people from 47 different countries. We can use df['User continent'].groupby(df['User continent']).count(), for example, to count the rows in each category; the number of groups is the number of unique elements.

tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity("Usercountry",47))
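
As an illustration of what this encoding means, the sketch below builds the one-hot vector for an integer category id in plain Python. The helper name one_hot is invented for this example, and the ValueError is a stand-in for the failure TensorFlow reports when an id falls outside the range 0 to num_buckets - 1 and no default value is set.

```python
def one_hot(category_id, num_buckets):
    """Return a list of num_buckets zeros with a single 1 at category_id."""
    if not 0 <= category_id < num_buckets:
        raise ValueError("category id %d is outside [0, %d)" % (category_id, num_buckets))
    vector = [0] * num_buckets
    vector[category_id] = 1
    return vector

print(one_hot(3, 5))  # [0, 0, 0, 1, 0]
```

So a column declared with 47 buckets contributes 47 inputs to the network, one per country.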

Numeric data we encode with, for example:

Nrreviews = tf.feature_column.numeric_column("Nrreviews")
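
One way to pick the bucket count for a categorical column is to look at the values that actually occur in it. The sketch below uses only the standard library and a made-up four-row sample, not the real file. Since the ids must lie between 0 and the bucket count minus 1, the largest id plus one is the smallest bucket count that works.

```python
import csv
import io
from collections import Counter

# Made-up sample with two columns; the real file has many more rows and columns.
sample = "Usercontinent,Score\n0,5\n3,4\n3,2\n5,5\n"

reader = csv.DictReader(io.StringIO(sample))
counts = Counter(int(row["Usercontinent"]) for row in reader)

num_distinct = len(counts)     # how many different ids occur
min_buckets = max(counts) + 1  # smallest bucket count that covers every id
print(counts, num_distinct, min_buckets)
```

In this sample only three ids occur, but six buckets are needed because the largest id is 5.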

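Here is that full section. Notice in the last line we collect the feature columns into the feature_columns array. Reviewmonth is defined but is not included in that array, so the model does not use it.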

Usercountry = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity("Usercountry",47))
Nrreviews = tf.feature_column.numeric_column("Nrreviews")
Nrhotelreviews = tf.feature_column.numeric_column("Nrhotelreviews")
Helpfulvotes = tf.feature_column.numeric_column("Helpfulvotes")
Periodofstay = tf.feature_column.numeric_column("Periodofstay")
Travelertype = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity("Travelertype",5))
Pool = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity("Pool",2))
Gym = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity("Gym",2))
Tenniscourt = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity("Tenniscourt",2))
Spa = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity("Spa",2))
Casino = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity("Casino",2))
Freeinternet = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity("Freeinternet",2))
Hotelname = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity("Hotelname",24))
Hotelstars = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity("Hotelstars",5))
Nrrooms = tf.feature_column.numeric_column("Nrrooms")
Usercontinent = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity("Usercontinent",6))
Memberyears = tf.feature_column.numeric_column("Memberyears")
Reviewmonth = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity("Reviewmonth",12))
Reviewweekday = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity("Reviewweekday",7))
feature_columns = [Usercountry, Nrreviews, Nrhotelreviews, Helpfulvotes, Periodofstay,
                   Travelertype, Pool, Gym, Tenniscourt, Spa, Casino, Freeinternet,
                   Hotelname, Hotelstars, Nrrooms, Usercontinent, Memberyears, Reviewweekday]

Here is the tf.estimator.DNNClassifier, where DNN means Deep Neural Network. We give it the feature columns and the directory where it should store the model. We also say there are 5 classes since hotel scores range from 1 to 5. DNNClassifier expects integer class labels from 0 to n_classes - 1, so the Score values in the file need to be encoded as 0 to 4. For hidden units we pick [10, 10]. This means the first hidden layer of the neural network has 10 nodes and the second hidden layer has 10. You can read more about how to pick that number by reading, for example, this StackOverflow article. I do not yet know if this is the correct value. We will see when we make predictions in the next post.

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[10, 10],
    n_classes=5,
    model_dir="/tmp")
batch_size =
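
To see how large this network is, we can add up its inputs and weights by hand. The sketch below assumes each layer is an ordinary fully connected layer with one weight per input-output pair plus one bias per output, and it takes the bucket counts from the feature_columns list above. Treat the totals as a back-of-the-envelope estimate, not as a figure reported by TensorFlow.

```python
# Bucket counts of the categorical columns that appear in feature_columns.
categorical_buckets = {
    "Usercountry": 47, "Travelertype": 5, "Pool": 2, "Gym": 2,
    "Tenniscourt": 2, "Spa": 2, "Casino": 2, "Freeinternet": 2,
    "Hotelname": 24, "Hotelstars": 5, "Usercontinent": 6, "Reviewweekday": 7,
}
# Numeric columns contribute one input each.
numeric_columns = ["Nrreviews", "Nrhotelreviews", "Helpfulvotes",
                   "Periodofstay", "Nrrooms", "Memberyears"]

input_width = sum(categorical_buckets.values()) + len(numeric_columns)

# Input layer, two hidden layers of 10 nodes, and 5 output classes.
layer_sizes = [input_width, 10, 10, 5]
total_params = sum(n_in * n_out + n_out
                   for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(input_width, total_params)  # 112 1295
```

Most of the weights sit between the one-hot inputs and the first hidden layer, which is typical when categorical columns have many buckets.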

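Finally we call the train() method and give it an inline (lambda) call to csv_input_fn with the path from which we read the csv file. train() calls the input function with no arguments, so the lambda is what lets us pass the path and the batch size.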

classifier.train(
    steps=1000,
    input_fn=lambda: csv_input_fn("/home/walker/tripAdvisorFL.csv", batch_size))
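
The reason for the lambda is easier to see without TensorFlow in the way. In the sketch below, fake_train is a hypothetical stand-in that, like train(), calls its input function with no arguments. The csv_input_fn here is a dummy that just returns its arguments, and the path "data.csv" and batch size 32 are made-up values. functools.partial is shown as an equivalent alternative to the lambda.

```python
from functools import partial

def csv_input_fn(csv_path, batch_size):
    """Dummy stand-in for the real input function; returns its arguments."""
    return csv_path, batch_size

def fake_train(input_fn):
    """Hypothetical stand-in for train(): calls input_fn with no arguments."""
    return input_fn()

# Both wrappers freeze the arguments so that input_fn() needs none.
via_lambda = fake_train(lambda: csv_input_fn("data.csv", 32))
via_partial = fake_train(partial(csv_input_fn, "data.csv", 32))
print(via_lambda, via_partial)
```

Either form works; the lambda is shorter, while partial keeps the wrapped function visible when debugging.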

In the next blog post we will show how to make predictions from this model, meaning estimate how a customer might rate a hotel given their characteristics. The hotel could then decide how much effort to expend to make this customer happy, or whether to expend any effort at all.

These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

About the author

Walker Rowe

Walker Rowe is an American freelance tech writer and programmer living in Tunisia. He specializes in big data, analytics, and programming languages. Find him on LinkedIn or Upwork.