Using TensorFlow to Create a Neural Network with TripAdvisor Data: Part I

When people start learning neural networks with TensorFlow, they usually begin with the MNIST handwritten digit database. That exercise builds a model that predicts which digit a person has drawn, based on thousands of handwriting samples. To put that into features-labels terms, the pixel values of a grayscale image (white, black, and shades of grey) are the features, and the digit drawn (0, 1, ..., 8, 9) is the label.

Here we use other data.

Prerequisites

Before reading this TensorFlow Neural Network tutorial, you should first study these three blog posts:

Introduction to TensorFlow and Logistic Regression
What is a Neural Network? Introduction to Neural Networks Part I
Introduction to Neural Networks Part II

Then you need to install TensorFlow. The easiest way to do that on Ubuntu is to follow these instructions and use virtualenv.

Then install Python Pandas, numpy, scikit-learn, and SciPy packages.

The Las Vegas Strip Hotel Dataset from Trip Advisor

Programmers who are learning to use scikit-learn often start with the Iris dataset. Given the sepal and petal measurements of a flower, the model predicts which species of Iris it is. You add that data to a Python program by using from sklearn import datasets.
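For readers who have not seen it, loading the Iris data looks roughly like this. It is only a minimal sketch for comparison with the TripAdvisor data used in the rest of this post, and it assumes scikit-learn is installed; the variable names features and labels are ours.

```python
from sklearn import datasets

# Load the Iris dataset that ships with scikit-learn.
iris = datasets.load_iris()

features = iris.data    # one row per flower, one column per measurement
labels = iris.target    # species encoded as the integers 0, 1, 2

print(iris.feature_names)
print(iris.target_names)
print(features.shape, labels.shape)
```

The features are numeric measurements and the labels are integer class codes, which is the same shape of problem we build below from the hotel reviews.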

But we want to do something original here instead of using the Iris dataset. So we will use the Las Vegas Strip Data Set, cited in the paper “Moro, S., Rita, P., & Coelho, J. (2017). Stripping customers’ feedback on hotels through data mining: The case of Las Vegas Strip. Tourism Management Perspectives, 23, 41-52.” and see if we can wrap a neural network around it.

In their paper, the authors wrote a model using the R programming language and used a Support Vector Machine (SVM) as their algorithm. Used this way, an SVM solves a kind of non-linear regression problem. Loosely speaking, it follows the same idea as ordinary linear regression (LR), which is to find the line that minimizes the prediction error, such as the MSE (mean square error), to build a predictive model. But SVMs take that up a notch in complexity by working with multiple, nonlinear inputs and finding a plane in n-dimensional space rather than a line on the XY Cartesian plane.

Here we take the same data but use a neural network instead of an SVM.

The Python Code

Our basic approach is to modify our data so that it fits into this Python program written by Vinh Khuc. Mr Khuc’s LinkedIn resume says he is a data scientist at eBay. His program uses the Iris Dataset. So we modified it to use the Tripadvisor data.

The Data

Click here to see the data in Google Sheets format. The data is too wide to fit on one screen so we show it below in two screen prints. If you read the paper cited above you can get more details about the data, but basically it is TripAdvisor data for 21 hotels along the Las Vegas Strip. The goal is to build a model that will predict what score an individual is likely to give to which hotel.

The score ranges from 1 to 5, and the inputs are the other variables described in the spreadsheet below (20 columns in all, including the score).

The authors of the paper say that certain data—like whether the hotel has a casino, pool, number of stars, or free internet—does not have much bearing on the score given by the hotel guest on Tripadvisor. Rather the factors that most heavily predict the score are the number of reviews the reviewer has written and how long they have been writing reviews. Other factors that influence the score are the day of the week and the month of the year.

Convert Values to Integers

First we need to convert all of those values to integers as machine learning uses arrays of numbers as input. We adopt three approaches:

  1. If the value is already an integer, leave it.
  2. If the value is YES or NO, change it to 1 or 0.
  3. If the value is a string, use the ord function to change each letter to an integer. Then sum those integers.

import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42
tf.set_random_seed(RANDOM_SEED)  # TensorFlow 1.x API

def yesNo(x):
    # Map the string "YES" to 1 and anything else to 0.
    if x == "YES":
        return 1
    else:
        return 0

cols = ['User country', 'Nr. reviews', 'Nr. hotel reviews', 'Helpful votes',
        'Score', 'Period of stay', 'Traveler type', 'Pool', 'Gym',
        'Tennis court', 'Spa', 'Casino', 'Free internet', 'Hotel name',
        'Hotel stars', 'Nr. rooms', 'User continent', 'Member years',
        'Review month', 'Review weekday']

df = pd.read_csv('LasVegasTripAdvisorReviews-Dataset.csv', sep=';', header=0)

# Convert each YES/NO column exactly once. Converting a column a second
# time would turn every value into 0, because 1 and 0 are not "YES".
df['Casino'] = df['Casino'].apply(yesNo)
df['Gym'] = df['Gym'].apply(yesNo)
df['Pool'] = df['Pool'].apply(yesNo)
df['Tennis court'] = df['Tennis court'].apply(yesNo)
df['Free internet'] = df['Free internet'].apply(yesNo)
df['Spa'] = df['Spa'].apply(yesNo)
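The YES/NO conversion can also be written as a loop over the flag columns. The following is a minimal sketch on a toy dataframe with made-up values (the names toy and yes_no_cols are ours, not part of the original program). One thing to watch for with this kind of conversion is that it must run only once per column: a second pass compares the integers 1 and 0 against "YES" and turns the whole column into 0.

```python
import pandas as pd

# Toy frame with hypothetical values; the real data comes from the CSV file.
toy = pd.DataFrame({
    "Pool": ["YES", "NO", "YES"],
    "Casino": ["YES", "YES", "NO"],
    "Hotel stars": [3, 4, 5],
})

yes_no_cols = ["Pool", "Casino"]

# Convert every flag column exactly once: True/False becomes 1/0.
for col in yes_no_cols:
    toy[col] = (toy[col] == "YES").astype(int)

print(toy)
```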

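Here we change every string to an integer. You would have to save the string-integer combination in some data structure so that later you could see which integer equals what string value. Keep in mind that summing character codes is not guaranteed to be unique: two strings made of the same letters in a different order get the same integer.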

def toOrd(s):
    # Sum the character codes of every letter in the string.
    x = 0
    for l in s:
        x += ord(l)
    return int(x)

cols2 = ['Period of stay', 'Hotel name', 'User country', 'Traveler type',
         'User continent', 'Review month', 'Review weekday']

for y in cols2:
    df[y] = df[y].apply(toOrd)
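The data structure for mapping integers back to strings could be as simple as a dictionary. Here is a minimal sketch, assuming the same sum-of-character-codes scheme; the helper names to_ord and build_lookup are hypothetical and not part of the original program. It also flags any codes shared by more than one string.

```python
from collections import defaultdict

def to_ord(text):
    """Sum the character codes of a string (same idea as toOrd)."""
    return sum(ord(ch) for ch in text)

def build_lookup(values):
    """Map each integer code back to the set of strings that produced it."""
    lookup = defaultdict(set)
    for value in values:
        lookup[to_ord(value)].add(value)
    return dict(lookup)

weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
lookup = build_lookup(weekdays)

# Any code that maps to more than one string is a collision.
collisions = {code: names for code, names in lookup.items() if len(names) > 1}

print(lookup)
print("collisions:", collisions)
```

If a column did produce collisions, assigning each distinct string its own integer, for example by its position in a sorted list of unique values, would be a safer encoding.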

Now we drop Score from our input Pandas dataframe df, as that is an output value, which we will put in the array ttarget.

ttarget = df['Score'].values
df = df.drop('Score', axis=1)
# Note: as_matrix was removed in pandas 1.0; newer versions offer to_numpy() instead.
ddata = pd.DataFrame.as_matrix(df, cols).astype(int)

ttarget is an array. As you can see these are discrete values ranging from 1 to 5. So this is a classification problem.

array([5, 3, 5, 4, 4, 3, 4, 4, 4, 3, 2, 3, 2, 3, 3, 4, 1, 4, 3, 2, 4, 1, 4,
2, 4, 4, 5, 3, 5, 5, 5, 3, 3, 3, 4, 3, 4, 4, 4, 2, 3, 3, 3, 4, 4, 4,
4, 3, 3, 4, 2, 4, 3, 4, 5, 2, 4, 4, 4, 4, 4, 3, 2, 3, 3, 4, 3, 2, 4,
1, 2, 5, 3, 4, 4, 5, 4, 4, 4, 3, 3, 4, 3, 4, 5, 4, 4, 4, 3, 4, 5, 3,
5, 4, 4, 5, 4, 5, 3, 3, 4, 4, 5, 4, 4, 3, 4, 4, 4, 4, 5, 5, 1, 5, 5,
2, 4, 5, 5, 5, 5, 4, 5, 5, 5, 3, 5, 4, 1, 4, 4, 1, 5, 3, 5, 4, 5, 5,
5, 5, 5, 3, 4, 4, 4, 5, 4, 5, 5, 5, 4, 5, 5, 2, 2, 5, 1, 4, 5, 4, 5,
5, 5, 5, 5, 5, 2, 5, 3, 5, 5, 4, 5, 5, 4, 5, 5, 5, 4, 3, 4, 5, 4, 4,
3, 5, 5, 5, 5, 3, 5, 4, 5, 5, 4, 5, 5, 4, 5, 4, 3, 5, 5, 5, 5, 5, 5,
5, 2, 5, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 3, 4, 5, 4, 5, 5, 5, 5, 4,
4, 5, 5, 1, 5, 5, 2, 4, 4, 5, 3, 4, 5, 5, 2, 5, 5, 2, 5, 5, 5, 5, 5,
1, 5, 4, 5, 4, 3, 5, 3, 4, 4, 4, 1, 5, 4, 5, 5, 4, 3, 4, 5, 5, 5, 4,
5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 2, 5, 5, 5, 4, 4, 4, 5, 1, 5,
5, 4, 5, 4, 5, 5, 5, 2, 4, 3, 4, 5, 4, 4, 4, 5, 5, 4, 5, 5, 4, 4, 4,
5, 5, 5, 5, 5, 5, 5, 5, 3, 5, 4, 4, 5, 4, 5, 5, 3, 5, 5, 5, 5, 4, 2,
3, 5, 3, 5, 4, 3, 4, 5, 5, 3, 5, 4, 3, 5, 5, 3, 5, 2, 2, 5, 5, 3, 3,
3, 3, 5, 4, 3, 5, 4, 5, 5, 4, 4, 4, 4, 5, 4, 5, 4, 5, 5, 5, 5, 4, 5,
5, 4, 4, 5, 5, 5, 4, 3, 5, 5, 4, 3, 4, 3, 4, 4, 5, 4, 5, 4, 5, 3, 5,
5, 4, 4, 5, 5, 5, 4, 4, 5, 5, 5, 5, 5, 4, 5, 5, 5, 4, 4, 5, 4, 5, 5,
2, 5, 2, 3, 5, 5, 4, 5, 4, 4, 5, 5, 4, 5, 4, 5, 5, 4, 2, 4, 5, 2, 4,
2, 5, 4, 3, 5, 3, 5, 3, 4, 4, 5, 5, 4, 5, 2, 5, 4, 5, 5, 4, 4, 3, 4,
3, 5, 5, 4, 4, 3, 4, 4, 3, 4, 5, 3, 4, 5, 3, 5, 5, 4, 4, 2, 4])
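A quick way to confirm that the labels are a small set of discrete classes is to count how often each one occurs. The sketch below uses only the first twelve scores from the array above, so the counts say nothing about the full dataset; the variable names are ours.

```python
import numpy as np

# The first twelve scores copied from the start of the array shown above.
sample = np.array([5, 3, 5, 4, 4, 3, 4, 4, 4, 3, 2, 3])

# np.unique returns the distinct values and, optionally, how often each occurs.
classes, counts = np.unique(sample, return_counts=True)

for score, n in zip(classes, counts):
    print(f"score {score}: {n} reviews")
```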

Here is what our data looks like now.

As you can see, all the columns have been converted to integers. But we still need to get rid of the column headings and convert this Pandas dataframe to an array of integers. We do that with ddata=pd.DataFrame.as_matrix(df,cols).astype(int).

df.head()

   User country  Nr. reviews  Nr. hotel reviews  Helpful votes  \
0           233           11                  4             13
1           233          119                 21             75
2           233           36                  9             25
3           160           14                  7             14
4           568            5                  5              2

   Period of stay  Traveler type  Pool  Gym  Tennis court  Spa  Casino  \
0             582            715     0    1             0    0       0
1             582            844     0    1             0    0       0
2             628            810     0    1             0    0       0
3             628            715     0    1             0    0       0
4             628            413     0    1             0    0       0

   Free internet  Hotel name  Hotel stars  Nr. rooms  User continent  \
0              1        3367            3       3773            1245
1              1        3367            3       3773            1245
2              1        3367            3       3773            1245
3              1        3367            3       3773             624
4              1        3367            3       3773            1245

   Member years  Review month  Review weekday
0             9           730             852
1             3           730             607
2             2           832             845
3             6           832             607
4             7           491             735

ddata has 504 rows of 20 columns.

ddata.shape
(504, 20)
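On recent pandas releases, where as_matrix is no longer available, the dataframe-to-array step can be sketched with to_numpy() instead. This example uses a toy dataframe with made-up values rather than the real file, and the names toy and data are ours.

```python
import pandas as pd

# Toy frame with hypothetical values standing in for the converted data.
toy = pd.DataFrame({
    "Nr. reviews": [11, 119, 36],
    "Pool": [0, 0, 1],
    "Hotel stars": [3, 3, 4],
})

# to_numpy() drops the column headings and the index,
# leaving a plain two-dimensional array of values.
data = toy.to_numpy().astype(int)

print(data.shape)
print(data)
```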

The programmer writes: “NOTE: In order to make the code simple, we rewrite x * W_1 + b_1 = x’ * W_1′ where x’ = [x | 1] and W_1′ is the matrix W_1 appended with a new row with elements b_1’s. Similarly, for h * W_2 + b_2.”

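What he is referring to is that each layer of a neural network multiplies its input x by a weight matrix W_1 and adds a bias b_1. This is normally written as Wx + b. By adding a column of 1s to the input and a matching row of biases to the weight matrix, he folds the bias into a single matrix multiplication, which makes the next steps easier. The input matrix then looks like this, with one row per review:

[ 1, xa, ...
  1, xb, ...
  1, xc, ...
  .
  .
  .
  1, xn, ... ]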

np.ones makes a matrix of the given size filled with 1s.

aall_X is the new matrix with the 1s prepended.

NN, MM = ddata.shape
aall_X = np.ones((NN, MM + 1))
aall_X[:, 1:] = ddata
nnum_labels = len(np.unique(ttarget)) + 1
aall_Y = np.eye(nnum_labels)[ttarget]
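To see what these lines produce, here is a minimal sketch on a toy input of four reviews with three features each. The values are made up and the names data, target, all_X and all_Y are ours. It also shows why the number of labels is one more than the number of distinct scores: the scores run from 1 to 5, so using them as row indexes into an identity matrix needs rows 0 through 5.

```python
import numpy as np

# Hypothetical toy inputs: 4 reviews, 3 features each.
data = np.array([[11, 0, 3],
                 [119, 1, 4],
                 [36, 0, 5],
                 [14, 1, 3]])
target = np.array([5, 3, 1, 4])   # scores in the range 1..5

# Start from a matrix of 1s with one extra column, then copy the data
# into every column except the first. Column 0 stays all 1s.
n, m = data.shape
all_X = np.ones((n, m + 1))
all_X[:, 1:] = data

# One-hot encode the scores: row k of the identity matrix has a single 1
# in position k. Scores go up to 5, so 6 rows (0..5) are needed and
# position 0 is never used.
num_labels = target.max() + 1
all_Y = np.eye(num_labels)[target]

print(all_X)
print(all_Y)
```

With the full dataset, where all five scores occur, len(np.unique(ttarget)) + 1 gives the same value of 6.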

But someone commenting on his code said there is a better way: “Biases do not need to be added as additional column to input vector but as additional term as in the following: h = tf.nn.sigmoid(tf.add(tf.matmul(X, w_1), b)).”
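The two formulations give the same result, which can be checked numerically. The sketch below uses NumPy as a stand-in for tf.matmul and tf.nn.sigmoid, with random values for the inputs, weights and biases; the names X_aug and w_aug are ours.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
X = rng.normal(size=(4, 3))      # 4 samples, 3 features (random stand-ins)
w_1 = rng.normal(size=(3, 2))    # weights for 2 hidden units
b = rng.normal(size=(2,))        # one bias per hidden unit

# Explicit bias term, as the commenter suggests.
h_explicit = sigmoid(X @ w_1 + b)

# Folded-in bias: prepend a column of 1s to X and put b on top of w_1 as an extra row.
X_aug = np.hstack([np.ones((4, 1)), X])
w_aug = np.vstack([b, w_1])
h_folded = sigmoid(X_aug @ w_aug)

print(np.allclose(h_explicit, h_folded))
```

The explicit form keeps the weights and biases as separate variables, which is usually easier to read; the folded form trades that for a single matrix multiplication.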

In part II we turn our attention to the neural network functions.

These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

Walker Rowe

Walker Rowe is an American freelance tech writer and programmer living in Chile. He specializes in big data, analytics, and cloud architecture. Find him on LinkedIn or at Southern Pacific Review, where he publishes short stories, poems, and news.