Top NumPy Statistical Functions & Distributions

NumPy supports many statistical distributions. This means it can generate samples from a wide variety of use cases. For example, NumPy can help to statistically predict:

  • The chances of rolling a 7 (i.e., winning) in a game of dice
  • How likely someone is to get run over by a car
  • How likely it is that your car will break down
  • How many people will be in line at the checkout counter

We explain by way of examples.

(This tutorial is part of our Pandas Guide. Use the right-hand menu to navigate.)

Randomness & the real world

The NumPy functions don't calculate probabilities. Instead, they draw samples from the probability distribution of the statistic, resulting in a curve. The curve can be steep and narrow, or wide, or it can fall to a small value quickly over time.

Its pattern varies by the type of statistic:

  • Normal
  • Weibull
  • Poisson
  • Binomial
  • Uniform
  • Etc.

Many phenomena in the real world are random. For example, if we set aside nearsightedness, clumsiness, and absentmindedness, then the chance that someone will get hit by a car is roughly equal for all people.

The normal distribution reflects this.

Note that the basic random() function in most programming languages draws from the uniform distribution, not the normal one. Samples from the normal distribution tend to hover about some middle point, known as the mean. The volatility of the observations is called the variance; as the name suggests, if they vary a lot, the variance is large.

Let’s look at these distributions.

Normal

The arguments for the normal distribution are:

  • loc is the mean
  • scale is the square root of the variance, i.e. the standard deviation
  • size is the sample size, or the number of trials; 400 means generate 400 random numbers. We write (400,) but could have written 400. The tuple form shows that the values can have more than one dimension; here we are just picking numbers, not building a cube or anything of higher dimension.
import numpy as np
import matplotlib.pyplot as plt
arr = np.random.normal(loc=0,scale=1,size=(400,))
plt.plot(arr)

Notice in this that the numbers hover about the mean, 0:
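You can also confirm numerically that the samples center on loc with spread scale. A quick sketch:

import numpy as np
arr = np.random.normal(loc=0, scale=1, size=(400,))
print(arr.mean())  # close to loc = 0
print(arr.std())   # close to scale = 1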

Weibull

Weibull is most often used in preventive maintenance applications. It models the failure rate over time. For machine parts, like truck components, this is called the time to failure. Manufacturers publish these figures for planning purposes.

A Weibull distribution has a shape and scale parameter. Continuing with the truck example:

  • Shape is how quickly over time the component is likely to fail, or the steepness of the curve.
  • NumPy's weibull() does not take a scale parameter. Instead, you multiply the values it returns by scale, as in the sketch after the histogram below.
import numpy as np
import matplotlib.pyplot as plt
shape=5
arr = np.random.weibull(shape,400)
plt.hist(arr)

This histogram shows the count of unique observations, or frequency distribution:
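To apply a scale, as mentioned above, you multiply the samples. A minimal sketch, assuming a purely hypothetical characteristic life of 1,000 hours:

import numpy as np
shape = 5
scale = 1000  # hypothetical characteristic life in hours
lifetimes = scale * np.random.weibull(shape, 400)
print(lifetimes.mean())  # average simulated time to failure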

Poisson

Poisson models the probability of a given number of events in a fixed period of time—for example, the number of people in a checkout line.

For example, the length of a queue in a supermarket is governed by the Poisson distribution. If you know that, then you can continue shopping until the line gets shorter and not wait around. That’s because the line length varies, and varies a lot, over time. It’s not the same length all day. So, go shopping or wander the store instead of waiting in the queue.

import numpy as np
import matplotlib.pyplot as plt
arr = np.random.poisson(2,400)
plt.plot(arr)

Here we see the line length varies between 0 and 8. The function does not return a probability. Remember that it returns an observation, meaning it picks a number subject to the Poisson curve.
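Because the function returns observations rather than probabilities, you can estimate a probability by counting. For example, a rough sketch of estimating the chance that more than four people are in line:

import numpy as np
arr = np.random.poisson(2, 400)
print((arr > 4).mean())  # fraction of observations with more than 4 people in line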

Binomial

The binomial distribution models discrete outcomes, like rolling dice.

Let’s look at the game of craps. You roll two dice, and you win when you get a 7. You can get a 7 with these rolls:

  • 1,6
  • 2,5
  • 3,4
  • 4,3
  • 5,2
  • 6,1

So, there are six ways to win. There are 6*6=36 possible rolls. So, the chance of winning is 6/36=⅙.
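You can check that ⅙ figure empirically by simulating individual dice rolls. A sketch using randint:

import numpy as np
rolls = np.random.randint(1, 7, size=(100000, 2))  # 100,000 rolls of two dice
print((rolls.sum(axis=1) == 7).mean())             # approximately 1/6 ≈ 0.167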

To simulate 400 games of 36 rolls each, counting the wins in each game, use:

import numpy as np
import matplotlib.pyplot as plt
arr = np.random.binomial(36,1/6,400)
plt.hist(arr)

Each of the 400 trials records the number of wins in 36 rolls; the histogram peaks near the expected value of 36 × ⅙ = 6 wins.

Uniform

The uniform distribution draws values with equal probability between a low and a high bound.

import numpy as np
import matplotlib.pyplot as plt
arr = np.random.uniform(-1,0,1000)
plt.hist(arr)
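The sample mean should land near the midpoint of the range, here (-1 + 0) / 2 = -0.5. A quick check:

import numpy as np
arr = np.random.uniform(-1, 0, 1000)
print(arr.mean())  # approximately -0.5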

Using the NumPy Bincount Statistical Function

NumPy does a lot more than create arrays. This workhorse also does statistics and functions, such as correlation, which are important for scientific computing and machine learning.

We start our survey of NumPy statistical functions with bincount().

(This tutorial is part of our Pandas Guide. Use the right-hand menu to navigate.)

The bincount function

In NumPy, the bincount() function counts the occurrences of each value in an array of non-negative integers.

First we make an array with:

  • Three 1s
  • Two 2s
  • One 3
  • Five 4s
  • One 5
arr = np.array([1,1,1,2,2,3,4,4,4,4,4,5])

Results in:

array([1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5])

Then we use the NumPy bincount() function to count unique elements.

d=np.bincount(arr)

Results in an array of counts by bin position: the count of 0s, then 1s, then 2s, and so on, from left to right.

Note the 0 in front. bincount() returns one bin for every integer from 0 up to the largest value in the array, so the result has max(arr)+1 entries; the first bin counts the zeros, of which there are none. So, we will make some adjustments for that.

array([0, 3, 2, 1, 5, 1])

We make an array with unique elements from arr. We do this so we can plot the count against the values later.

a=np.unique(arr)

Results in:

array([1, 2, 3, 4, 5])

Because bincount() starts counting at 0, its result has one more element than the list of unique values. So we insert a 0 at the beginning of arr so that the unique values and the bin counts have the same shape and we can plot them.

b=np.insert(arr,0,[0])

This gives us:

array([0, 1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5])

Then we make a unique list out of that:

c=np.unique(b)

Now it has the extra 0 to line up with the bincount

array([0, 1, 2, 3, 4, 5])

Now c and d are the same shape, so we can plot them using Matplotlib.

plt.bar(c,d)

Results in this chart:

As you can see, there are:

  • Five elements with value 4
  • One element with value 3

The complete code

Here is the complete code.

import numpy as np
import matplotlib.pyplot as plt

arr = np.array([1,1,1,2,2,3,4,4,4,4,4,5])
d = np.bincount(arr)        # counts for every integer from 0 to max(arr)
a = np.unique(arr)          # unique values, without a leading 0
b = np.insert(arr, 0, [0])  # prepend a 0 so the shapes line up
c = np.unique(b)            # unique values, now including 0
plt.bar(c, d)               # plot counts against values
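As an aside, np.unique can return the counts directly, which avoids the bin-alignment steps above. A minimal alternative sketch:

import numpy as np
import matplotlib.pyplot as plt
arr = np.array([1,1,1,2,2,3,4,4,4,4,4,5])
values, counts = np.unique(arr, return_counts=True)  # no leading 0 bin to manage
plt.bar(values, counts)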

Using StringIO to Read Delimited Text Files into NumPy

In this tutorial, we’ll show you how to read delimited text data into a NumPy array using the StringIO package.

(This tutorial is part of our Pandas Guide. Use the right-hand menu to navigate.)

Data we used

We will read this crime data:

,crime$cluster,Murder,Assault,UrbanPop,Rape
Alabama,4,13.2,236,58,21.2
Alaska,4,10,263,48,44.5
Arizona,4,8.1,294,80,31
Arkansas,3,8.8,190,50,19.5
California,4,9,276,91,40.6
Colorado,3,7.9,204,78,38.7
Connecticut,2,3.3,110,77,11.1
Delaware,4,5.9,238,72,15.8
Florida,4,15.4,335,80,31.9

Parameters

In the code below, we download the data using urllib. Then we use np.genfromtxt to import it to the NumPy array. Note the following parameters:

  • delimiter="," is the delimiter between columns.
  • skip_header=1 skips the first line, since it holds column headers rather than data.
  • dtype=dtypes means use the tuples (name, dtype) to convert the data, using each name as the assigned NumPy dtype (data type).

If we don’t want to assign names we would use (dtype1, dtype2, …).

Note that we use the type float. Since NumPy is built using the C language, you can use any of the many C types, like 32-bit integers.

We use S12 for the string column, since plain str converts this data to empty strings. You could also use the Unicode type U12.

We also could have written np.string_ or np.unicode_, but those do not carry a length, so they mean a null-terminated byte rather than a fixed-width string, and the column would come back blank.

We could have used object as well.

Note that NumPy uses these names:

  • dtype=[('crime', 'S12'), ('cluster', '<f8'), ('Murder', '<f8'), ('Assault', '<f8'), ('UrbanPop', '<f8'), ('Rape', '<f8')]
  • The < sign refers to the byte order, which can be little-endian or big-endian.

usecols=(1,5) We did not use this parameter. If we had, NumPy would have read only columns 1 and 5 and skipped the rest, including the text column.
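Here is a sketch of what that would look like, assuming the data string and StringIO import from the code in the next section:

# Illustration only: read just the cluster (1) and Rape (5) columns
subset = np.genfromtxt(StringIO(data), delimiter=",", skip_header=1, usecols=(1, 5))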

The code explained

Here is the code:

import urllib.request
import numpy as np
from io import StringIO

url = "https://raw.githubusercontent.com/werowe/MLexamples/master/crime_data.csv"
file = urllib.request.urlopen(url)

data = ""
for d in file:
    data = data + d.decode('utf-8')

dtypes=[('crime', "S12"),
        ('cluster', float),
        ('Murder', float),
        ('Assault', float),
        ('UrbanPop', float),
        ('Rape', float)]

arr = np.genfromtxt(StringIO(data), delimiter=",", skip_header=1,
                    dtype=dtypes)

Results in:

array([(b'Alabama', 4., 13.2, 236., 58., 21.2),
(b'Alaska', 4., 10. , 263., 48., 44.5),

Note that NumPy returned a byte array for the string column. If we want a string, we can use Unicode:

dtypes=[('crime', 'U25'),
        ('cluster', '>f'),
        ('Murder', float),
        ('Assault', float),
        ('UrbanPop', float),
        ('Rape', float)]

Results in:

array([('Alabama', 4., 13.2, 236., 58., 21.2),
('Alaska', 4., 10. , 263., 48., 44.5),

If we leave off dtypes and let NumPy pick the data types, it assigns NaN (missing data) to the string column. It also uses float as the default for all numeric values.

arr=np.genfromtxt(StringIO(data), delimiter=",", skip_header=1)

Results in:

array([[  nan,   4. ,  13.2, 236. ,  58. ,  21.2],
[  nan,   4. ,  10. , 263. ,  48. ,  44.5],

Having assigned names to the columns, we can refer to them by name instead of index:

arr['Murder']
array([13.2, 10. ,  8.1,  8.8,  9. ,  7.9,  3.3,  5.9, 15.4, 17.4,  5.3,
2.6, 10.4,  7.2,  2.2,  6. ,  9.7, 15.4,  2.1, 11.3,  4.4, 12.1,
2.7, 16.1,  9. ,  6. ,  4.3, 12.2,  2.1,  7.4, 11.4, 11.1, 13. ,
0.8,  7.3,  6.6,  4.9,  6.3,  3.4, 14.4,  3.8, 13.2, 12.7,  3.2,
2.2,  8.5,  4. ,  5.7,  2.6,  6.8])

Missing values

We can tell NumPy which entries count as missing using missing_values, and what value to plug in for them, like -1, using filling_values. The default fill for floats is np.nan; for int it is -1. For example, here the Murder value for Arizona is blank:

Alaska,4,10,263,48,44.5
Arizona,4, ,294,80,31
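A sketch of how that might look with genfromtxt, assuming the data and dtypes from above; missing_values names the strings to treat as missing, and filling_values supplies the replacement:

arr = np.genfromtxt(StringIO(data), delimiter=",", skip_header=1,
                    dtype=dtypes, missing_values=" ", filling_values=-1)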

That concludes this tutorial.

NumPy Introduction with Examples

If we study Pandas, we have to study NumPy, because Pandas includes NumPy. Here, I’ll introduce NumPy and share some basic functions.

(This tutorial is part of our Pandas Guide. Use the right-hand menu to navigate.)

What is NumPy?

NumPy is a package that creates arrays. It lets you make arrays of numbers with different precision and scale, plus strings, so it is especially useful for scientific computing.

Python by itself only has floats, integers, and complex numbers. But NumPy expands what Python can do because it handles:

  • 32-bit numbers
  • 16-bit numbers
  • Signed numbers
  • Unsigned numbers
  • And more

But that's not the only reason to use NumPy. It's designed for efficiency and scale, making it the workhorse of large machine learning (ML) libraries like TensorFlow.


Now, let’s take a look at some basic functions of NumPy arrays.

Creating a NumPy array

Create an array with np.array(<array>).

Don't write np.array(1,2,3,4,5), as 1,2,3,4,5 is not an array; NumPy would interpret the items after the first comma as extra parameters to the array() function.

This creates an array:

import numpy as np
arr = np.array([1,2,3])
arr

Results:

array([1,2,3])

Array shape

An array has a shape, for example a 2×2 array, a 2×1 array, etc.

Query the shape like this:

arr.shape

The result, shown below, is best understood as a vector. It's not 3×1, because the array has only one dimension, and the blank after the comma is not a dimension.

(3,)

This, by contrast, is 3×1, since it is an array of 3 arrays, each of dimension 1×1.

arr = np.array([[1],[2],[3]])
arr.shape

Results:

(3, 1)

Reshaping an array

You can reshape an array of shape m×n into any combination whose dimensions multiply to m×n. This array of shape (6,) can be reshaped to 2×3, since 2*3=6.

import numpy as np
arr = np.array([1,2,3,4,5,6]).reshape(2,3)
print(arr)

Results:

[[1 2 3]

[4 5 6]]
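You can also pass -1 for one dimension, and NumPy will infer it from the array's size:

import numpy as np
arr = np.array([1,2,3,4,5,6]).reshape(2, -1)  # NumPy infers the 3
print(arr.shape)  # (2, 3)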

Arange

Notice that this function is not arrange but arange, as in array range. Use it to fill an array with numbers. (There are lots of ways to do that, a topic we will cover in a subsequent post.)

import numpy as np
arr = np.arange(5)
arr

Results in:

array([0, 1, 2, 3, 4])
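Like a slice, arange also accepts start, stop, and step arguments:

import numpy as np
arr = np.arange(2, 10, 2)  # start at 2, stop before 10, step by 2
arr

Results in:

array([2, 4, 6, 8])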

Slice

Slicing an array is a difficult topic that becomes easier with practice. Here are some simple examples.

Take this array.

arr = np.array([1,2,3,4,5,6]).reshape(2,3)
arr

Which looks like this:

array([[1, 2, 3],
[4, 5, 6]])

(While you could say this has 2 rows and 3 columns to make it easier to understand, that's not technically correct. When you have more than two dimensions, the concept of rows and columns goes away. That's why it's better to speak of dimensions and axes.)

This slice operation means start at the second position of the first axis and go to the end:

arr[1:]

Results in:

array([[4, 5, 6]])

This starts at the beginning and goes to the end:

arr[0:]

Results in:

array([[1, 2, 3],
[4, 5, 6]])

Add a comma to specify which column:

arr[:,1]

Results in:

array([2, 5])

Select along the other axis like this:

arr[1,:]

Results in:

array([4, 5, 6])

Select a single element.

arr[1,0]

Results:

4

Step

A slice can take a third number: the step. Here we take every second element, starting at index 1 and stopping before index 6:

arr = np.array([1,2,3,4,5,6])
arr[1:6:2]

Results in:

array([2, 4, 6])

That concludes this introduction.

Handling Missing Data in Pandas: NaN Values Explained

In applied data science, you will usually have missing data. For example, an industrial application with sensors will have sensor data that is missing on certain days.

You have a couple of alternatives to work with missing data. You can:

  • Drop the whole row
  • Fill the row-column combination with some value

It would not make sense to drop the column as that would throw away that metric for all rows. So, let’s look at how to handle these scenarios.

(This tutorial is part of our Pandas Guide. Use the right-hand menu to navigate.)

NaN means missing data

Missing data is labelled NaN.

Note that np.nan is not equal to Python None. Note also that np.nan is not even equal to np.nan, since np.nan basically means undefined.

Here we make a dataframe with 3 columns and 3 rows. The array np.arange(1,4) is copied into each row.

import pandas as pd
import numpy as np
df = pd.DataFrame([np.arange(1,4)],index=['a','b','c'],
columns=["X","Y","Z"]) 

Results:

Now reindex this dataframe, adding an index d. Since d has no value, it is filled with NaN.

df.reindex(index=['a','b','c','d'])

isna

Now use isna to check for missing values.

pd.isna(df)

notna

The opposite check—looking for actual values—is notna().

pd.notna(df)

NaT

NaT (not-a-time) means a missing date.

df['time'] = pd.Timestamp('20211225')
df.loc['d'] = np.nan

fillna

Here we can fill NaN values with the integer 1 using fillna(1). The date column is not changed since the integer 1 is not a date.

df=df.fillna(1)

To fix that, fill empty time values with:

df['time'].fillna(pd.Timestamp('20221225'))

dropna()

dropna() drops rows or columns that contain empty values. Another way to say that: it keeps only the rows or columns that are not empty.

Here we fill row c with NaN:

df = pd.DataFrame([np.arange(1,4)],index=['a','b','c'],
columns=["X","Y","Z"])
df.loc['c']=np.NaN

Then run dropna over the row axis (axis=0).

df.dropna()

You could also write:

df.dropna(axis=0)

Row c, which is all NaN, was dropped; the other rows were kept:

To drop the column:

df = pd.DataFrame([np.arange(1,4)],index=['a','b','c'],
columns=["X","Y","Z"])
df['V']=np.NaN

df.dropna(axis=1)
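dropna() also takes how and thresh parameters. A brief sketch using the dataframe above:

df.dropna(axis=1, how='all')  # drop column V only because every value in it is NaN
df.dropna(thresh=3)           # keep rows with at least 3 non-NaN values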

interpolate

Another feature of Pandas is that it can fill in missing values in a sensible way.

Consider a time series—let’s say you’re monitoring some machine and on certain days it fails to report. Below it reports on Christmas and every other day that week. Then we reindex the Pandas Series, creating gaps in our timeline.

import pandas as pd
import numpy as np

arr = np.array([1,2,3])
idx = np.array([pd.Timestamp('20211225'),
                pd.Timestamp('20211227'),
                pd.Timestamp('20211229')])
df = pd.DataFrame(arr, index=idx)

idx = [pd.Timestamp('20211225'),
       pd.Timestamp('20211226'),
       pd.Timestamp('20211227'),
       pd.Timestamp('20211228'),
       pd.Timestamp('20211229')]
df = df.reindex(index=idx)

We use the interpolate() function. Pandas fills the gaps using the midpoint between the surrounding points. If the data were curvilinear, interpolate() could instead fit a function; it supports other methods, such as polynomial interpolation.

df=df.interpolate()
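Since the index here is made of timestamps, you could also pass method='time' so that the fill accounts for unevenly spaced dates:

df = df.interpolate(method='time')  # weights the fill by the gaps between timestamps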

That concludes this tutorial.

Pandas Data Types

Regular Python does not have many data types. It has only strings, integers, floats, booleans, and complex numbers. There is no long or short. There are no 32- or 64-bit numbers. Luckily, for most situations, this doesn't matter. It only matters when you require absolute precision or want to use the minimum amount of memory to store a value.

This matters when you are working with very large Pandas arrays since Pandas, if you recall, is limited by memory size. So, you would not want to use 64 bits when you only need 8.

(This tutorial is part of our Pandas Guide. Use the right-hand menu to navigate.)

Python numeric precision

Size and precision are different. But it’s worth looking at precision in order to understand some of the limitations of Python. Precision means the number of decimal places.

For example, ⅓ in decimal format is the infinite series 0.3333333333333… The computer is not infinite, so at some point Python stops the decimal expansion and rounds that number off.

This behavior has some quirks.
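The classic illustration is that 0.1 + 0.2 is not exactly 0.3 in binary floating point:

print(0.1 + 0.2)         # 0.30000000000000004, not 0.3
print(0.1 + 0.2 == 0.3)  # False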

When does zero equal zero?

Look at this code. It calculates the square root of 2 (or any other number) using a 3,000-year-old technique. The algorithm quits when the difference between two numbers is 0.

a = 2
a1 = (a/2) + 1
b1 = a/a1
aminus1 = a1
bminus1 = b1
while (aminus1 - bminus1 != 0):
    an = 0.5 * (aminus1 + bminus1)
    bn = a / an
    aminus1 = an
    bminus1 = bn
    print(an, bn, an-bn)

produces:

1.5 1.3333333333333333 0.16666666666666674
1.4166666666666665 1.411764705882353 0.004901960784313486
1.4142156862745097 1.41421143847487 4.2477996395895445e-06
1.4142135623746899 1.4142135623715002 3.1896707497480747e-12
1.414213562373095 1.4142135623730951 -2.220446049250313e-16

But you can see, just above, that the difference never reaches exactly 0. Instead the computer rounds off, in this case at about 16 significant digits. You can see this: when it reaches the answer 1.414213562373095, the code keeps running, because the difference between the two values does not converge to exactly 0 despite repeatedly taking the arithmetic mean of the two values.

If you do want precise control over the number of decimal places, you can use the Python Decimal class. This keeps the code above from looping forever.

from decimal import *

getcontext().prec = 5
a = Decimal(2)
a1 = Decimal((a/2) + 1)
b1 = Decimal(a/a1)
aminus1 = a1
bminus1 = b1
i = 0
while (Decimal(aminus1 - bminus1) != Decimal(0)):
    an = Decimal(Decimal(0.5) * (aminus1 + bminus1))
    bn = Decimal(a / an)
    aminus1 = an
    bminus1 = bn
    i = i + 1
    print(i, an, bn, an-bn)

It loops three times and quits, settling on the value of 1.4142 as the square root of 2. We could add more decimal places to get more precision.

1 1.5 1.3333 0.1667
2 1.4166 1.4118 0.0048
3 1.4142 1.4142 0.0000

NumPy & Pandas numeric data types

NumPy goes much further than that. It provides a low-level interface to C numeric types (in other words, the numbers you could declare when writing code in the C language). Beyond that, it also provides data types suitable for:

  • Time series
  • Intervals
  • Dates
  • Categorical data
  • Etc.

Below is the complete list. On the left is the object name. On the right is the string alias.

Data Type          Alias
Int64Dtype, …      'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64'
IntervalDtype      'interval', 'Interval', 'Interval[<numpy_dtype>]', 'Interval[datetime64[ns, <tz>]]', 'Interval[timedelta64[<freq>]]'
DatetimeTZDtype    'datetime64[ns, <tz>]'
CategoricalDtype   'category'
PeriodDtype        'period[<freq>]', 'Period[<freq>]'
SparseDtype        'Sparse', 'Sparse[int]', 'Sparse[float]'
BooleanDtype       'boolean'
StringDtype        'string'
none of the above  'object'

Integers & Floats

Here, for example, we have a date, a float, a boolean, and an integer. We can let Pandas pick a scale for each numeric type, or we can give it explicitly. If you don't give it explicitly, Pandas either picks one or uses the generic object type.

import pandas as pd
import datetime

df = pd.DataFrame({
    "a": datetime.datetime(2020,12,14),
    "b": 1.000003,
    "c": True,
    "d": 3
}, index=[1,2,3,4])

Ask Pandas for the data types:

df.dtypes

You can see it chose 64 bits to store 1.000003 and 3. You only need 2 bits to store the number 3, but there is no option for 2-bit numbers. So, if space were a concern, we would use int8 and spend 8 bits.

a    datetime64[ns]
b           float64
c              bool
d             int64
dtype: object

Now, make a Pandas series of 4 integers and coerce it to an 8 bit number.

s=pd.Series([10,20,30,40],index=[1,2,3,4]).astype('int8')

Use dtypes to show the data types:

s.dtypes

Results in:

dtype('int8')

The string 'int8' is an alias. You can also assign the dtype using the corresponding Pandas object, pd.Int64Dtype.

t = pd.Int64Dtype
pd.Series([1,2,3,4], dtype=t)
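The space savings are easy to verify with nbytes. A quick sketch:

import pandas as pd
s = pd.Series([10, 20, 30, 40])
print(s.astype('int64').nbytes)  # 32 bytes: 4 values at 8 bytes each
print(s.astype('int8').nbytes)   # 4 bytes: 4 values at 1 byte each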

How To Group, Concatenate & Merge Data in Pandas

In this tutorial, we show how to group, concatenate, and merge Pandas DataFrames. (New to Pandas? Start with our Pandas introduction or create a Pandas dataframe from a dictionary.)

These operations are very much similar to SQL operations on a row and column database. Pandas, after all, is a row and column in-memory data structure. If you’re a SQL programmer, you’ll already be familiar with all of this. The only complexity here is that you can join by columns in addition to rows.

Pandas handles concatenation with the concat() function. But it's easier to understand these operations if you think of them as inner joins (intersection) and outer joins (union) of sets, which is how I refer to them below.

(This tutorial is part of our Pandas Guide. Use the right-hand menu to navigate.)

Concatenation (Outer join)

Think of concatenation like an outer join. The result is the same.

Suppose we have dataframes A and B with common elements among the indexes and columns.

Now concatenate. It’s not an append. (There is an append() function for that.) This concat() operation creates a superset of both sets a and b but combines the common rows. It’s not an inner join, either, since it lists all rows even those for which there is no common index.

Notice the missing values NaN. This is where there are no corresponding dataframe indexes in Dataframe B with the index in Dataframe A.

For example, index 3 is in both dataframes. So, Pandas copies the 4 columns from the first dataframe and the 4 columns from the second dataframe into the newly constructed dataframe. Similarly, index 5 is in Dataframe B but not Dataframe A, so Dataframe A's columns are marked as missing (NaN) for that row.

import pandas as pd

a = pd.DataFrame({'column1': ['A', 'C', 'D', 'E'],
                  'column2': ['F', 'G', 'H', 'I'],
                  'column3': ['J', 'K', 'L', 'M'],
                  'column4': ['N', 'O', 'P', 'Q']},
                 index=[1,2,3,4])

b = pd.DataFrame({'column3': ['R', 'S', 'T', 'U'],
                  'column5': ['V', 'W', 'X', 'Y'],
                  'column6': ['Z', 'α', 'β', 'υ'],
                  'column7': ['σ', 'χ', 'ι', 'κ']},
                 index=[3,4,5,6])

result = pd.concat([a, b], axis=1)

Results in:

Outer join

Here we do an outer join which, in terms of sets, means the union of two sets. All rows are added; an outer join does not reduce to the intersection of common indexes. Still, for rows that lack some of the combined columns, the value is NaN. That has to be the case, since not all columns exist for all rows: you have to list all of them, but mark some as empty.

result = pd.concat([a, b], join='outer')

Inner join along the 0 axis (Row)

We can do an inner join by index or column. An inner join finds the intersection of two sets.

Let’s join along the 0 axis (row). Only indices 3 and 4 are in both dataframes. So, an inner join takes all columns from only those two rows.

result = pd.concat([a, b], axis=1,join='inner')

 Inner join along the 1 axis (Column)

Column3 is the only column common to both dataframes. So, we concatenate all the rows from A with the rows from B and select only the common column, i.e., an inner join along the column axis.

result = pd.concat([a, b], axis=0,join='inner')

Merge

A merge is like an inner join, except we tell it which column to merge on.

Here, we make the first column name (i.e., the key value in the dictionary) the common name "key". Then we merge on that.

a = pd.DataFrame({'key': ['A', 'C', 'D', 'E'],
                  'column2': ['F', 'G', 'H', 'I'],
                  'column3': ['J', 'K', 'L', 'M'],
                  'column4': ['N', 'O', 'P', 'Q']},
                 index=[1,2,3,4])

b = pd.DataFrame({'key': ['C', 'D', 'T', 'U'],
                  'column5': ['V', 'W', 'X', 'Y'],
                  'column6': ['Z', 'α', 'β', 'υ'],
                  'column7': ['σ', 'χ', 'ι', 'κ']},
                 index=[3,4,5,6])

result = pd.merge(a, b, on='key')

The resulting dataframe contains only the rows whose key value appears in both inputs. You can see that only C and D are common to both dataframes.
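merge() also supports SQL-style variants through its how parameter. A sketch that keeps every row of a:

result = pd.merge(a, b, on='key', how='left')  # unmatched rows of a get NaN in b's columns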

Append

This simply appends one dataframe onto another. Here each dataframe has a single column, much like a Series. The result, Dataframe C, is all the rows of Dataframe B added after the rows of Dataframe A.

import pandas as pd

a = pd.DataFrame([1,2,3])
b = pd.DataFrame([4,5,6])
c = a.append(b)
c

GroupBy

Here is another operation familiar to SQL programmers.

The GroupBy operation works on a single dataframe. We group the values by the column named key and sum them.

a=pd.DataFrame({"key": ["a", "b", "c","a", "b", "c"] ,

"values": [1,2,3,1,2,3]})

b=a.groupby("key").sum()

b
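You are not limited to sum. The agg() method applies several aggregates at once; a quick sketch:

a.groupby("key")["values"].agg(['sum', 'mean', 'count'])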

How To Create a Pandas Dataframe from a Dictionary

Here is yet another example of how useful and powerful Pandas is. Pandas can create dataframes from many kinds of data structures—without you having to write lots of lengthy code. One of those data structures is a dictionary.

In this tutorial, we show you two approaches to doing that.

(This tutorial is part of our Pandas Guide. Use the right-hand menu to navigate.)

A word on Pandas versions

Before you start, upgrade Python to at least 3.7. With Python 3.4, the highest version of Pandas available is 0.22, which does not support specifying column names when creating a dataframe from a dictionary in all cases.

If you are running virtualenv, create a new Python environment and install Pandas like this:

virtualenv py37  --python=python3.7
pip install pandas

You can check the Pandas version with:

import pandas as pd
pd.__version__

Create dataframe with Pandas DataFrame constructor

Here we construct a Pandas dataframe from a dictionary. We use the Pandas constructor, since it can handle different types of data structures.

The dictionary below has two keys, scene and facade. Each value has an array of four elements, so it naturally fits into what you can think of as a table with 2 columns and 4 rows.

Pandas is designed to work with row and column data. Each row has a row index. By default, it is the numbers 0, 1, 2, 3, … But it also lets you use names.

So, let's use names here, given in the array idx.

import pandas as pd

dict = {'scene': ["foul", "murder", "drunken", "intrigue"],
        'facade': ["fair", "beaten", "fat", "elf"]}

idx = ['hamlet', 'lear', 'falstaff', 'puck']

dp = pd.DataFrame(dict, index=idx)

Here is the resulting dataframe:

Create dataframe with Pandas from_dict() Method

Pandas also has a pandas.DataFrame.from_dict() method. If that sounds redundant, since the regular constructor already works with dictionaries, the example below shows that from_dict() supports parameters specific to dictionaries.

In the code, the keys of the dictionary become columns. The row indexes are numbers. That is the default orientation, orient='columns', meaning take the dictionary keys as columns and put the values in rows.

pd.DataFrame.from_dict(dict)


Now we flip that on its side.  We will make the rows the dictionary keys.

pd.DataFrame.from_dict(dict,orient='index')

Notice that the columns have no names, only numbers. That’s not very useful, so below we use the columns parameter, which was introduced in Pandas 0.23.

It’s as simple as putting the column names in an array and passing it as the columns parameter. One wonders why the earlier versions of Pandas did not have that.

pd.DataFrame.from_dict(dict,orient='index',columns=idx)

        hamlet    lear falstaff      puck
scene     foul  murder  drunken  intrigue
facade    fair  beaten      fat       elf

Pandas: How To Read CSV & JSON Files

Here we show how to load CSV files and JSON files into a Pandas dataframe using Pandas. (Brand new to Pandas? Get the basics in our Pandas introduction.)

This illustrates, yet again, why Pandas is so powerful. It does all the heavy lifting of downloading a file from the internet, opening it, looping through it, parsing it, and converting it to a dataframe. And it does it in a single line of code.

You will need to pip install pandas if you don't already have it.

(This tutorial is part of our Pandas Guide. Use the right-hand menu to navigate.)

How to read a CSV file with Python Pandas

Pandas can open a URL directly. That means you don’t need to download a file to read it. Below we read a .csv file:

import pandas as pd

url = 'https://raw.githubusercontent.com/werowe/logisticRegressionBestModel/master/KidCreative.csv'

df = pd.read_csv(url, delimiter=',')

Then look at the top of it:

df.head()

The results look like this. As you can see, it parsed the file by the delimiter and added the column names from the first row in the .csv file.

How to read a JSON file with Pandas

JSON is slightly more complicated, as the JSON is deeply nested. Pandas does not automatically unwind that for you.

Here we follow the same procedure as above, except we use pd.read_json() instead of pd.read_csv().

Notice that in this example we pass the parameter lines=True because the file is in JSON Lines format. That means it's not a single valid JSON document; rather, it's a file with multiple JSON records, one per line.

import pandas as pd

url = 'https://raw.githubusercontent.com/werowe/logisticRegressionBestModel/master/ct1.json'

dfct=pd.read_json(url,lines=True)

Now look at the dataframe:

dfct.head()

Results in:

Notice that Pandas did not unwind the location JSON object.  The input JSON looks like this:

{
    "state": "CT",
    "postcode": "06037",
    "street": "Parish Dr",
    "district": "",
    "unit": "",
    "location": {
        "type": "Point",
        "coordinates": [-72.7738706, 41.6332836]
    },
    "region": "Hartford",
    "number": "51",
    "city": "Berlin"
}

So, we need an additional step. We turn the elements in location into a list and then construct a DataFrame from that:

pd.DataFrame(list(dfct['location']))

Results in a new dataframe with coordinate and type:
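Recent versions of Pandas also provide pd.json_normalize, which flattens nested objects into dotted column names. A sketch, assuming the same records:

import pandas as pd
flat = pd.json_normalize(dfct.to_dict('records'))  # location becomes location.type and location.coordinates
flat.head()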

Pandas Introduction & Tutorials for Beginners

This article introduces you to Pandas, a data analysis library of tools that’s built upon Python. We will:

  • Look briefly at the tool
  • Show you how to perform basic operations

(This tutorial is part of our Pandas Guide. Use the right-hand menu to navigate.)

What is pandas?

A Python Pandas dataframe is more than an array data structure. Pandas is a powerful tool that lets you:

  • Convert JSON, CSV, array, dictionaries, and other data to row and column format
  • Work with them using names instead of indexes (you can still opt for indexes)

In short, Pandas is sort of like a spreadsheet, but one you work with using code, not Microsoft Excel. The biggest benefits:

  • Pandas makes extremely complicated data transformations easy and natural.
  • It includes a wealth of math, analytics, and other functions.

How pandas works

Pandas is built on top of NumPy and Matplotlib. So, Pandas can:

  • Efficiently work with large n-dimensional arrays (NumPy)
  • Take slices and transpose those into different shapes (NumPy)
  • Draw charts (Matplotlib)

NumPy is the workhorse for most Python machine learning software development kits (SDKs). Since Pandas extends NumPy, it also supports machine learning operations.

Basic pandas operations

Now, let’s transition into an easy tutorial that shows you the Pandas basics.

Create a dataframe from an array

First create a dataframe from an array.

This is a 2×2 array (meaning its shape is 2×2). That’s two rows and two columns. The column names array must have two elements. Here, we put student and grade.

import pandas as pd

df = pd.DataFrame([["Fred",80],["Jill",90]],columns=["student", "grade"])

Then type df to see it. In a Jupyter Notebook, the display is formatted. (Below, we create a chart so you will need to use Jupyter, since Jupyter supports graphics.)

df


The dataframe index is just the row count, 0 and 1. It would be more natural to use the student name as the index. Use set_index to do that.

Normally Pandas dataframe operations create a new dataframe. But we can use inplace=True in some operations to update the existing dataframe without having to make a new one.

df.set_index("student",inplace=True)

Now it looks like this:

Add a column to a Pandas dataframe

Let's add a column to the Pandas dataframe. This is an operation you can do in place. It expects two values, since we have two rows. We just assign to dataframe['new column name'] to add the new column; Pandas inserts it into the existing dataframe.

df['birthdate']=['1970-01-12', '1972-05-12']

Filter dataframe by column value

Here we select all students born on 1970-01-12:

df[df['birthdate']=='1970-01-12']

Produces:

Pandas Series: Select 1 column from dataframe

Here we select one column. This is not called a dataframe, but a series. It’s basically a dataframe of one column. But it’s a different type of object, so it has slightly different methods.

grade=df['grade']

Notice that the index is still the student name. Pandas tells us that grade is of type int64, a 64-bit integer. This is because it uses NumPy, which supports many numeric types. Regular Python only supports integers and floats, so NumPy emulates the rest, just as the Python decimal object emulates decimal numbers.

student
Fred    80
Jill    90
Name: grade, dtype: int64

Add rows to a pandas dataframe

Let’s add some more students.

Here we create a new dataframe and append it to the existing one, creating a new dataframe, df3. This time, in df2, we give Pandas the student names as index values directly instead of setting them with set_index, as we did above.

df2 = pd.DataFrame([[70,'1980-11-12'],[97, '1984-11-01']],index=["Costas", "Ilya"], columns=["grade", "birthdate"])

df3=df.append(df2)

Now we have some more students:

Select Pandas dataframe rows by index position

Here we select the first two rows using iloc, which selects by index offset.

df3.iloc[0:2]

Produces:

Pandas map function & scatter chart

Just to illustrate what else Pandas can do, let’s make a scatter chart. We will plot age by grade.

First we need to convert the birthdate to a number. We will make it a NumPy field of type datetime64 using:

bday=pd.to_datetime(df3['birthdate'])

bday is a series.

Then let’s calculate today’s date:

from datetime import datetime 
import numpy as np

today = datetime.now()

Then we show how to use the map function. That runs over every row in the dataframe or series.

Someone's age is today's date minus their birthdate. That subtraction gives us a timedelta object, so we divide it by one year, np.timedelta64(365, 'D'), to get a very close estimate of their age. (Not all years have 365 days.) If we did not do that, the age would be a timedelta object rather than a single integer value.

bday.map(lambda l: int((today-l)/np.timedelta64(365, 'D')))

df3['age']=bday.map(lambda l: int((today-l)/np.timedelta64(365, 'D')))

Now it looks like this:

Now we illustrate how Pandas includes Matplotlib by plotting grade versus age. We tell it what column to use for the x and y axis as well as the color for the dots.

df3[['grade','age']].plot.scatter(x='grade',
                      y='age',
                      c='DarkBlue')

Here is the chart:

Show correlation between columns

Just to illustrate one more feature, let’s see if age is correlated with grade. Of course, it’s not, but let’s just show that Pandas has this and many other advanced capabilities.

df3[['grade','age']].corr()

So, you can see that obviously grade is perfectly correlated (1.0) with itself but not at all with age (< 0):
