
Using StringIO to Read Delimited Text Files into NumPy

Walker Rowe

In this tutorial, we’ll show you how to read delimited text data into a NumPy array using the StringIO class from Python’s io module.

(This tutorial is part of our Pandas Guide.)

Data we used

We will read this crime data:

,crime$cluster,Murder,Assault,UrbanPop,Rape
Alabama,4,13.2,236,58,21.2
Alaska,4,10,263,48,44.5
Arizona,4,8.1,294,80,31
Arkansas,3,8.8,190,50,19.5
California,4,9,276,91,40.6
Colorado,3,7.9,204,78,38.7
Connecticut,2,3.3,110,77,11.1
Delaware,4,5.9,238,72,15.8
Florida,4,15.4,335,80,31.9
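To experiment without a network connection, the first rows above can be wrapped in a StringIO object, which makes an in-memory string behave like an open text file. This is only a small sketch; the names sample, buffer, header and rows are our own and do not appear in the tutorial code.

```python
from io import StringIO

# First rows of the crime data, copied from the listing above.
sample = (
    ",crime$cluster,Murder,Assault,UrbanPop,Rape\n"
    "Alabama,4,13.2,236,58,21.2\n"
    "Alaska,4,10,263,48,44.5\n"
)

# StringIO supports the usual file methods (read, readline, iteration).
buffer = StringIO(sample)
header = buffer.readline().rstrip("\n")
rows = [line.rstrip("\n").split(",") for line in buffer]

print(header)
print(rows[0])
```

Because np.genfromtxt accepts any file-like object, the same buffer technique works for the full download later on.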

Parameters

In the code below, we download the data using urllib.request, wrap the text in StringIO, and pass it to np.genfromtxt to load it into a NumPy array. Note the following parameters:

delimiter="," The delimiter between columns.
skip_header=1 Skips the first line, which holds column headers rather than data.
dtype=dtypes A list of (name, dtype) tuples. Each column is converted to the given NumPy dtype (data type) and can later be accessed by its name.

If we don’t want to assign names we would use (dtype1, dtype2, …).
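As a rough illustration of the unnamed form, the sketch below passes a plain tuple of types. The three-column sample string is a shortened stand-in for the real file, and the variable name unnamed is our own. When no names are given, genfromtxt generates field names of the form f0, f1, and so on.

```python
from io import StringIO
import numpy as np

# Shortened stand-in for the crime data: state, cluster, Murder.
sample = "Alabama,4,13.2\nAlaska,4,10\n"

# A plain sequence of types: genfromtxt generates the field names itself.
unnamed = np.genfromtxt(StringIO(sample), delimiter=",",
                        dtype=("U12", float, float))

print(unnamed.dtype.names)
print(unnamed["f0"])
```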

Note that we use the Python type float, which NumPy maps to a 64-bit float. NumPy also provides its own fixed-width types, such as np.int32 for 32-bit integers or np.float32 for 32-bit floats.
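A minimal sketch of using narrower types follows. The choice of np.int32 and np.float32 for these columns is hypothetical (the tutorial itself uses float throughout), and the sample string is a shortened stand-in for the real file.

```python
from io import StringIO
import numpy as np

# Shortened stand-in for the crime data: state, cluster, Murder, Assault.
sample = "Alabama,4,13.2,236\nAlaska,4,10,263\n"

# Hypothetical choice: 32-bit integers for the counts, a 32-bit float for the rate.
compact = [("crime", "U12"),
           ("cluster", np.int32),
           ("Murder", np.float32),
           ("Assault", np.int32)]
arr32 = np.genfromtxt(StringIO(sample), delimiter=",", dtype=compact)

print(arr32.dtype)
print(arr32["Assault"])
```

Narrower types use less memory per row, at the cost of range and precision.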

We use S12, a byte string of up to 12 bytes, instead of plain str. Plain str carries no length, and in our run it converted this column to empty strings. You could also use the Unicode type U12.

We also could have written np.string_ and np.unicode_ (NumPy 2.0 replaces these with np.bytes_ and np.str_), but those give no length either, so the column would again come back empty.

We could have used object as well.

Note that NumPy reports the resulting types using its own type codes:

·        dtype=[('crime', 'S12'), ('cluster', '<f8'), ('Murder', '<f8'), ('Assault', '<f8'), ('UrbanPop', '<f8'), ('Rape', '<f8')]

·        f8 is an 8-byte (64-bit) float. The sign in front gives the byte order: < means little-endian and > means big-endian.
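To see what the byte-order prefix changes, here is a small sketch comparing the two layouts. The variable names are ours; the value 13.2 is taken from the Alabama row.

```python
import numpy as np

little = np.dtype("<f8")  # 8-byte float, least significant byte first
big = np.dtype(">f8")     # 8-byte float, most significant byte first

# Both arrays hold the same value; only the byte layout in memory differs.
a = np.array([13.2], dtype=little)
b = np.array([13.2], dtype=big)

print(little.itemsize, big.itemsize)
print(a[0] == b[0])
print(a.tobytes() == b.tobytes()[::-1])
```

The byte order only matters when exchanging raw binary data between machines; for text files like this one, either prefix gives the same numbers.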

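usecols=(1,5) We did not use this parameter. Column numbers start at 0, so usecols=(1,5) would read only columns 1 and 5, that is cluster and Rape. To skip just the first column, we would write usecols=(1,2,3,4,5).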
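As a quick check of how usecols selects columns, the sketch below reads a three-line sample of the data twice. The variable names two_cols and numeric are our own.

```python
from io import StringIO
import numpy as np

sample = (
    ",crime$cluster,Murder,Assault,UrbanPop,Rape\n"
    "Alabama,4,13.2,236,58,21.2\n"
    "Alaska,4,10,263,48,44.5\n"
)

# Column indexes start at 0, so (1, 5) selects cluster and Rape only.
two_cols = np.genfromtxt(StringIO(sample), delimiter=",",
                         skip_header=1, usecols=(1, 5))

# Listing columns 1 through 5 drops just the state name in column 0.
numeric = np.genfromtxt(StringIO(sample), delimiter=",",
                        skip_header=1, usecols=(1, 2, 3, 4, 5))

print(two_cols.shape)
print(numeric.shape)
```

Since the remaining columns are all numeric, no dtype is needed here and the result is an ordinary two-dimensional float array.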

The code explained

Here is the code:

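import urllib.request
from io import StringIO

import numpy as np

url = "https://raw.githubusercontent.com/werowe/MLexamples/master/crime_data.csv"

# Download the file and decode it into one text string.
with urllib.request.urlopen(url) as file:
    data = file.read().decode('utf-8')

# (name, dtype) pairs, one per column.
# S12 keeps at most 12 bytes, so longer state names would be truncated.
dtypes = [('crime', "S12"),
          ('cluster', float),
          ('Murder', float),
          ('Assault', float),
          ('UrbanPop', float),
          ('Rape', float)]

# StringIO lets genfromtxt read the string as if it were a file.
arr = np.genfromtxt(StringIO(data), delimiter=",", skip_header=1,
                    dtype=dtypes)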

Results in:

array([(b'Alabama', 4., 13.2, 236., 58., 21.2),
(b'Alaska', 4., 10. , 263., 48., 44.5),

Note that NumPy returned byte strings (shown with the b prefix) for the string column. If we want regular strings, we can use a Unicode type:

dtypes = [('crime', 'U25'),
          ('cluster', '>f'),   # '>f' is a big-endian 32-bit float
          ('Murder', float),
          ('Assault', float),
          ('UrbanPop', float),
          ('Rape', float)]

Results in:

array([('Alabama', 4., 13.2, 236., 58., 21.2),
('Alaska', 4., 10. , 263., 48., 44.5),
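If the array has already been loaded with a byte-string column, the column can also be converted afterwards. The following is a sketch under the assumption that the names contain only ASCII characters; the sample string and the variable name names are ours.

```python
from io import StringIO
import numpy as np

# Shortened stand-in for the crime data: state, cluster, Murder.
sample = "Alabama,4,13.2\nAlaska,4,10\n"
dtypes = [("crime", "S12"), ("cluster", float), ("Murder", float)]
arr = np.genfromtxt(StringIO(sample), delimiter=",", dtype=dtypes)

# astype("U12") decodes the bytes column; this works for ASCII-only names.
names = arr["crime"].astype("U12")

print(arr["crime"])
print(names)
```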

If we leave off dtype and let NumPy pick, it treats every column as float, which is the default. The string column cannot be converted to a number, so NumPy assigns NaN (missing data) to it.

arr=np.genfromtxt(StringIO(data), delimiter=",", skip_header=1)

Results in:

array([[  nan,   4. ,  13.2, 236. ,  58. ,  21.2],
[  nan,   4. ,  10. , 263. ,  48. ,  44.5],
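A possible alternative, not used in this tutorial, is to pass dtype=None so that genfromtxt infers a type for each column separately instead of forcing everything to float. The sketch below assumes we supply the field names ourselves through the names parameter; the variable name inferred is ours.

```python
from io import StringIO
import numpy as np

sample = (
    ",crime$cluster,Murder,Assault,UrbanPop,Rape\n"
    "Alabama,4,13.2,236,58,21.2\n"
    "Alaska,4,10,263,48,44.5\n"
)

# dtype=None asks genfromtxt to infer a type for each column separately.
# names supplies the field names; encoding="utf-8" yields str rather than bytes.
inferred = np.genfromtxt(
    StringIO(sample),
    delimiter=",",
    skip_header=1,
    dtype=None,
    encoding="utf-8",
    names=["crime", "cluster", "Murder", "Assault", "UrbanPop", "Rape"],
)

print(inferred.dtype)
print(inferred["crime"])
```

With inference, whole-number columns such as cluster come back as integers, so an explicit dtype list is still the better choice when every numeric column should be a float.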

Having assigned names to the columns, we can refer to them by name instead of by index. Using the array we created with dtypes:

arr['Murder']
array([13.2, 10. ,  8.1,  8.8,  9. ,  7.9,  3.3,  5.9, 15.4, 17.4,  5.3,
2.6, 10.4,  7.2,  2.2,  6. ,  9.7, 15.4,  2.1, 11.3,  4.4, 12.1,
2.7, 16.1,  9. ,  6. ,  4.3, 12.2,  2.1,  7.4, 11.4, 11.1, 13. ,
0.8,  7.3,  6.6,  4.9,  6.3,  3.4, 14.4,  3.8, 13.2, 12.7,  3.2,
2.2,  8.5,  4. ,  5.7,  2.6,  6.8])
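Named fields also combine with ordinary NumPy operations. As a sketch on three rows of the data (the threshold of 10 and the variable name high are arbitrary choices of ours), a boolean mask built from one field selects whole rows:

```python
from io import StringIO
import numpy as np

sample = (
    "Alabama,4,13.2,236,58,21.2\n"
    "Alaska,4,10,263,48,44.5\n"
    "Arizona,4,8.1,294,80,31\n"
)
dtypes = [("crime", "U25"), ("cluster", float), ("Murder", float),
          ("Assault", float), ("UrbanPop", float), ("Rape", float)]
arr = np.genfromtxt(StringIO(sample), delimiter=",", dtype=dtypes)

# A boolean mask built from one field selects whole rows.
high = arr[arr["Murder"] >= 10]

print(high["crime"])
print(arr["Assault"].mean())
```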

Missing values

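We can tell NumPy which strings mark a missing value using missing_values, and which value to plug in for them, like -1, using filling_values. By default, a missing float becomes np.nan and a missing int becomes -1. In the rows below, the Murder value is blank in the Arizona row: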

Alaska,4,10,263,48,44.5
Arizona,4, ,294,80,31
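The sketch below shows the two parameters working together. The rows are taken from the crime data, but the gaps are our own modification for illustration: the Murder value is left empty for Arizona and replaced by a hypothetical "NA" marker for Arkansas.

```python
from io import StringIO
import numpy as np

# Illustrative rows: Murder is left empty for Arizona and marked "NA" for Arkansas.
sample = (
    "Alaska,4,10,263,48,44.5\n"
    "Arizona,4,,294,80,31\n"
    "Arkansas,3,NA,190,50,19.5\n"
)

filled = np.genfromtxt(
    StringIO(sample),
    delimiter=",",
    usecols=(1, 2, 3, 4, 5),   # numeric columns only
    missing_values="NA",       # extra marker to treat as missing
    filling_values=-1,         # replacement written into the array
)

print(filled[:, 1])
```

Without filling_values, the same two entries would come back as np.nan because the columns are read as floats.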

That concludes this tutorial.


About the author

Walker Rowe

Walker Rowe is an American freelance tech writer and programmer living in Cyprus. He writes tutorials on analytics and big data and specializes in documenting SDKs and APIs. He is the founder of the Hypatia Academy Cyprus, an online school to teach secondary school children programming.