Machine Learning & Big Data Blog

Pandas: How To Read CSV & JSON Files

How to use mongodump in MongoDB
2 minute read
Walker Rowe

Here we show how to load CSV files and JSON files into a Pandas dataframe using Pandas. (Brand new to Pandas? Get the basics in our Pandas introduction.)

This illustrates, yet again, why Pandas is so powerful. It does all the heavy lifting of downloading a file from the internet, opening it, looping through it, parsing it, and converting it to a dataframe. And it does it in a single line of code.

The Jupyter notebook for this code is here. You will need to pip install pandas if you don’t already have that.

(This tutorial is part of our Pandas Guide. Use the right-hand menu to navigate.)

How to read a CSV file with Python Pandas

Pandas can open a URL directly. That means you don’t need to download a file to read it. Below we read a .csv file:

import pandas as pd
url = 'https://raw.githubusercontent.com/werowe/logisticRegressionBestModel/master/KidCreative.csv'
df = pd.read_csv(url, delimiter=',')

Then look at the top of it:

df.head()

The results look like this. As you can see, it parsed the file by the delimiter and added the column names from the first row in the .csv file.

How to read a JSON file with Pandas

JSON is slightly more complicated, as the JSON is deeply nested. Pandas does not automatically unwind that for you.

Here we follow the same procedure as above, except we use pd.read_json() instead of pd.read_csv().

Notice that in this example we put the parameter lines=True because the file is in JSONP format. That means it’s not a valid JSON file. Rather it is a file with multiple JSON records, one right after the other.

import pandas as pd
url = 'https://raw.githubusercontent.com/werowe/logisticRegressionBestModel/master/ct1.json'
dfct=pd.read_json(url,lines=True)

Now look at the dataframe:

dfct.head()

Results in:

Notice that Pandas did not unwind the location JSON object. The input JSON looks like this:

{
            "state": "CT",
            "postcode": "06037",
            "street": "Parish Dr",
            "district": "",
            "unit": "",
            "location": {
                        "type": "Point",
                        "coordinates": [-72.7738706, 41.6332836]
            },
            "region": "Hartford",
            "number": "51",
            "city": "Berlin"
}

So, we need an additional step. We turn the elements in location into list and then construct a DataFrame from that

pd.DataFrame(list(dfct['location']))

Results in a new dataframe with coordinate and type:

Related reading

Learn ML with our free downloadable guide

This e-book teaches machine learning in the simplest way possible. This book is for managers, programmers, directors – and anyone else who wants to learn machine learning. We start with very basic stats and algebra and build upon that.


These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

See an error or have a suggestion? Please let us know by emailing [email protected].

About Us

As BMC and BMC Helix, we are committed to a shared purpose for customers in every industry and around the globe. BMC empowers 86% of the Forbes Global 50 to accelerate business value faster than humanly possible by automating critical applications, systems, and services to take advantage of cloud, data, and emerging AI technologies. BMC Helix, now operating as an independent company, helps the world’s most forward-thinking IT organizations turn AI into action—unlocking human potential to multiply productivity so teams can focus on the work that matters most.
Learn more about BMC and BMC Helix ›

About the author

Walker Rowe

Walker Rowe is an American freelancer tech writer and programmer living in Cyprus. He writes tutorials on analytics and big data and specializes in documenting SDKs and APIs. He is the founder of the Hypatia Academy Cyprus, an online school to teach secondary school children programming. You can find Walker here and here.